1 Introduction

Ever-increasing performance and reliability requirements on electronic systems are among the key factors that drive the evolution of integrated circuit technology. Multiprocessing is a promising solution to meet the requirements of near-future applications. To get full benefit from parallel processing, a multiprocessor needs an efficient on-chip communication architecture [7]. Network-on-Chip (NoC) is a general-purpose on-chip communication concept that offers high throughput and keeps complexity in check by regular composition of basic building blocks. A typical NoC based system consists of processing elements (PEs), network interfaces (NIs), routers and channels (links). Each router contains a switch, buffers and routing logic as shown in Fig. 1. All links can be used simultaneously for data transmission, which provides a high level of parallelism. NoCs provide better scalability than on-chip buses because as more resources are introduced to a system, additional routers and links are introduced to connect them to the network. These additional links and routers provide the communication capacity needed for the new resources. Thus, the NoC concept is an attractive alternative to conventional communication architectures such as shared buses or point-to-point dedicated links.

Fig. 1

Typical virtual channel router architecture [22]

It has been demonstrated that buffers consume the largest fraction of dynamic and leakage power of the NoC node (router + link) [3,4]. Packet buffering consumes far more power than packet transmission [37]. Thus, increasing the utilization of buffers while reducing their number and size lowers both area and power consumption. Wormhole switching [35] has been proposed to reduce buffer requirements and to enhance system throughput. However, one packet may occupy several intermediate switches at the same time. To avoid the resulting potential deadlocks, virtual channels (VCs) are introduced. A typical VC architecture [22] for the input port of a router is shown in Fig. 2. Virtual channel flow control employs an array of buffers at each input port. By allocating each packet to an individual buffer, flits from multiple packets may be sent in an interleaved manner over a single physical channel. This improves the throughput and reduces the average packet latency by allowing blocked packets to be bypassed.

Fig. 2

Typical virtual channel input port architecture

A well designed network exploits available resources to improve performance while incurring minimum overhead [2]. Buffer utilization can be enhanced by sharing buffers among ports. Router architectures with full buffer sharing can deliver high throughput, but this comes at the expense of area and power consumption. Hence, it is necessary to devise a technique that enables a judicious tradeoff between performance, power and area.

In nanometer technologies, devices are exposed to a large number of noise sources such as capacitive and inductive crosstalk, power supply noise, leakage noise, thermal noise, process variations, charge sharing, and soft errors. This reduces the reliability of the manufactured devices [18,28]. There are a number of fault tolerant solutions which deal with reliability at different abstraction levels, for example routing algorithms [21,38], architectures [24,36], and error control coding schemes [11,32]. Some of the fault tolerant NoC architectures proposed use intelligent routing algorithms [8,9]. The major drawback of this approach is that the fault-free resources which are interconnected with the faulty resource cannot be used. This leads to a reduction in system performance. For instance, if there is a link failure in a VC based NoC, the VC buffers connected to the failed link cannot be used. To reduce the effect of faults on system performance, such unused resources should be utilized by the system. A well designed network exploits all available resources to sustain performance [2].

In this paper, we propose a novel architecture for NoC routers based on sharing the VC buffers. With minimal overhead, it enhances the utilization of resources, especially in the presence of faults, and thereby retains the required performance. This is achieved by reconfiguring the usage of resources that would otherwise become inaccessible because of faults.

The rest of the paper is organized as follows. Section 2 gives an overview of the related work. Section 3 presents our analysis of link load in NoC based systems based on both synthetic and real benchmarks. In Section 4, fault scenarios for NoC routers are discussed and the impact of faults on performance and utilization of other resources is investigated. The proposed partial virtual channel sharing (PVS) architecture is presented in Section 5. Section 6 explains how this architecture sustains system performance under faults. Finally, experimental results are presented and conclusions are drawn.

2 Related Work

As the presented work addresses resource management to improve performance in the presence of faults, the discussion of related work focuses on resource management architectures and fault tolerant techniques.

2.1 Buffer Management Techniques for Bandwidth Utilization

There have been many efforts to enhance resource utilization in NoC based systems. Lan et al. [16] address buffer utilization by making the channels bidirectional and show significant improvement in system performance. In this case, each channel controller has two additional tasks: dynamically configuring the channel direction and allocating the channel to one of the routers. These additional tasks make the controller circuit complex. There is a 40 % area overhead over the typical NoC router architecture due to double crossbar design and control logic. This also causes additional power consumption.

Nicopoulos et al. [27] present a Dynamic Virtual Channel Regulator (ViChaR) for NoC routers. The authors improve buffer utilization by using the unified buffered structure (UBS) instead of individual, statically partitioned FIFO buffers. UBS provides each router port with a variable number of VCs, depending on the traffic load. The architecture achieves around 25 % improvement in system performance at a small cost in power consumption. However, the architecture enhances buffer utilization only when a port is under heavy traffic load. If there is no load, the buffer resources cannot be utilized by neighboring overloaded ports.

Ramanujam et al. [31] introduce a distributed shared buffer (DSB) NoC router architecture. The proposed architecture shows a significant improvement in throughput at the expense of area and power consumption due to its extra crossbar and complex arbitration scheme. The authors do not address fault tolerance. If the centralized control logic is affected by a fault, all resources of the router become useless.

2.2 Fault Tolerance Techniques

Faults can be categorized as permanent, intermittent, and transient [6]. Different techniques are required to deal with different kinds of faults. Neishabouri et al. [25] propose the Enhanced Reliability Aware Virtual Channel (ERAVC) architecture for NoC. ERAVC enables dynamic VC allocation and reliability aware sharing among input channels. More memory is allocated to the busy channels and less to the idle channels. In addition, ERAVC uses fault-tolerant flow control which allows packet retransmission without requiring extra buffers. ERAVC shows significant reduction in Average Packet Latency (APL) for normal system operation at the expense of complex memory control logic. If a router node is marked faulty, the approach balances the traffic load well. However, the approach cannot utilize the intact resources of a partially-faulty router.

Fick et al. [10] devise a strategy to utilize the inherent redundancy at network and router level to maintain correct operation. In the proposed Vicis architecture, the network layer is reconfigured by swapping ports so that defective ports come together as pairs on the same link, thereby increasing the number of usable links with two intact ports. Router level reconfiguration is used to tolerate internal faults of a router which are not visible at the network level. A crossbar bypass bus is used to tolerate crossbar failure. Error correction coding (ECC) is used to protect data path elements. Each router uses built-in self-test (BIST) to diagnose the exact locations of hard faults, intended for better utilization of the ECC, port swapper and crossbar bypass bus. To minimize the overhead, the port swappers do not need to be fully connected, i.e., not every port needs to be connected to every physical link. The link to the local network adapter is able to connect to three different input ports and other links are able to connect to two input ports. For swapping to remain effective, the pair of failed input and output ports must belong to the same swapping group.

Concatto et al. [5] present a highly reconfigurable fault tolerant NoC router architecture. The architecture can dynamically stop using a faulty flit buffer unit and instead borrow the flit buffer units from the neighboring channels to sustain performance. However, the fine-granular bypassing and borrowing of flit buffer units makes the control logic very complex. Thus, the solution is not area and power efficient.

In typical NoC architectures, a fault in a router or in a network interface (NI) results in an unconnected resource. Lehtonen et al. [19] achieve fault tolerance in such situations by introducing multiple-NI architectures. This approach improves the system fault tolerance at the topology level. The throughput can also be enhanced by utilizing multiple routes and reducing the number of communication hops. However, the solution has significant area overhead, and it is not power efficient for synchronous systems unless power gating is introduced for all NIs. Zonouz et al. [39] propose a dual connected mesh structure (DCS) similar to [19] and face similar problems.

A lightweight fault tolerant mechanism for NoCs based on default backup paths has been proposed by Koibuchi et al. [15]. These backup paths can maintain the connectivity of healthy routers and processing cores in the presence of faults. However, the critical path for packet transmission grows with the number of faulty routers on the transmission path of the packet. Moreover, the packet is transferred via both the intermediate routers and their local PEs. To avoid the transmission overhead via PEs, additional logic is required, which increases the area and power overhead. Another issue with the proposed architecture is its scalability. The authors claim that all the PEs can still be connected even in case of failure of all the routers. In this case, the topology becomes a ring, which cannot meet the bandwidth requirements of hundreds of connected cores.

Lotfi-Kamran et al. [21] present a decision making routing algorithm to avoid congestion in 2D NoC architectures. In addition, the proposed dynamic routing approach can tolerate a single link failure. However, the resources connected to a faulty link, e.g. VC buffers and control logic, cannot be utilized by the proposed technique. A similar issue occurs with other fault tolerant routing techniques.

We propose a NoC architecture that deals with faulty conditions by utilizing the available communication resources. The main goal of this work is to offer low communication latency and high network throughput while minimizing design overheads. We balance the tradeoff between performance, power consumption and area with an efficient resource utilization technique called partial virtual channel sharing. It also reduces the impact of different fault scenarios on system performance without requiring additional redundant hardware. The proposed architecture maintains system performance by utilizing fault-free components which, in conventional architectures, could not be used due to faults in other resources. In addition, the proposed architecture works reliably even if the routing logic is faulty, and it keeps the processing element accessible to the network in case of a failure of its router, without affecting the critical path of the architecture.

3 Channel Utilization

Efficient resource utilization is necessary to execute a given application with minimal overhead. The first step towards enhancing the utilization of interconnection resources consists of examining the purpose of each network resource individually. If multiple applications execute on an MPSoC simultaneously, the traffic pattern is unpredictable, which makes it difficult to analyze the utilization of individual resources. The routing algorithm controls the utilization of communication channels. The system then utilizes the router resources according to the load on the incoming channels. We therefore use channel load analysis as the basis for determining the utilization of these router resources in the following sections, for both synthetic and application specific benchmarks.

3.1 Synthetic Traffic Analysis

In synthetic traffic analysis, the average load for each link is determined for a variety of traffic patterns. In our case, uniform, transpose, bit complement and negative exponential distribution (NED) traffic is analyzed with XY routing. In the uniform traffic pattern, a node sends a packet to any other node with equal probability, while in the transpose traffic pattern, each node (i,j) communicates only with node (j,i). For bit complement traffic, each node (i,j) communicates only with node (M-1-i,N-1-j) if the mesh size is M×N. NED is a synthetic traffic model based on the negative exponential distribution, where the likelihood that a node sends a packet to another node decreases exponentially with the hop distance between the two cores. This synthetic traffic profile accurately captures key statistical behavior of realistic traces of communication among the nodes [29]. Figure 3 shows the percentage load for each link on the network for different traffic patterns, measured by Eq. 1.
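The destination selection rules above can be sketched as a small Python model. This is illustrative only: the transpose and bit complement rules follow the definitions in the text, while the base-2 exponential decay used for NED is an assumption, since the exact distribution parameters of [29] are not reproduced here.

```python
import random

def destination(i, j, M, N, pattern, rng=random):
    """Pick a destination node for source (i, j) on an M x N mesh
    under one of the synthetic traffic patterns described above."""
    if pattern == "uniform":
        # any node other than the source, with equal probability
        while True:
            d = (rng.randrange(M), rng.randrange(N))
            if d != (i, j):
                return d
    if pattern == "transpose":
        return (j, i)                      # (i, j) -> (j, i)
    if pattern == "bit_complement":
        return (M - 1 - i, N - 1 - j)      # (i, j) -> (M-1-i, N-1-j)
    if pattern == "ned":
        # probability decays exponentially with hop distance
        # (decay base 2 is an assumed parameter)
        nodes = [(x, y) for x in range(M) for y in range(N)
                 if (x, y) != (i, j)]
        weights = [2.0 ** -(abs(x - i) + abs(y - j)) for (x, y) in nodes]
        return rng.choices(nodes, weights=weights)[0]
    raise ValueError(pattern)
```

Note that the transpose pattern is well defined only for square meshes (M = N), which is the case considered in Fig. 3.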

$$ L_{(i,j)\rightarrow{}(k,l)} = \frac{\text{TLL}_{(i,j)\rightarrow{}(k,l)}}{\text{TNL}} $$
(1)

where,

$$ \text{TLL}_{(i,j)\rightarrow{}(k,l)} = \sum\limits_{\underset{0 \le x' \le M-1}{0 \le x \le M-1}} \sum\limits_{\underset{0 \le y' \le N-1}{0 \le y \le N-1}} \big[\, S(x,y) \rightarrow D(x',y') \text{ routed via } (i,j) \text{ and then } (k,l) \,\big] $$

and

$$ \text{TNL} = \sum\limits_{\underset{0 \le n \le N-1}{0 \le m \le M-1}} \sum\limits_{\underset{0 \le p \le N-1}{0 \le o \le M-1}} \text{TLL}_{(m,n)\rightarrow{}(o,p)} $$
Fig. 3

Traffic load analysis for XY-routing. a Uniform traffic load. b Transpose traffic load. c Bit complement traffic load. d NED traffic load

In the above, to measure the total link load (TLL) on a specific link directed from node (i,j) towards node (k,l), the traffic load from the source nodes represented by S(x,y) routed via node (i,j) and then via node (k,l) towards destination nodes represented by D(x’,y’) is considered. The destination node D(x’,y’) could be the node (k,l) because these packets will contribute to the link load for the link directed from node (i,j) towards node (k,l). For total network link load (TNL), the link load of all the interconnection links is summed up. The expression is topology independent and can be extended to any number of dimensions.
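Equation 1 can be implemented directly. The sketch below (illustrative Python, not from the paper) computes TLL per directed link under deterministic XY routing and normalizes by TNL; `flows` is an arbitrary map from source-destination pairs to packet counts, so any of the traffic patterns above can be plugged in.

```python
from collections import Counter

def xy_route(src, dst):
    """XY routing: move along the X dimension first, then along Y;
    returns the full sequence of nodes visited."""
    (x, y), (xd, yd) = src, dst
    path = [(x, y)]
    while x != xd:
        x += 1 if xd > x else -1
        path.append((x, y))
    while y != yd:
        y += 1 if yd > y else -1
        path.append((x, y))
    return path

def link_loads(flows):
    """Eq. 1: for every directed link (i,j)->(k,l), the total link
    load TLL divided by the total network link load TNL."""
    tll = Counter()
    for (src, dst), packets in flows.items():
        path = xy_route(src, dst)
        for a, b in zip(path, path[1:]):  # each hop loads one link
            tll[(a, b)] += packets
    tnl = sum(tll.values())
    return {link: load / tnl for link, load in tll.items()}
```

For example, with one packet from (0,0) to (2,0) and one from (1,0) to (2,0), the link (1,0)→(2,0) carries two thirds of the total link load.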

The normalized link load percentages computed using Eq. 1 for uniform, transpose, bit complement and NED traffic loads are shown in Fig. 3a, b, c and d, respectively, with XY routing. Consider node ‘12’ in Fig. 3b. The input ports receiving data from nodes ‘11’ and ‘13’ are not used at all during the whole simulation, independent of the total simulation time, whereas the input ports from nodes ‘02’ and ‘22’ do receive traffic. The traffic load from node ‘22’ is two times the load from node ‘02’. The link from node ‘22’ towards ‘12’ is overloaded but cannot utilize the available resources of the other ports. Similar behavior can be observed for odd-even routing.

In all investigated cases, some input ports are overloaded as compared to other ports. In order to balance the load and to enhance the resource utilization, resources can be shared among over- and underutilized input ports. The threshold for distinguishing between over- and underutilization is selected in such a way that half of the ports have a higher load value and the other half has a lower load value.

The resources could be shared among all the input ports, but this would require large crossbar switches, which increases power consumption, area and switching delay. The other option is sharing the resources among multiple ports (but not among all ports) so that loads are balanced, resource utilization is improved, and throughput is close to an architecture with full VC buffer sharing.

In higher-dimensional NoCs (e.g. 3D NoCs), the partial virtual-channel sharing (PVS) approach benefits from the increased number of router ports, which opens more grouping options as listed below:

$$\begin{array}{rll} \text{Typical 2D-mesh: }\texttt{R}(5)&:&<(5), (4, 1), (2, 2, 1), \ldots, (1, 1, 1, 1, 1)>\\ \text{Stacked 3D-mesh [30]: }\texttt{R}(6)&:&<(6), (5, 1), (2, 2, 2), \ldots, (1, 1, 1, 1, 1, 1)>\\ \text{Typical 3D-mesh: }\texttt{R}(7)&:&<(7), (6, 1), (3, 3, 1), \ldots, (1, 1, 1, 1, 1, 1, 1)> \end{array} $$

where R(n) represents a router with n ports and <(p, q,...)... (f, g, h,...)> represents the set of different grouping options. Each grouping option is denoted as a tuple of group sizes; for example, (3, 2, 2) represents one group of 3 ports and two groups of 2 ports. Ports in the same group can share their resources. In the given examples, the stacked 3D-mesh NoC differs from the typical 3D-mesh NoC in its inter-layer interconnects. The stacked 3D-mesh NoC is a NoC-Bus hybrid architecture which requires a 6-port router, since the bus adds a single additional port to the generic 2D 5-port router for inter-layer communication in both directions (up/down) [30]. On the other hand, the typical 3D-mesh NoC requires a 7-port router with two extra ports for upward and downward communication compared to the generic 2D 5-port router.
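The grouping options for an n-port router are exactly the integer partitions of n, so they can be enumerated with a small recursive generator (illustrative Python, not part of the proposed architecture):

```python
def grouping_options(n, max_group=None):
    """Yield every grouping option for an n-port router R(n) as a
    non-increasing tuple of group sizes (an integer partition of n),
    e.g. R(5) -> (5,), (4, 1), ..., (2, 2, 1), ..., (1, 1, 1, 1, 1)."""
    if max_group is None:
        max_group = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_group), 0, -1):
        # the remaining ports are split into groups no larger than
        # `first`, which keeps every emitted tuple non-increasing
        for rest in grouping_options(n - first, first):
            yield (first,) + rest
```

For R(5), R(6) and R(7) this yields 7, 11 and 15 grouping options, respectively.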

3.2 Application Traffic Analysis

The MPEG4 application introduced in [14] has been selected for resource utilization analysis. The NoC-mapped application and its bandwidth requirements are shown in Fig. 4. Consider, for example, the link loads of the DR-SDRAM node. If its East and South ports share their resources, the heavy load of 942 MB/s from the East can utilize the resources of the South port, which receives a smaller load of 60.5 MB/s. Thus, sharing communication resources among multiple ports can balance the input load on all ports without increasing the crossbar size too much. In this case, the average load per port within a sharing group is comparable to the average load per port that could be achieved by sharing the resources of all four ports.

Fig. 4

MPEG4 application [14]

The ports are grouped so that the load sums of the different groups are balanced. The port with the maximum load should be grouped together with the port with the minimum load. Grouping it with an average-load port would not make sense, because such a port has neither free resources to offer nor a need for extra resources. For load balancing, the selection of ports to share the resources should be made at design time according to Algorithm 1, described and analyzed below.

The input parameters of the algorithm are the number of router ports, P; the number of VCs per port, V; the vector of input bandwidth requirements for each router port, L; and the number of partitions (groupings) of ports, d. In the algorithm, the average bandwidth requirement per VC is represented by l, and the number of ports sharing the VC buffers in each group is represented by the vector S. Similarly, the total bandwidth requirement for each group of ports sharing the VC buffers is represented by the vector W. The output of the grouping algorithm is the set of group combinations, represented by C.

The total number of VC buffers in the router is P × V. Similarly, the total number of VC buffers in group i is S_i × V. The value of l is computed by summing the bandwidth requirements of all the router input ports and dividing by P × V. The input ports in group i are grouped to share the VC buffers in such a way that the total incoming load for the group, ∑L(C_i), approaches the value of l × S_i × V. The sum of all the S values is equal to P. Each value of W is equal to l times the corresponding value of S. Different combinations of ports are tested using a for loop such that the sum of the bandwidth requirements of the ports in the combination is close to the corresponding value of W. The whole process is repeated until the best combination is achieved, i.e., the difference between W and the sum of the communication bandwidth requirements of the ports in the group is minimal. Finally, the grouping combinations, C, are returned.
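Since Algorithm 1 itself is not reproduced in this section, the following Python sketch shows one greedy interpretation of the grouping step under the stated definitions: each group's target load is W_i = l × S_i × V, and the combination of ports whose load sum is closest to that target is chosen for each group in turn. The function name and the greedy selection order are assumptions for illustration.

```python
from itertools import combinations

def group_ports(P, V, L, S):
    """Greedy sketch of the grouping step: P ports, V VCs per port,
    L[p] = input bandwidth of port p, S[i] = number of ports in
    group i (sum(S) == P).  Returns the chosen groups C as tuples
    of port indices."""
    assert sum(S) == P
    l = sum(L) / (P * V)                 # average bandwidth per VC
    W = [l * s * V for s in S]           # per-group target loads
    remaining, C = set(range(P)), []
    for s, w in zip(S, W):
        # pick the s-port combination whose total load is closest
        # to the group's target W[i]
        best = min(combinations(sorted(remaining), s),
                   key=lambda c: abs(sum(L[p] for p in c) - w))
        C.append(best)
        remaining -= set(best)
    return C
```

With the MPEG4-style loads discussed in Section 3.2, e.g. ports [942, 60.5, 500, 500] MB/s split into two groups of two, the sketch pairs the heaviest port (942) with the lightest (60.5), matching the intuition given above.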

The proposed algorithm can support any topology, including irregular topologies, with any number of ports P. For example, Murali et al. [23] propose an application-specific power-efficient topology which requires an eleven-port router. If Algorithm 1 is used to generate the grouping combinations according to the input load requirements, further performance enhancement can be achieved by selecting an optimal combination from the following set of potential groupings:

$$\begin{array}{l} \texttt{R}(11) : < (11), (10, 1), (6, 5), (5, 5, 1), (4, 4, 3),\\ \qquad (3, 3, 3, 2), \ldots, (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) >\end{array} $$

4 Fault Scenarios

In this section, effects of different faults in router and communication links are described. Fault tolerance techniques for PEs are out of the scope of this paper. Approaches like [19,39] can be used to create fault tolerant NIs.

Case 1 Faulty Links

Consider that a fault occurs on a network inter-router link. In typical architectures, the input/output buffers connected to this link cannot be utilized anymore. Assume that there are X faulty links in a NoC based system, each input port contains V virtual channels with buffer depth d, and each flit (buffer word) is f bits wide. If there is a VC controller for each virtual channel, the resources which cannot be used by the system amount to X · V · d · f fault-free memory cells, X · V · f connecting wires and X control logic units.
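The stranded-resource counts above follow directly from the parameters; a toy helper (illustrative only, names assumed) makes the arithmetic concrete:

```python
def stranded_resources(X, V, d, f):
    """Fault-free resources left unusable by X faulty links when each
    input port has V virtual channels of depth d and flit width f
    bits, per the counts given in the text."""
    return {
        "memory_cells": X * V * d * f,       # X . V . d . f
        "connecting_wires": X * V * f,       # X . V . f
        "control_logic_units": X,            # X
    }
```

For example, two faulty links in a network with 4 VCs per port, 4-flit-deep buffers and 32-bit flits strand 1024 fault-free memory cells.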

These resources could be switched off using power gating to avoid at least their unnecessary power consumption. However, this would still leave their chip area wasted. Instead, it is more beneficial to utilize these resources to improve system performance, or to reduce performance degradation in case of faults. For example, consider the NoC platform shown in Fig. 5. A packet routed from node ‘001’ to node ‘002’ would take the vertical link upward in the absence of faults. If the communication link between the nodes is broken as shown in Fig. 5, the packet is rerouted via nodes ‘101’ and ‘102’. A similar situation occurs when node ‘111’ is the source and node ‘000’ is the destination. Now, the resources on the new routes, and especially the router at node ‘101’, are overloaded. The input ports from ‘111’ and ‘001’ are overloaded due to the fault, while other inputs might not be congested.

Fig. 5

Routing in presence of faulty links

Case 2 Deadlock due to Faults

VCs are used to avoid deadlock as discussed in Section 1. Consider a situation where there are N VCs and N-1 of them are faulty. Such a fault can occur in any part of the FIFO, e.g. the Rd./Wr. controller, content counter or memory flip-flops, see Fig. 2. In this case, the architecture becomes equivalent to the typical non-VC architecture. Thus, deadlock can occur if the routing relies on the availability of virtual channels for deadlock avoidance.

Case 3 Load Management

If a FIFO is faulty, the communication capacity of the corresponding port is reduced and may be exceeded by the number of packets to be transmitted over the port’s VCs. Now consider that the neighboring port is free and thus its VC buffers are available. Without buffer sharing, the overloaded port cannot utilize these resources to manage the load. Such faults become a bottleneck for overall system performance. Previous work does not consider load management mechanisms at the micro-architecture level, although there are traffic routing algorithms dealing with load balancing on a higher level, as discussed in Section 2.

Case 4 Faulty Routing Logic

If a fault occurs in the logic of a VC allocator, the corresponding physical link and VC buffers cannot be used anymore. The traffic then needs to be re-routed using a fault-tolerant routing algorithm as discussed in Section 2. The use of non-minimal routes may be necessary. In this case, not only is the system throughput considerably reduced, but there is also unnecessary power consumption in the VC buffers and control logic. Congestion on some nodes and power consumption due to non-minimal paths may raise thermal issues, as there is a vicious circle between heat and power consumption [33]. This scenario is similar to Case 1 but differs in the faulty resource (VC routing logic instead of the inter-router link).

Case 5 Resource Reclamation under Faults

Assume that a fault occurs in the routing logic or VC allocator of one port and that another fault occurs on the physical link of a second port. Without sharing, the physical link of the first port cannot utilize the VC allocator and buffers of the second port. So neither physical link can be used for packet transmission. This may cause the complete router to fail.

Case 6 Processing Element Recovery

Once a router has been marked faulty or its link to the local PE is broken, a fully functional PE is isolated from the rest of the system. Network level fault tolerance cannot address this problem. Multiple-NI architectures and default backup paths have been proposed to maintain the accessibility of the PE to the rest of the system, as discussed in Section 2.2. However, the multiple-NI architecture is inefficient with respect to area and power, and the default backup path approach increases packet latency.

5 Partial Virtual Channel Sharing (PVS) Router Architecture

To address the fault tolerance issues discussed in Section 4, we propose the partial virtual-channel sharing NoC (PVS-NoC) architecture. Due to sharing, the proposed approach enhances VC utilization because free buffers can be utilized by other channels. Maximum VC utilization could be achieved by sharing among all the input ports. However, full sharing increases the control logic complexity and power consumption. Thus, a tradeoff between resource utilization and power consumption is needed.

This tradeoff can be achieved with the PVS approach by forming groups with a limited number of input ports that share resources according to the communication requirements. With this technique, the buffer utilization is increased and comes close to the utilization level of the fully shared architecture without suffering its significant silicon area and power consumption overhead.

Data is injected into the network in the form of packets produced by the Network Interface (NI). While receiving a packet, the NI de-packetizes it and delivers the payload to the PE. The packet format is shown in Fig. 6. The header flit carries the operational code (OP), the source address (SA), and the destination address (DA). The beginning of packet (BOP) and end of packet (EOP) are the indicators of header and tail flits, respectively. SA and DA are composed of two parts: horizontal (X) and vertical (Y) coordinates. The number of bits for X and Y are determined by the number of rows and columns in a 2D mesh. The extra bit (0/1) in DA is used for PE recovery and will be explained under Case 6 in Section 6.

Fig. 6

Data transmission format
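The header flit fields can be sketched as a simple pack/unpack pair. The text specifies which fields exist (BOP, EOP, OP, SA, DA, PE-recovery bit) but not their widths or positions, so the 16-bit layout below, with 2 bits per coordinate for a 4×4 mesh and a 4-bit opcode, is entirely an assumed example.

```python
# Assumed layout for a 16-bit header flit on a 4x4 mesh (2 bits per
# coordinate); field widths and positions are illustrative only.
BOP = 1 << 15   # beginning-of-packet (header flit) indicator
EOP = 1 << 14   # end-of-packet (tail flit) indicator

def make_header_flit(op, sa, da, pe_bit=0):
    """Pack BOP, opcode OP, source address SA = (x, y), destination
    address DA = (x, y) and the extra PE-recovery bit into one flit."""
    (sx, sy), (dx, dy) = sa, da
    return (BOP | (op << 10)
            | (sx << 8) | (sy << 6)
            | (dx << 4) | (dy << 2) | (pe_bit << 1))

def parse_header_flit(flit):
    """Unpack the fields written by make_header_flit."""
    return {
        "op": (flit >> 10) & 0xF,
        "sa": ((flit >> 8) & 0x3, (flit >> 6) & 0x3),
        "da": ((flit >> 4) & 0x3, (flit >> 2) & 0x3),
        "pe_bit": (flit >> 1) & 0x1,
    }
```

A larger mesh would simply widen the coordinate fields, as the text notes that the X and Y widths follow from the number of rows and columns.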

5.1 Virtual Channel Sharing Logic

The PVS approach is implemented on the input ports of the router. The buffer utilization is enhanced by dynamically allocating free buffers to overloaded ports. The definition of which buffers are shared among which ports is parameterized and can be adjusted to match any number of input ports according to the topology requirements. Only the processing element uses dedicated buffers for packet injection which are not shared with any other router ports.

In the PVS approach, the input control logic is responsible for buffer allocation and for receiving the data packets. An example of the PVS architecture, with two groups of two channels sharing VC buffers, is shown in Fig. 7. Within each group, each port has its own (distributed) routing logic, whereas VC allocation is centralized. Both the VC allocator and the routing logic operate independently, without communicating with the control logic of other groups. The task of the VC allocator is to keep track of free buffers and to allocate them to the incoming traffic. After allocation, the routing logic computes the route for the packet and controls the crossbar for packet switching. In Fig. 7, ‘Routing Logic_U’ refers to the control logic for the upper group of virtual channel buffers and ‘Routing Logic_L’ refers to the control logic for the lower group.

Fig. 7

Proposed PVS approach for conventional Virtual Channel Architecture

5.1.1 Selection Criteria

For load balancing, the selection of ports to share their VC buffers should be made on the basis of the number of router ports, the number of VCs per port, input bandwidth requirements for each input port and the number of groups (sharing VCs). The input ports are grouped to share the buffers in such a way that the total incoming load for the whole group approaches the average value of bandwidth requirements per VC times the number of VC buffers in the current group as described in Algorithm 1.

The MPEG4 application presented in [14] was discussed in Section 3.2 for link load analysis. In the scenario described, the East and South ports of the DR-SDRAM node should share their VC buffers, and the West and North ports should form a second sharing group. The heavy traffic amounting to 942 MB/s at the East port can utilize the resources of the South port, which receives only 60.5 MB/s. When one of the incoming links of the DR-SDRAM node fails, its VC buffers can be utilized by the other port in the same group when needed. On the other hand, under control logic faults, a port can utilize the VC resources of other ports in its group.

5.1.2 Allocation Policy

The PVS approach uses wormhole switching with partial sharing of virtual channel buffers. Due to sharing, race conditions may occur. This happens, for example, if only one buffer is available and multiple channels in the sharing group request ownership of the same buffer. To avoid such situations, a ‘Pre-Grant’ signal is used as shown in Fig. 7. ‘Pre-Grant’ is a single-bit signal for each physical channel and is set to high by default. When only two VCs are left for allocation and a VC request comes from one of the physical channels, one VC is granted and the ‘Pre-Grant’ signal of that physical channel is set to ‘0’. After that, only the remaining physical channel can access the remaining VC. This avoids the race condition. However, starvation can happen to the channel which does not get access to a VC. To address this problem, whenever a VC becomes free, the ‘Pre-Grant’ signal is raised for the channel which was denied access previously or has been waiting for the longest period. Another option is to assign a priority value to each channel in a sharing group. This approach is more useful for handshake protocols. For application specific NoCs, the priority is proportional to the channel bandwidth requirements. The channel with the higher priority value is allowed to use the buffer.
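The Pre-Grant policy described above can be modeled behaviorally. The sketch below is a software analogy of the hardware mechanism, under stated assumptions: the class name and interface are invented, and corner cases of the real allocator (e.g. restoring Pre-Grant bits when many VCs free up at once) are not modeled.

```python
from collections import deque

class PreGrantAllocator:
    """Behavioral sketch of the 'Pre-Grant' policy for one sharing
    group: when only two free VCs remain and a channel is granted
    one, that channel's Pre-Grant bit drops to 0 so the other
    channel can take the last VC; a freed VC re-enables the
    longest-waiting denied channel first."""

    def __init__(self, channels, num_vcs):
        self.free_vcs = num_vcs
        self.pre_grant = {ch: True for ch in channels}  # high by default
        self.denied = deque()   # FIFO of channels denied a VC

    def request(self, ch):
        if self.free_vcs == 0 or not self.pre_grant[ch]:
            if ch not in self.denied:
                self.denied.append(ch)
            return False
        if self.free_vcs == 2:          # entering the race-prone state
            self.pre_grant[ch] = False  # reserve the last VC for others
        self.free_vcs -= 1
        return True

    def release(self):
        self.free_vcs += 1
        if self.denied:  # raise Pre-Grant for the longest-waiting channel
            self.pre_grant[self.denied.popleft()] = True
```

With two VCs left, a grant to the North channel clears its Pre-Grant bit, so only the East channel can claim the last VC; once a VC is released, the starved North channel is re-enabled.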

5.1.3 Routing Algorithm

Different routing algorithms can be used with PVS technique. However, there is a possibility of deadlock when more than one input ports share their VCs. For instance, if the North and East input ports share their resources and all the VCs are occupied by the flits coming from the East and going to the West, then the flits traversing from the North to the South direction in the upstream router (i.e., North router) have to wait. If the scenario results in a cyclic dependency, then a deadlock will occur.

In order to avoid deadlock with static routing algorithms such as static XY, at least one VC should be dedicated to each input port. Therefore, for a PVS unit with w VCs shared among u input ports, u dedicated VCs are needed (one per input port) and the remaining (w − u) VCs can be shared. Similarly, since dynamic routing algorithms require at least two VCs per input port to avoid deadlock [26], (w − 2u) VCs can be shared.
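The dedicated/shared split above reduces to a one-line formula; the following illustrative helper makes the two cases explicit:

```python
def shareable_vcs(w, u, dynamic_routing=False):
    """Number of VCs in a PVS sharing group that may be shared safely.

    w: total VCs in the group; u: input ports in the group.
    Static routing (e.g. XY) needs one dedicated VC per port; dynamic
    routing needs two per port to stay deadlock-free [26].
    """
    dedicated_per_port = 2 if dynamic_routing else 1
    shared = w - dedicated_per_port * u
    if shared < 0:
        raise ValueError("not enough VCs for deadlock-free routing")
    return shared
```

For example, a group of 6 VCs over 2 ports may share 4 VCs under static routing but only 2 under dynamic routing.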

5.2 Crossbar Switch

The output part consists of a typical N × N crossbar switch with central control logic, where N is the total number of ports including the local PE port. The crossbar size can be customized according to the topology requirements. Wormhole switching is used for packet transmission, which makes efficient use of buffer space as the number of flit buffers per VC can be less than the packet size [35].
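Behaviorally, an N × N crossbar reduces to one N-to-1 multiplexer per output, driven by the central control logic. A minimal sketch (illustrative model, not the synthesized RTL):

```python
def crossbar(inputs, select):
    """Model of an N x N crossbar: outputs[i] = inputs[select[i]].

    'select' is the per-output choice made by the central control logic;
    None leaves that output idle for the current cycle.
    """
    assert len(inputs) == len(select), "N x N crossbar: one select per output"
    return [inputs[s] if s is not None else None for s in select]
```

Growing the crossbar (e.g. to feed shared VC groups) only adds multiplexers of this form, which is why PVS pays for sharing mainly in crossbar size.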

5.3 Comparison with Existing Architectures

A 5-port router with two unidirectional links per port and 10 internal buffers has been investigated and compared with other NoC architectures. Table 1 shows the result of this comparison.

Table 1 Comparison with existing NoC router architectures

The typical VC NoC represents a conventional virtual channel NoC architecture with 2 VCs per port which uses unidirectional channels to communicate with neighboring routers. Thus, two channels are required between neighboring routers for two-way communication. In case of heavy traffic load on a certain port, the typical virtual channel architecture can provide only 2 VCs to receive the packets on that port. The PVS-NoC can provide 4 VCs to the same port under the same heavy load. Thus, the VC availability is doubled at the cost of a slightly larger crossbar.

BiNoC has two bidirectional channels per port, whose direction can be switched at run-time to meet communication requirements [16]. Compared to the BiNoC architecture with 10 in-out channels, the PVS-NoC approach provides 5 input and 5 output physical channels. PVS-NoC can provide 4 input and 4 output VCs per physical port, whereas BiNoC has only two physical channels per port, without VCs. The option of direction selection is provided at the cost of a large crossbar switch. Another issue to be addressed here is scalability. In our proposed architecture, the number of VC buffers can be selected according to the application and topology requirements. To insert a new VC, only a buffer and a controller are needed, without any modification to the existing logic and at the cost of only a slight increase in crossbar resources. To insert a new buffer in the BiNoC architecture, a separate buffer allocator is required and the crossbar grows significantly.

The DSB-175 and DSB-300 router architectures have been described in [31]. To make an exact comparison, we define a DSB-160 architecture in accordance with [31]. The DSB-160 is a router with 160 flits of aggregate buffering. The buffers are divided between 5 middle memory banks with a 16-flit buffer per bank, and 80 flits of aggregate input buffering comprising two 8-flit buffers (VCs) at each input port. The five memory banks are not considered in the comparison in Table 1 because only one flit can be written into and read from a middle memory in the DSB architecture, which reduces the utilization of the memory banks. Thus, static power consumption without a corresponding increase in system performance is the major overhead of the DSB architecture compared to PVS-NoC.

For further comparison, the fully virtual channel shared NoC (FVS-NoC) architecture has been investigated. In this architecture, any of the 10 VC buffers can be allocated to any input port. FVS-NoC channels are unidirectional. The architecture provides the maximum utilization of VC buffers at the cost of significantly larger crossbars, which makes the solution expensive in area and power compared to the proposed architecture.

6 Performance Sustainability Under Faults

The main feature of the PVS-NoC architecture is that it retains system performance up to a certain level after the occurrence of faults. In NoC based interconnection platforms, a fault can occur in three types of components: physical link, buffer and controller. Subsequently, the fault cases mentioned in Section 4 are addressed. The dashed boundaries (− − −) in Figs. 8, 9, 10, 11 and 12 represent fault-free resources which are not functional due to faults on other resources.

Fig. 8
figure 8

Resource utilization under faults by PVS approach. a Conventional architecture: fault affecting resource utilization. b PVS architecture: fault in channel not affecting use of buffers

Fig. 9
figure 9

Load management by PVS approach under faults on VC buffers. a Conventional architecture: Faulty VC buffers making the port overloaded and deadlock can occur. b PVS Architecture: Impact of faulty buffers reduced and deadlock avoided

Fig. 10
figure 10

Impact of faulty routing logic on routing

Fig. 11
figure 11

Fault tolerance by PVS approach. a Conventional architecture: faulty routing logic requires re-routing of packets. b PVS Architecture: routing logic fault tolerated

Fig. 12
figure 12

Resource reclamation by PVS approach under multiple faults. a Conventional architecture: faulty routing logic and faulty channel disable both ports. b PVS architecture: functional resources reclaimed

Case 1 Faulty Links - Solution

When a fault occurs on a physical link, buffers and control logic cannot be used by the NoC based system as can be observed in Fig. 8a. If a fault occurs in ‘Channel_0’, its VC buffers and routing logic cannot be used to route a packet. Now, consider this situation for the PVS approach as shown in Fig. 8b. If the fault occurs on ‘Channel_0’, ‘Channel_1’ can utilize the VC buffers and control logic to enhance the system throughput and avoid the unnecessary static power consumption by the VC buffers and control logic.

Case 2 Deadlock due to Faults - Solutions

If too many VC buffers of a given port become faulty, the number of available VCs may drop below the requirements of a deadlock-free routing algorithm. An example is depicted in Fig. 9a. For the input port of ‘Channel_0’, all but one VC buffer are faulty. The architecture becomes equivalent to a non-VC architecture and deadlock may occur. Now consider the PVS architecture shown in Fig. 9b. Only one of the three VC buffers normally allocated to ‘Channel_0’ is left, but the port can benefit from the shared VC buffers of ‘Channel_1’ and maintain enough VCs to operate without deadlock.

Case 3 Load Management - Solution

Sharing VC buffers among the input ports of a group balances the load on input buffers, provides relief to overloaded ports, and mitigates the negative effect of a faulty FIFO. In a typical VC architecture, if one or multiple buffers become faulty as shown in Fig. 9a, the port (here: ‘Channel_0’) becomes overloaded as packets are waiting to be routed but fewer functioning resources are available. In the PVS architecture, as shown in Fig. 9b, the available VC buffers of ‘Channel_1’ can be used to route the blocked packets. Thus, the fault impact is distributed equally over ‘Channel_0’ and ‘Channel_1’.

Case 4 Faulty Routing Logic - Solution

Consider the 3 × 3 NoC mesh shown in Fig. 10. If a fault occurs at node ‘12’ in the routing logic for the input port from node ‘11’, the packets for node ‘12’ from node ‘11’ are re-routed through the paths shown in blue according to the assumed fault-tolerant routing algorithm [34]. However, any other fault-tolerant routing scheme would exhibit a similar issue. All the resources on that input port cannot be used by the NoC system, as shown in Fig. 11a. Without PVS, no packets can be routed through ‘Channel_0’ anymore due to the fault of ‘Routing_Logic_0’.

With the PVS approach, the VC allocator marks the routing logic as faulty, and the VC buffers controlled by that routing logic are not allocated to any packet for transmission. However, if the fault occurs on ‘Routing_Logic_0’, ‘Channel_0’ can still be used as shown in Fig. 11b. Only the buffers shared by ‘Channel_1’ are available for both ‘Channel_0’ and ‘Channel_1’, i.e., less bandwidth is available. But the packets do not need to be rerouted, and thus the fault can be tolerated with minimal overhead.
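The fallback behavior in this case amounts to a lookup over the sharing group's fault flags. A hypothetical helper (names assumed, not from the original design files):

```python
def pick_routing_logic(group, faulty):
    """Return a functional routing-logic id from a sharing group, or None.

    group: routing-logic ids in the sharing group, own logic first;
    faulty: set of ids the VC allocator has marked faulty.
    If the port's own logic is faulty, a peer's logic in the group routes
    packets arriving on the same channel, so no re-routing is needed.
    """
    for logic in group:
        if logic not in faulty:
            return logic
    return None  # whole group faulty: packets must be re-routed
```

Only when every routing logic in the group has failed does the network fall back to a fault-tolerant re-routing scheme such as [34].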

Case 5 Resource Reclamation under Faults - Solution

If a fault occurs in the routing logic of one port, the port’s resources cannot be used for packet transmission even if its link and VC buffers are still functional. Similarly, if a fault occurs on a physical link, the connected port cannot receive packets from that link even if its routing logic, VC buffers and mux/demux are functional. Now consider that these faults occur on resource-sharing ports of the same switch as shown in Fig. 12a, i.e., faults occur on ‘Routing_Logic_0’ and ‘Channel_1’. Without PVS, both ports cannot receive packets, and many functional subcomponents cannot contribute to system performance while still consuming static power. With the PVS approach, the functional resources can be reclaimed: ‘Channel_0’ can use ‘Routing_Logic_1’ to receive packets, as shown in Fig. 12b. This sustains system performance and avoids wasting power.

Case 6 Processing Element Recovery - Solution

The critical path for packet transmission via shared VC buffers limits the maximum operating frequency. The length of the critical path increases with the number of input ports that share VC buffers, due to the increased complexity of the VC controller and the input multiplexer. If the VC buffers dedicated to a local PE are shared with another input port as shown in Fig. 14, the resulting path is no longer than the other paths in the router. Hence, it will not affect the maximum operating frequency. This provides an opportunity to recover PEs in the presence of network-level faults.

For PE recovery, each PE is connected to multiple routers by a multiplexer-demultiplexer pair. For demonstration, PE0 is connected to two routers, R0 and R1, as shown in Fig. 13. Any existing on-line router fault detection technique, for example the one proposed in [1], can be used in the ‘Fault Detector’ component. When the fault detector detects a fault of router R0, it changes the injection and reception paths of PE0 to backup paths using the multiplexer (Rx) and demultiplexer (Tx). After fault occurrence on router R0, PE0 transmits and receives the traffic via router R1. In this novel approach, each PE can be connected to multiple routers using only a single network interface (NI), keeping the overhead low and the system compatible with single-NI cores. In addition, the overhead is very low compared to the multiple-NI architecture presented in [19]. The critical path length for packet transmission is slightly increased by a 2 × 1 multiplexer, which is significantly lower overhead compared to the backup path approach presented in [15].
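The injection/reception path switching can be modeled as follows (a behavioral sketch based on Fig. 13; the router names and fault-detector callback are illustrative assumptions):

```python
class PERecovery:
    """Model of the Rx/Tx multiplexing that attaches a PE to two routers."""

    def __init__(self, primary="R0", backup="R1"):
        self.primary, self.backup = primary, backup
        self.primary_faulty = False

    def on_fault_detected(self, router):
        """Called by the 'Fault Detector'; flips the paths to the backup
        router when the primary router fails."""
        if router == self.primary:
            self.primary_faulty = True

    def active_router(self):
        """Router currently carrying PE traffic (both injection and reception)."""
        return self.backup if self.primary_faulty else self.primary
```

A single NI drives both paths; only the select line of the Rx multiplexer and Tx demultiplexer changes when a fault is reported.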

Fig. 13
figure 13

PE recovery architecture

Fault-tolerant routing algorithms, for example [38] and [21], can only deliver the traffic to the desired destination if the PE’s router is functional. Using the proposed PE recovery approach, the traffic can reach its destination even if the router is faulty, but this requires a new addressing scheme.

Consider that router R0 in Fig. 13 is marked as faulty. PE0 can still transmit and receive traffic through router R1. Besides the failure information of R0, the address of the router (R1) which replaces R0 has to be broadcast in the network. Nodes that generate and transmit data originally destined for R0 have to update their local address information and address their data to destination R1 instead of R0. R1 must differentiate between packets for the local PE of R1 and packets for PE0. This is achieved by an additional bit appended to the destination address (DA) as shown in Fig. 6. Only one additional bit is needed if PEs are connected to two routers. If PE0 is connected to more than two routers, multiple bits are required. Whenever the appended bit is set to ‘1’, the packet is delivered through the backup path to the PE of the neighboring faulty router node (PE0). In this scenario, it is possible to use all the PEs in the system even if 50 % of the routers fail, provided that only one router in each pair fails, as shown in Fig. 13.
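Assuming the appended bit occupies the least significant position of the extended destination address, the scheme can be sketched as follows (field placement and widths are our assumption, not specified in the text):

```python
def encode_destination(router_addr, backup_bit):
    """Append the backup-path bit to the destination address (cf. Fig. 6).

    backup_bit == 1 means: deliver through the backup path to the PE of
    the neighboring faulty router node (e.g. PE0).
    """
    assert backup_bit in (0, 1)
    return (router_addr << 1) | backup_bit

def decode_destination(da):
    """Split an extended DA back into (router address, backup bit)."""
    return da >> 1, da & 1
```

With PEs connected to more than two routers, the single bit would widen to a small index field; the shift amount grows accordingly.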

The architecture of the PVS router with PE recovery is shown in Fig. 14. It can be observed that each PVS block at the router input side has two inputs and two outputs, whereas the PE recovery block has just one output. It is possible to design all the blocks with a single output, which would result in a 3 × 6 crossbar. However, this would halve the physical output bandwidth. This is acceptable only for the PE recovery block, as the backup path for the PE is used only for packet injection and only in case of fault occurrence.

Fig. 14
figure 14

PE recovery router

The PE recovery block with two inputs and a single output for packet injection into the network is shown in Fig. 15. To receive packets from the network, both PEs need a dedicated link from the crossbar. Thus, a 5 × 6 crossbar can be used for the PE recovery architecture. Compared to a 5 × 5 crossbar, a 5 × 6 crossbar requires one more 5 × 1 multiplexer (M6) without increasing the critical path length, whereas a 6 × 6 crossbar increases the critical path length in addition to the extra gate count.

Fig. 15
figure 15

PE recovery architecture with dual inputs and single output

The insertion of a 2 × 1 multiplexer (Fig. 14: PE recovery architecture) does not affect the critical path of the router if at least two other ports share the VC buffers. This is because the inserted 2 × 1 multiplexer for PE recovery operates in parallel with the other PVS blocks.

7 Simulation Results

To demonstrate the performance characteristics of the proposed architecture (PVS-NoC) under faults, a cycle-accurate NoC simulation environment has been implemented in VHDL. The packets have a fixed length of seven flits, the buffer size is eight flits, and the data width is set to 32 bits. A 5 × 5 2D mesh topology is used for interconnection. Each input port has 4 VCs. With the same parameters, the typical virtual channel and FVS-NoC architectures are analyzed. The static XY wormhole routing algorithm is used for both non-faulty and faulty scenarios. For the faulty scenario, it is assumed that an appropriate fault detection mechanism (test unit) similar to the one used in [12] detects the faulty links and stores the fault information in the configuration registers of the routers connected to the faulty link. In this case, these routers will not send any traffic to the corresponding links and will reroute packets through one of the other adjacent routers by using the fault-tolerant routing algorithm presented in [34].

The PVS approach with the grouping combination (2, 2, 1) is used in the simulation, where ‘1’ represents the buffer dedicated to the local PE. The critical path limits the operating frequency of the PVS router, which is 3.5 % lower than that of the baseline virtual channel router. For the grouping combination (3, 1, 1), the maximum operating frequency is around 9 % lower than that of the baseline virtual channel router.

7.1 Synthetic Traffic

We compare the simulation results in terms of APL and saturation points for two cases: a normal network with no faults using the typical, PVS, and FVS virtual channel management policies, and an example of a faulty network with two faulty links using the typical and PVS virtual channel management schemes, as depicted in Fig. 16. The system performability under link faults with the PVS approach has been discussed in Case 1 of Section 6.

Fig. 16
figure 16

5 × 5 2D-Mesh NoC with two faulty links

In traffic analysis, we have evaluated the performance of the network using latency curves as a function of the packet injection rate. The packet latency is defined as the time duration between the generation of the first flit at the source node and the delivery of the last flit to the destination node. For each simulation, the packet latencies are averaged over 50,000 packets. Latencies are not recorded for the first 5,000 cycles to allow the network to stabilize. In the simulations, uniform, transpose and NED [29] traffic patterns are used.
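The latency accounting described above can be sketched as follows (illustrative helper; the per-packet record format is our assumption):

```python
# Packet latency = delivery time of the last flit at the destination minus
# generation time of the first flit at the source, averaged over packets
# injected after the warm-up period so the network has stabilized.

WARMUP_CYCLES = 5_000

def average_packet_latency(records):
    """records: iterable of (gen_cycle_first_flit, recv_cycle_last_flit)."""
    latencies = [recv - gen for gen, recv in records if gen >= WARMUP_CYCLES]
    return sum(latencies) / len(latencies) if latencies else 0.0
```

In the reported simulations this average is taken over 50,000 delivered packets per run, for each injection rate on the latency curves.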

The latency curves for uniform, transpose and NED traffic patterns are shown in Fig. 17. It can be observed that, for all traffic patterns, the PVS-NoC architecture saturates at higher injection rates than the typical VC architecture, but at slightly lower rates than the FVS-NoC architecture. The proposed architecture manages bandwidth limitations by proper resource utilization and by balancing the load, without increasing the communication resources. The saturation point of PVS-NoC is just before that of FVS-NoC because FVS-NoC achieves higher buffer utilization by sharing the VC buffers among all input ports. However, FVS-NoC is not a power-efficient solution, as verified below with application traffic. For faulty networks, the curves reveal that for all traffic patterns the proposed architecture suffers less performance degradation than the typical architecture. The reason is that the VCs connected to the faulty links are utilized by the other channel, which helps to reduce the average packet latency.

Fig. 17
figure 17

Average packet latency vs. packet injection rate for 5 × 5 Mesh 2D NoC with (2, 2, 1) combination of PVS approach. a Uniform traffic load. b Transpose traffic load. c NED traffic load

7.2 Fault Tolerance for Routing Logic

As discussed in Case 4 of Section 6, if a routing logic fault occurs, the PVS architecture can tolerate the fault and packets do not need to be re-routed. To demonstrate this, a 6 × 6 2D mesh NoC with 2 VCs per port is simulated. The values of the other system parameters are the same as in Section 7. For demonstration, the faulty network with two and four faulty routing logic blocks is shown in Fig. 18.

Fig. 18
figure 18

6 × 6 2D-Mesh NoC with faulty routing logics. a Two faulty routing logics. b Four faulty routing logics

The latency curves with faulty routing logic for the network of Fig. 18 are shown in Fig. 19 for uniform, transpose and NED traffic patterns. It can be observed that, for all traffic patterns, the PVS-NoC architecture saturates at higher injection rates than the typical VC architecture. The reason is that the PVS architecture does not re-route the packets in case of a routing logic fault. Instead, another routing logic block within the sharing group is used to route the packets through the same channel, and thus the average hop count is not increased. For the PVS approach, it can be observed that the performance degrades only slightly as the number of faults increases. On the other hand, in the typical virtual channel architecture, the system performance is severely affected as the number of faults increases.

Fig. 19
figure 19

Average packet latency vs. packet injection rate for 6 × 6 Mesh 2D NoC with fault on routing logics shown in Fig. 18a. a Uniform traffic load. b Transpose traffic load. c NED traffic load.

7.3 Application Benchmark

For application benchmark analysis, the encoding part of a video conference application with sub-applications of an H.264 encoder, an MP3 encoder and an OFDM transmitter is used. The video stream used for simulation contains frames of 300 × 225 pixels, and each pixel consists of 24 bits. Thus, each video frame amounts to 202.5 KBytes and can be broken into 8400 data packets of 7 flits, including the header flit, where each flit is 32 bits wide. The Mem. In Video component generates the 8400 packets for one application cycle, equivalent to one video frame. The frame rate of the video stream is 30 frames/second and the data rate of the encoded video stream is 6167 kbps.

The application graph with 26 nodes is shown in Fig. 20. It consists of processes and data flows; the data is organized in packets. Processes transform input data packets into output packets, whereas packet flows carry data from one process to another. A transaction represents the sending of one data packet by a source process to another, target process, or towards the system output. A packet flow is a tuple of two values (P, T). The first value, P, represents the number of successive, same-size transactions emitted by the same source towards the same destination. The second value, T, is a relative ordering number among the packet flows in the given system. For simulation purposes, all software procedures are already mapped to hardware devices. The application is mapped to a 3 × 3 × 3 3D-mesh NoC. The details of this mapped application model are presented in [17]. Algorithm 1 was used to determine the individual grouping combination for each node according to the bandwidth requirements.

Fig. 20
figure 20

Video conference encoder application [30]

To estimate the power consumption, the high level NoC power simulator presented by [13] has been extended to support 3D-NoC architectures. The power consumption of the interconnection network (NoC switches, bus arbiters, intermediate buffers, and interconnects) is based on 35 nm standard CMOS technology. The simulation results for APL, power consumption and average router silicon area for the video conference encoding application are shown in Table 2.

Table 2 Experimental result for average power consumption and APL of video conference encoder application

The area of the 3D-symmetric-mesh-based routers (with 7 × 7 crossbars) is computed after synthesizing them with the CMOS 65 nm LPLVT STMicroelectronics standard cell library using Synopsys Design Compiler. The results for the average router silicon area of a 3 × 3 × 3 3D NoC are shown in Table 2. Each input port has 4 VC buffers. The buffer size is 8 flits and the data width is set to 32 bits. The average silicon area is reported because different sharing combinations for PVS according to Algorithm 1 have different crossbar sizes and thus different silicon areas. The figures given in the table demonstrate that the area overhead of the proposed PVS technique is more reasonable compared to a fully shared virtual channel technique.

The PVS-NoC shows around 21 % reduction in power consumption and around 7 % reduction in silicon area, but around 6 % higher APL compared to the FVS-NoC architecture. On the other hand, the PVS-NoC shows approximately 22 % reduction in APL, but around 8 % higher power consumption and around 4 % larger silicon area compared to the symmetric 3D-NoC architecture. Thus, the proposed PVS-NoC architecture provides a superior tradeoff between APL, power consumption and silicon area.

7.4 Trade-offs

As discussed in previous sections, the proposed PVS approach makes a trade-off between system performance, area and power consumption. As shown by the synthetic traffic analysis for a 2D mesh NoC, the saturation point of the PVS latency curves comes close to that of the fully shared virtual channel architecture. On the other hand, PVS saturates at a significantly higher packet arrival rate than the typical virtual channel architecture. For uniform traffic load, the PVS architecture does not show a significant improvement in the saturation injection rate compared to the typical VC architecture, because all the resources are equally loaded. For transpose and NED traffic loads, the PVS based network has smaller average packet latency than the typical VC architecture because it is better able to balance link load, as discussed in Section 3.1. PVS with (2, 2, 1) sharing groups requires 7 % less area than FVS. Compared to the typical VC architecture, PVS has only 3 % area overhead. Silicon areas for different PVS sharing combinations, again synthesized on CMOS 65 nm LPLVT STMicroelectronics standard cells using Synopsys Design Compiler, are presented in Table 3. It can be observed that the PVS area increases significantly as the VC sharing group size increases.

Table 3 PVS-NoC router silicon area for different grouping combinations

In case of video conference encoder application, the PVS architecture shows significant reduction in average packet latency with minor overhead of silicon area and power consumption. Moreover, the proposed architecture shows significant reduction in area and power consumption over the FVS architecture with minor overhead of average packet latency.

8 Conclusions

A novel NoC architecture which retains system performance despite the occurrence of faults has been presented. This architecture can also tolerate routing logic faults without requiring additional logic. The presented architecture offers a better tradeoff between resource utilization, system performance and power consumption than conventional VC based architectures. The proposed architecture has been simulated with uniform, transpose, and negative exponential distribution (NED) synthetic traffic, as well as with video conference encoder application traffic. Simulation results show that the average packet latency of PVS-NoC is significantly lower than that of a typical VC based NoC architecture and comes close to an architecture with full buffer sharing (FVS). In the case of the video conference encoder application, the PVS-NoC consumes 21 % less power and takes 7 % less area than FVS. On the other hand, the PVS-NoC consumes only 8 % more power and takes only 4 % more area than a typical virtual channel architecture. We thus conclude that PVS offers a superior trade-off, providing near-maximum performance at near-minimum area and power cost.