# FIFO Optimization for Energy-Performance Trade-off in Mesh-of-Tree Based Network-on-Chip<sup>\*</sup>

Santanu Kundu, T.V. Ramaswamy, and Santanu Chattopadhyay

Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, Kharagpur – 721302, India skundu@ece.iitkgp.ernet.in, elegantfrmsrkr@gmail.com, santanu@ece.iitkgp.ernet.in

**Abstract.** This paper presents an exhaustive study about the impact of FIFO optimization on performance and energy consumption in a Mesh-of-Tree (MoT) based Network-on-Chip (NoC) architecture. A generic NoC router has FIFO at each input and output channel. The paper shows that FIFO is the most energy hungry component in a NoC router. On the design trade-off front, we establish that elimination of FIFO from the output channel reduces energy consumption significantly at the cost of marginal performance degradation.

**Keywords:** network-on-chip, mesh-of-tree, performance and cost evaluation, FIFO optimization, energy-performance trade-off.

# 1 Introduction

Network-on-Chip (NoC) is a new paradigm for designing future many core based SoCs [1] where IP cores communicate with each other with the help of on-chip routers and point-to-point interconnection links. Routers are the main components of interconnection architecture for routing information from source to destination cores. The router has a functionality to store the incoming information and transfer that to any one of its output ports. A generic NoC router has FIFO buffer at every input port to receive incoming data and at every output port to store those data before they are put on the output port [2]. A crossbar switch helps to connect an input buffer to any of the output buffers. There are considerable amount of works that uses both input and output buffering in a router [3–5]. Researchers have also proposed the router architecture having the FIFO buffer only at the input port [6–8]. While [9] reports the buffer placement and sizing to be a very important issue in NoC design, we could not locate any study that clearly proves superiority of one over the other. This motivates us to find out the impact of FIFO on energy and performance.

For NoC, topologies like mesh [10], torus [11], folded torus [12], fully binary tree [13], fat-tree [14], octagon [15], butterfly fat tree (BFT) [5], and Mesh-of-Tree (MoT) [16] have already been proposed. A detailed comparative performance evaluation of a set of recently proposed NoC architectures with Poisson and self-similar traffic models

<sup>&</sup>lt;sup>\*</sup> This work is partially supported by Department of Science and Technology (DST), Govt. of India (SR/S3/EECE/0012/2009), Dt. 20<sup>th</sup> May, 2009.

N. Meghanathan et al. (Eds.): CCSIT 2011, Part I, CCIS 131, pp. 90–100, 2011.

<sup>©</sup> Springer-Verlag Berlin Heidelberg 2011

have been performed in [17]. The authors of [18] have designed MoT based wormhole router architecture to show that MoT performs better than mesh based NoC under Poisson distributed traffic. Although MoT is a potential candidate in interconnection network, its energy consumption profile and area overhead in NoC paradigm is unexplored till date. This also motivates us to perform the energy, area, and performance evaluation by varying the FIFO position and FIFO depth on MoT based network. The contributions of this paper are as follow:

- 1. A detailed performance and energy consumption of MoT based network has been evaluated.
- 2. The trade-off between energy and performance has been carried out by varying FIFO depth and position of MoT based network. The experimental results show that elimination of output buffer reduces the energy consumption significantly at the cost marginal performance degradation.
- 3. The area requirement of MoT based network has been investigated. The area overhead of MoT based networks is quite small compared to the total SoC area.

The rest of this paper is organized as follows: Section 2 describes the methodology for performance evaluation of NoC and associated cost metrics. The experimental results and analysis have been described in Section 3. Section 4 depicts the FIFO optimization for energy-performance trade-off in MoT based network and also focuses the trend of similar study in mesh and BFT based network. The area requirement of MoT based network for different FIFO placement and sizing has also been shown in this section. Section 5 concludes the present work.

# 2 Methodology for Performance Evaluation and Cost Metrics

For evaluating performance of the interconnection network, we have designed a cycle-accurate NoC simulator (in SystemC). The simulator operates at the granularity of individual architectural components of wormhole router. The current version of our simulator supports MoT based network having 32 IP cores where each IP core is taken as a square of side 2.5 mm as considered in [19].



Fig. 1. Hand layout of MoT based NoC having 32 cores

Fig. 1 shows a hand layout of 32 core based NoC mapped onto 20 mm  $\times$  10 mm silicon die in  $4 \times 4$  MoT fashion where 2 layers, horizontal and vertical, are needed for routing. In any tree based topology link length increases with growing network size. As wire delay increases exponentially with its length, in MoT, we have pipelined the links whose length is more than 2.5 mm. The register used for pipelining is shown as small white node, whereas, the routers are denoted as small black nodes in Fig. 1. Thus the length of inter router links and the core to router links are 2.5 mm and 1.25 mm respectively, assuming the inter core spacing is negligible.

Table 1 shows the number of each type of links in MoT network. The router architecture has been implemented in Verilog HDL and synthesized in Synopsys Design Vision in 90 nm technology and the critical path delay is found to be 600 ps. In this paper, we have applied mesochronous clocking strategy to all the routers having same frequency (1.66 GHz) with varying phase. The cores are modeled via traffic generators and traffic receptors. Each traffic generator module generates self-similar traffic [20] as it shows the actual burstiness of the network traffic. In this simulator, the user may also choose between uniform and localized traffic patterns for the packets. Locality factor is denoted by LF and its value is zero for uniformly distributed traffic. It is defined as the ratio of local traffic to total traffic. If LF = 0.5, then 50% of the traffic will go to the nearest cluster from source (S). The rest of the traffic will be distributed according to their distances from the source such that a destination at nearer cluster will get more traffic compared to one at farther cluster. Now, if there is more than one destination in a cluster, the traffic will be randomly distributed among them. In our simulation, we have fixed the packet length to be 64 flits with flit size of 32 bits. The packet injection is continued throughout the entire simulation time of 200,000 cycles of the routers' clock including 10,000 warm-up cycles to make the network stable from initial transient effects. For accurate estimation of the energy consumption by the network, we have assumed that each traffic generator module injects the traffic into the network with a switching activity factor of 0.9 as that is expected to introduce a large number of transitions, and thus power consumption in the network.

| Topology | Number of Links |        |  |
|----------|-----------------|--------|--|
|          | 1.25 mm         | 2.5 mm |  |
| MoT      | 64              | 112    |  |

Table 1. Estimation of link length in MoT topology connecting 32 cores

#### 2.1 Performance Metrics

The performance of an on-chip communication network is characterized by its throughput and average overall latency. Throughput (TP) is the maximum accepted traffic from the network and it is related to the peak data rate sustainable by the network. It is defined as [18],

*TP* = (*Total Packets Received* \* *Packet Length*) / (*Number of IP blocks* \* *Total time*)

Total Packets Received refers to the number of packets that successfully arrive at their destination IPs, Packet Length is measured in terms of flits, Number of IP blocks is the number of functional IP blocks involved in the communication, and Total time

denotes the simulation time (in clock cycles). Network *bandwidth* refers to the maximum number of bits that can be sent successfully to the destination through the network per unit time. It is represented as bits/sec (*bps*).

### BW = (Throughput \* Number of IP cores attached \* flit size) / (clock period)

Depending on the source/destination pair and the routing algorithm, each packet may have a different latency. There is also some overhead in the source and destination that contributes to the overall latency. So, for a given message i, the overall latency  $(L_i)$  is defined as,

#### $L_i$ = sender overhead + transport latency + receiver overhead

Let *P* be the total number of packets reaching their destination IPs, the average overall latency,  $L_{avg}$ , is then calculated as:

$$L_{avg} = \sum_{i=1}^{P} L_i$$

### 2.2 Cost Metrics

*Energy consumption:* Energy consumption of each router is determined by using *Synopsys Prime Power* in *90nm* CMOS technology by running their gate level netlist. We have calculated the number of toggles of every individual I/O pin of the router and their probability of remaining in *logic-1* state for the entire simulation time from our NoC simulator and fed this information to *Prime Power*. We have fixed the clock frequency at *1.66* GHz and simulated it with the following parameters: Process = 1, Supply Voltage = 1V, and Temperature = 75°C.

NoC links have been characterized separately from that of router and have been modeled as semi-global interconnect. Copper wire (resistivity =  $17 n\Omega$ -m) of metal layer 4 has been chosen as interconnection link. The width and thickness of the wire have been taken to be  $0.25 \ \mu$ m and  $0.5 \ \mu$ m respectively. The spacing between two adjacent wires is kept at  $0.25 \ \mu$ m. The spacing between two adjacent metal layers is fixed at  $0.75 \ \mu$ m and is filled by a dielectric material having relative permittivity of 2.9. Fig. 2 shows the cross-section view of interconnect wires sandwiched between two orthogonal metal layers carrying the ground signals. The parasitic resistance, capacitance, and inductance have been extracted by the *Field-Solver* tool from *HSPICE* supporting 90 nm technology with a three wire model.



Fig. 2. Cross-section view of interconnect wires

Energy consumption of the links is determined separately from that of the routers. Energy of a line for different transition can be calculated using the three wire model with the assumption that the coupling effect on the middle line by non adjacent lines is negligible. The input signal is a non ideal signal with rise and fall times of 80 ps and the load capacitance at the other end is 5 fF. Table 2 shows the information about inverter sizing (multiple of minimum sized inverter) and line delay including driver, repeater, and receiver for individual links. Energy consumption of the middle wire has been extracted for all possible 64 transitions on the three wires. This information is used to calculate link energy from the network simulator for 200,000 clock cycles.

| Line    | Driver | Receiver | No. of   | Repeater | Link       |
|---------|--------|----------|----------|----------|------------|
| Length  | Size   | Size     | Repeater | Size     | Delay (ps) |
| 1.25 mm | 30X    | 20X      | -        | -        | 427        |
| 2.5 mm  | 50 X   | -        | 1        | 100 X    | 430        |

Table 2. Buffer sizing and worst case delay in three wire model

*Area requirement:* To estimate the silicon area occupied by each router, we have developed their Verilog models and synthesized using *Synopsys Design Vision* in 90 *nm* CMOS technology supporting *Faraday* library to generate gate-level netlist. The synthesis tool reports the silicon area occupied by the design  $in\mu m^2$ . The inter-router link area estimation has been performed by determining its length from the hand layout of SoC. The width and inter-wire spacing of the links is taken from the interconnect dimension as shown in Fig. 2.

# 3 Experimental Results and Analysis

We have applied the above mentioned evaluation methodology to find out the performance and energy consumption of MoT based network consisting of 32 IP cores by varying offered load in self-similar traffic. FIFO of depth 6 is used at both input and output channels to implement the wormhole router.

### 3.1 Accepted Traffic vs. Offered Load

The accepted traffic depends on the rate at which the IP blocks inject traffic into the network. Ideally, accepted traffic should increase linearly with increasing load at a slope of 45°. However, due to the limitation of routing and arbitration strategy and unavailability of enough buffer space within the wormhole router (FIFO depth << size of the packet), the network suffers from contention. Therefore, the accepted traffic saturates after a certain value of the offered load. Fig. 3a depicts this scenario for uniformly distributed traffic in MoT based network. The maximum accepted traffic where the network is saturated is termed as throughput as defined above and it relates to the maximum sustainable data rate by the network.

### 3.2 Throughput vs. Locality Factor

The effect of traffic spatial localization on network throughput is shown in Fig. 3b. As the locality factor increases, more number of traffic are destined to their local clusters, thus traversing lesser number of hops which in turn increases throughput. For  $4 \times 4$  MoT, the possible distances (*d*) of the destinations from any source are d = 0, 2, 4, 6, and 8. There is only one destination core at the nearest cluster. Moreover, the node degree is less in MoT based network and hence the contention is also less. This explains the increased throughput in MoT with the increase in locality factor.



**Fig. 3.** Performance variation profile. (a) Variation of accepted traffic with offered load with uniform traffic, (b) Variation of Throughput with Locality Factor



Fig. 4. Latency variation profile by varying offered load and locality factor

### 3.3 Average Overall Latency at Different Locality Factors

The average overall latency of any network depends on both the offered load and the locality factor. Fig. 4 shows the average overall latency profile for MoT by varying the offered load and locality factor. It shows that at lower traffic, the latency variation

is not significant. This is due to the fact that at lower traffic, the contention in the network is less, but it will increase as the offered load increases, which in turn increases the latency. The simulation result shows that as the offered load increases towards the network saturation point, the latency increases exponentially which signifies that the packets will take much longer time to reach their destinations. So, it is always desirable to operate the network below its saturation point. The effect of spatial localization of traffic on average overall latency is also studied and is shown in Fig. 4. It has been observed that localization of traffic has significant impact on MoT topology as the latency decreases with increase of locality factor. As the locality factor increases, more traffic will go to local cluster, hence less contention in the network. Therefore, the network will be able to carry more traffic before going to saturation, which in turn enhances the operating point of the network.

#### 3.4 Energy Consumption at Different Locality Factors

Total energy consumption in NoC is the summation of energy consumed by the routers and inter-router communication links. Both these factors are network topology dependent. The total energy consumption of MoT based network for uniformly distributed self-similar traffic has been shown in Fig. 5 (simulation for 200,000 clock cycles with 600 ps clock period has been taken as evaluation parameter).

Fig. 5a shows that the energy consumption increases linearly with the offered load but saturates as the offered load increases to the throughput limit. Beyond saturation, no additional packets can be injected successfully into the network and, consequently, no additional energy is consumed. Fig. 5b depicts the component wise energy consumption of the network at saturation. We have observed that the energy consumption of all the FIFOs is 52% of the total network energy consumption, whereas, all the inter-router links consume only 38% of it. The combined energy consumption of rest of the router components is 10% of the same.

The average energy consumption per cycle of MoT based NoC at saturation with uniformly distributed and localized traffic has been shown in Fig. 6a. It can be noticed that energy consumption decreases as the locality factor increases. This happens due



**Fig. 5.** Total Energy consumption profile in MoT with offered load at uniformly distributed traffic. (a) Network Energy consumed by all Routers and Links, (b) Energy consumed in percentage by FIFOs, wires, and other logics at saturation.

97

to the fact that as the locality factor increases, packets will traverse fewer hops and hence consume lesser switching energy. As FIFO is the most energy consuming component of the network, energy consumption varies with the total number of FIFO used to build the network, or in other words, with the topology. To get an idea about energy spent per packet, we compute the average packet energy. This is another important attribute for characterizing NoC structures. Fig. 6b shows the packet energy consumed at different locality factors at saturation. As the energy consumption profile decreases and the number of accepted traffic increases with increase in locality factor, packet energy consumption also decreases with increase in locality factor.



**Fig. 6**. Average Energy consumption at network saturation for different Locality Factors (a) per Cycle Energy, (b) per Packet Energy

# 4 FIFO Optimization for Energy and Performance Trade-off

In the previous section, it has been observed that FIFO is the most energy consuming element in the network. When contention occurs, the input channel FIFO with higher depth allows more data flits to make progress until an output channel is available. On the other hand, the output channel FIFO with higher depth allows more flits to cross the crossbar when the FIFO of the next router's input channel is full for some cycles. Decreasing the depth of any of the FIFOs will generate full signal frequently and thus the communication affected and hence a negative impact on performance. In this section, we are concentrating on optimizing the FIFO depth and FIFO position for energy and performance trade-off in MoT based network. In our experiment, we have evaluated the throughput, latency, and energy consumption of MoT based network by varying the depth of FIFO in both input and output channels. We have also experimented by eliminating FIFO from the output channel. Table 3 reports the % performance degradation and network energy saving in MoT at saturation for different FIFO size and position at different locality factors. FIFO Depth i-j in 2<sup>nd</sup> column of Table 3 signifies that FIFO at input channel has depth *i* and that at output channel has depth *j* (j=0 signifies no FIFO at output channel). In this study, we have taken FIFO Depth 6-6 (both input and output channel FIFO depths are 6) as the reference.

Table 3 shows that elimination of output channel FIFO in MoT router has very little impact on throughput and latency. As the FIFO depth of all the input channels is 6, after getting the grant signal from any output channel, header will pass to the next router by saving the latency of output channel FIFO. This is diminishing the negative impact of eliminating the output channel FIFO to some extent. Network energy consumption also decreases with reduction of total FIFO size of the router. From Table 3, it can be concluded that elimination of output buffer reduces the energy consumption significantly at the cost marginal performance degradation.

|                       | FIFO  | Locality Factor |         |         |         |
|-----------------------|-------|-----------------|---------|---------|---------|
|                       | Depth | 0               | 0.3     | 0.5     | 0.8     |
|                       | 6-6   | -               | -       | -       | -       |
| %<br>Throughput       | 4-6   | 7.183           | 5.763   | 4.574   | 3.504   |
|                       | 4-4   | 15.024          | 11.318  | 7.944   | 6.261   |
| Degradation           | 6-0   | 1.192           | 1.391   | 0.229   | 0.156   |
|                       | 4-0   | 12.674          | 8.642   | 7.029   | 6.439   |
|                       | 6-6   | -               | -       | -       | -       |
| %<br>Latency          | 4-6   | 44.94           | 58.539  | 53.831  | 102.942 |
|                       | 4-4   | 62.889          | 107.06  | 103.456 | 146.595 |
| Increment             | 6-0   | 13.641          | 12.187  | 11.716  | 14.317  |
|                       | 4-0   | 71.101          | 130.541 | 131.809 | 163.581 |
|                       | 6-6   | -               | -       | -       | -       |
| %<br>Energy<br>Saving | 4-6   | 14.425          | 13.623  | 12.403  | 13.668  |
|                       | 4-4   | 23.93           | 23.206  | 21.516  | 22.235  |
|                       | 6-0   | 36.638          | 36.899  | 36.891  | 36.443  |
|                       | 4-0   | 41.043          | 42.06   | 41.228  | 41.779  |

**Table 3.** Energy Performance Trade-off in MoT at saturation (load = 300)

The final cost metric considered in this study is the overall area requirement to instantiate the MoT based networks. Fig. 7 shows the area requirement for varying FIFO depth and position, in terms of routers and inter-router links. It is observed that



Fig. 7. Area occupied by the routers and wires to form the network of 32 cores having 2.5 mm  $\times$  2.5 mm dimension by varying FIFO depth

the total area occupied by the MoT based network with FIFO Depth 6-6 is 4.10 percent of the total SoC area where each core area is considered to be 6.25 sq. mm. Reduction of FIFO depth or elimination of FIFO from the output channel will reduce the overall area requirement. Therefore, the area overhead of the network is quite small compared to the total SoC area.

# 5 Conclusion

In this work we have explored the performance evaluation and energy consumption of MoT based generic NoC. We have shown that FIFO placement and sizing play crucial roles in determining the performance and energy consumption of NoC. From the experimental results, it can be concluded that instead of reducing the depth of the FIFO, eliminating it completely from the output channel reduces energy consumption significantly at the cost of marginal performance degradation. Moreover, it also reduces the overall area requirement.

# References

- Benini, L., Micheli, G.D.: Network on chips: A new SoC paradigm. IEEE Computer 35(1), 70–78 (2002)
- Jantsch, A., Tenhunen, H.: Networks on chips, pp. 89–90. Kluwer Academic Publishers, Dordrecht (2003)
- Zhang, Y.P., Jeong, T., Chen, F., Wu, H.: A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture. In: Proc. of IEEE International Parallel and Distributed Symposium, IPDPS (2006)
- Wentzlaff, D., Griffin, P., Hoffmann, H., Bao, L., Edwards, B., Ramey, C., Mattina, M., Miao, C.C., Brown, J.F., Agarwal, A.: On-chip interconnection architecture of the TILE processor. IEEE Micro 27(5), 15–31 (2007)
- Pande, P.P., Grecu, C., Ivanov, A., Saleh, R.: High-throughput switch-based interconnect for future SoCs. In: Proc. of IEEE Int'l Workshop on System-on-Chip for Real-Time Applications, pp. 304–310 (2003)
- Kumar, A., Kundu, P., Singh, A.P., Peh, L.S., Jha, N.K.: A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In: Proc. of IEEE International Conference on Computer Design (ICCD), pp. 63–70 (2007)
- Zeferino, C., Susin, A.: SoCIN: a parametric and scalable network on chip. In: Proc. of Symp. on Integrated Circuits and System Design (SBCCI), pp. 169–174 (2003)
- Kavaldjiev, N., Smit, G.J.M., Jansen, P.G.: A Virtual Channel Router for On-Chip Networks. In: Proc. of IEEE Int'l SOC Conference, pp. 289–293. IEEE Computer Society Press, Los Alamitos (2004)
- Ogras, U.Y., Hu, J., Marculescu, R.: Key Research Problems in NoC Design: A Holistic Perspective. In: Proc. of the IEEE/ACM Int'l Conf. on HW-SW Codesign and System Synthesis (2005)
- Kumar, S., Jantsch, A., Soininen, J.P., Forsell, M., Millberg, M., Oberg, J., Tiensyrja, K., Hemani, A.: A Network on Chip Architecture and Design Methodology. In: Proc. of ISVLSI, pp. 117–124 (2002)
- Dally, W.J., Towles, B.: Route Packets, Not Wires: On-Chip Interconnection Networks. In: Proc. of DAC, pp. 684–689 (2001)

- Dally, W.J., Seitz, C.L.: The Torus Routing Chip. Journal of Distributed Computing 1(4), 187–196 (1986)
- Jeang, Y.L., Huang, W.H., Fang, W.F.: A Binary Tree Architecture for Application Specific Network on Chip (ASNOC) Design. In: IEEE Asia-Pacific Conf. on Circuits and Systems, pp. 877–880 (2004)
- Guerrier, P., Greiner, A.: A Generic Architecture for on-chip packet-switched Interconnections. In: Proc. of DATE, pp. 250–256 (2000)
- Karim, F., Nguyen, A., Dey, S.: An Interconnect Architecture for Networking Systems on Chips. IEEE Micro 22(5), 36–45 (2002)
- Kundu, S., Chattopadhyay, S.: 'Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture. In: ACM Great Lake Symposium on VLSI (GLSVLSI), pp. 343–346 (2008)
- Pande, P.P., Grecu, C., Jones, M., Ivanov, A., Saleh, R.: Performance Evaluation and Design Trade-Offs for MP-SOC Interconnect Architectures. IEEE Trans. on Computers 54(8), 1025–1040 (2005)
- Kundu, S., Chattopadhyay, S.: Network-on-chip Architecture Design based on Mesh-of-Tree Deterministic Routing Topology. Int'l Journal of High Performance Systems Architecture 1(3), 163–182 (2008)
- Feero, B.S., Pande, P.P.: Networks-on-Chip in a Three-Dimensional Environment: A Performance Evaluation. IEEE Trans. on Computers 58(1) (2009)
- Park, K., Willinger, W.: Self-Similar Network Traffic and Performance Evaluation. A Wiley-Interscience Publication, John Wiley & Sons, Inc., (2000)