11.1 Introduction

The rapid advancement of semiconductor technologies makes it possible to integrate dozens of cores on a single chip. As the core count grows, on-chip communication architecture design faces challenges in throughput, latency, power consumption, signal integrity, and clock synchronization. Traditional bus-based interconnect architectures are inherently non-scalable and therefore become a bottleneck for on-chip communication. The emerging Network on Chip (NoC) provides an effective, reliable, and flexible infrastructure for system modules based on a packet-based data transmission scheme. It has become an effective solution to the difficulties associated with global interconnections and communications in complex System on Chip (SoC) designs [1].

NoC architectures are constructed using topologies. A topology describes the overall connection forms between routers and resource nodes. The floorplan of a topology determines the length and complexity of the on-chip connections, and as a result, significantly affects the network latency, throughput, area cost and power consumption. Network topologies of NoC can be classified into two categories, regular and irregular architectures. Regular topologies, as used in most NoC designs (e.g., mesh and torus), have the advantage of reusability and low design complexity. However, with regular topologies, applications cannot be well optimized. This may lead to large-scale redundant routers, low link utilization rate, and local congestion. For example, the number of routers on a mesh architecture is fixed irrespective of how many of them are actually used. The same happens to the links between routers. Even if unused routers and links can be shut down, they still occupy area on the chip. Irregular topologies, on the other hand, are designed to be application specific and therefore, are tailorable for each design. Compared to regular topologies, they usually use fewer routers and links, while offering better system performance and lower cost [2].

In this chapter, we focus on network topology generation for the custom irregular architecture. Specifically, we propose a clustering-based topology generation approach for application-specific NoC. Parts of this work, which aims to minimize the network communication power consumption, were presented in [3]. This chapter expands the previous work with a further analysis of the feasibility of addressing application-specific 3D NoC topology generation using the proposed approach.

The rest of the chapter is organized as follows: Sect. 11.2 summarizes related work; Sect. 11.3 describes the problem definitions; Sect. 11.4 presents our topology generation approach with an example; Sect. 11.5 presents the experimental results; Sect. 11.6 discusses a possible extension of the current approach; and Sect. 11.7 concludes the chapter.

11.2 Related Work

There are many advantages of using irregular topologies over regular topologies for application-specific NoC [4]. However, generating irregular topologies calls for scalable topology generation algorithms [5–11]. In [5], the authors present a technique for constraint-driven communication architecture synthesis of point-to-point links. The technique results in network topologies that have only two routers between each source and sink, and does not address routing for each communication trace. The work in [6] presents a mixed integer linear programming (MILP) based topology generation method. However, this method is limited by solution times that grow exponentially for large communication trace graphs. Different optimization techniques have been proposed to address the problem of topology generation within reasonable time [7–10]. In [7] and [8], genetic algorithm based topology generation approaches are proposed, which obtain better results and shorter runtimes than the MILP technique. The author of [10] proposes a combination of depth-first search and the AO* algorithm to generate a near-optimal topology. However, these techniques have high computational complexity because they require a large number of iterations.

In [11], a three-step topology generation algorithm called PATC is presented, which consists of core clustering, cluster optimization, and physical router mapping. The author of [12] proposes a simpler method called TopGen, which clusters the given application based on its communication characteristics and then constructs the topology by connecting the clusters to each other one by one.

In this chapter, we propose a four-phase topology generation approach that is analogous in structure to those used in [11] and [12], but completely different in algorithm design. The proposed approach is evaluated on multimedia benchmarks against a regular NoC topology and against existing algorithms; the results show lower communication power consumption than the regular Mesh topology and PATC, and results comparable to TopGen.

11.3 Problem Formulation and Definitions

An NoC architecture consists of interconnected routers that are responsible for routing data packets on the communication architecture. As shown in Fig. 11.1a, a router is composed of switch fabrics, a routing and arbiter unit, and input and output port modules. Every resource node (IP core) is connected to a router through input and output port channels, which consist of two unidirectional links. Each link can connect to a core through a network interface (NI) implemented with the Open Core Protocol (OCP), or connect directly to other routers to expand the architecture [11], as shown in Fig. 11.1b. In this way, designers can construct different regular or irregular NoC topologies based on the requirements and design constraints.

Fig. 11.1 The router structure and NoC architecture: (a) the router structure of NoC, (b) NoC architecture

Fig. 11.2 The flowchart of the clustering algorithm

The topology generation problem can be formulated as follows.

Given a core communication graph denoted by CCG(C, A), where each vertex c_i ∈ C represents an IP core, and each directed edge a_{i,j} ∈ A represents the communication trace from IP core c_i to IP core c_j. Every edge has two attributes, denoted by b(a_{i,j}) and l(a_{i,j}), which represent the bandwidth requirement in megabits per second (Mbps) and the latency constraint in hops, respectively.

Given a characterized library £ of the router architectures, with η denoting the number of input and output ports of the router, and Ω denoting the peak bandwidth that can be supported by the router on any one port.

Find a NoC topology T(R, E), where R represents the set of routers selected from library £ for the topology, and E represents the set of links between the routers.

Such that:

  (1) Each IP core c is mapped onto a port of a router r, and the number of cores mapped onto a router should be less than η.

  (2) For each a_{i,j} ∈ A, there exists a unique path p_{i,j} = {(r_i, r_k), (r_k, r_m), …, (r_n, r_j)} ∈ P in T that satisfies the communication latency and bandwidth constraints.

  (3) The total communication power consumption is minimized:

$$ \min E(A) = \sum\limits_{{\forall a_{i,j} \in A}} {b(a_{i,j} ) \times E_{bit}^{{c_{i} ,c_{j} }} } $$
(11.1)

where

$$ \begin{aligned} E_{bit}^{{c_{i} ,c_{j} }} &= \sum\limits_{{r \in p_{i,j} }} {E_{Rbit} } + \sum\limits_{{e \in p_{i,j} }} {E_{Lbit} } \\ &= (d(p_{i,j} ) + 1) \times E_{Rbit} + d(p_{i,j} ) \times E_{Lbit} \end{aligned} $$

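\( E_{bit}^{{c_{i} ,c_{j} }} \) represents the energy consumed when one bit of data is transported through the routing path \( p_{i,j} \); \( d(p_{i,j}) \) is the routing distance, that is, the number of links on \( p_{i,j} \), so the path traverses \( d(p_{i,j}) + 1 \) routers; \( E_{Rbit} \) and \( E_{Lbit} \) are the energy consumed per bit on a router and on a link, respectively [11].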

Since \( E_{Rbit} \) and \( E_{Lbit} \) are constants, the NoC power consumption varies linearly with the communication amount and routing distance, which can be represented by:

$$ \min E(A) = \sum\limits_{{\forall a_{i,j} \in A}} {b(a_{i,j} )} \times d(p_{i,j} ) $$
(11.2)
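
The step from (11.1) to (11.2) can be made explicit. As a sketch, assuming \( d(p_{i,j}) \) counts the links on the routing path, as the expression for \( E_{bit}^{{c_{i} ,c_{j} }} \) implies, substituting that expression into (11.1) gives:

$$ E(A) = \left( E_{Rbit} + E_{Lbit} \right) \sum\limits_{{\forall a_{i,j} \in A}} {b(a_{i,j} ) \times d(p_{i,j} )} + E_{Rbit} \sum\limits_{{\forall a_{i,j} \in A}} {b(a_{i,j} )} $$

The second term depends only on the bandwidth requirements of the application and not on the topology, so minimizing \( E(A) \) amounts to minimizing the bandwidth-weighted routing distance in (11.2).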

Therefore, we cluster heavily communicating cores onto the same router so that the data exchanged among these cores travels the shortest possible routing distance, which minimizes the communication power consumption as calculated by (11.2).
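
To make the cost model concrete, the following minimal Python sketch evaluates (11.1) and (11.2) for a candidate set of routing paths. It is an illustration only: the function names, the representation of a path as a list of routers, and the numeric values are assumptions introduced here, not part of the original work.

```python
def routing_distance(path):
    """Number of links d(p) on a path given as a list of routers."""
    return len(path) - 1


def energy_per_bit(path, e_rbit, e_lbit):
    """Per-bit energy of one path: (d + 1) routers and d links."""
    d = routing_distance(path)
    return (d + 1) * e_rbit + d * e_lbit


def total_energy(traces, paths, e_rbit, e_lbit):
    """Objective (11.1): bandwidth-weighted per-bit energy over all traces."""
    return sum(b * energy_per_bit(paths[t], e_rbit, e_lbit)
               for t, b in traces.items())


def weighted_distance(traces, paths):
    """Objective (11.2): bandwidth-weighted routing distance."""
    return sum(b * routing_distance(paths[t]) for t, b in traces.items())


# Hypothetical data: c1 and c2 share router r1, c3 sits on router r2.
traces = {("c1", "c2"): 10, ("c1", "c3"): 4}
paths = {("c1", "c2"): ["r1"], ("c1", "c3"): ["r1", "r2"]}

print(total_energy(traces, paths, e_rbit=2, e_lbit=1))  # 40
print(weighted_distance(traces, paths))                 # 4
```

In this toy case the heavy trace (bandwidth 10) stays inside one router and contributes nothing to (11.2), which is the effect the clustering aims for.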

11.4 Topology Generation Approach

The main idea of our proposed approach is to assign heavily communicating cores to the same router or to nearby routers, and subsequently determine the optimal connection between routers. The goal is to minimize the total number of communication hops between communicating IP core pairs, as well as to reduce the number of routers and links used in the NoC topology. The approach consists of four phases: (1) core clustering, (2) cluster and router mapping, (3) router connection construction, and (4) topology optimization. Each phase is described in detail below.

11.4.1 Core Clustering

In the first phase, we partition the IP core set for a given application into several clusters under the design constraints. The flowchart of the clustering algorithm is shown in Fig. 11.2.

Step 1: Algorithm Preparation. We define a variable N_max, which denotes the maximum number of cores in each cluster. Since IP cores in the same cluster will be mapped to different ports of the same router in a topology, and each router must be connected to the topology through at least one port, N_max = η − 1. Then, we sort the communication traces a_{i,j} in descending order of communication weight b(a_{i,j}).

Step 2: Clusters Initialization. Clustering partitions the vertices of CCG(C, A) into k non-empty sets C_1, C_2, …, C_k. Each cluster C_i (i = 1, 2, …, k) contains at most N_max cores. In the initialization, each vertex of CCG(C, A) forms its own cluster, that is, CP = {C_1, C_2, …, C_N}, where C_i = {c_i}, i = 1, 2, …, N, and N is the number of vertices of the CCG.

Steps 3 and 4: Clusters Merging. Following the order of communication traces obtained in Step 1, we first process the edge a_{i,j} with the highest communication weight. Let a_{i,j} = (c_i, c_j). If c_i and c_j belong to different clusters, and the number of cores in the merged cluster would not exceed N_max, we calculate the inter-cluster communication amount that would result from merging. If this amount is less than the current one, the two clusters are merged; otherwise they are left unchanged. The remaining edges are then processed in the same way.

Step 5: Results Output. When all the edges have been processed in sequence, we obtain the cluster partition, and hence the number of clusters, with the minimum inter-cluster communication amount found by the algorithm.
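
The five steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `cluster_cores`, the dictionary representation of the traces, and the example graph are assumptions made here, and the example is not the CCG of Fig. 11.3a.

```python
def cluster_cores(cores, traces, eta):
    """Greedy core clustering; traces maps (src, dst) to bandwidth."""
    n_max = eta - 1  # Step 1: one router port is reserved for the topology

    # Step 2: every core starts in its own cluster.
    cluster_of = {core: idx for idx, core in enumerate(cores)}
    members = {idx: {core} for idx, core in enumerate(cores)}

    def inter_cluster_amount(assignment):
        # Total bandwidth of traces whose endpoints lie in different clusters.
        return sum(b for (s, t), b in traces.items()
                   if assignment[s] != assignment[t])

    # Step 1 (continued): process traces in descending order of bandwidth.
    ordered = sorted(traces.items(), key=lambda kv: kv[1], reverse=True)

    # Steps 3 and 4: try to merge the clusters joined by each trace.
    for (s, t), _ in ordered:
        cs, ct = cluster_of[s], cluster_of[t]
        if cs == ct:
            continue  # already in the same cluster
        if len(members[cs]) + len(members[ct]) > n_max:
            continue  # merged cluster would not fit on one router
        trial = dict(cluster_of)
        for core in members[ct]:
            trial[core] = cs
        if inter_cluster_amount(trial) < inter_cluster_amount(cluster_of):
            cluster_of = trial
            members[cs] |= members.pop(ct)

    # Step 5: return the resulting partition.
    return list(members.values())


# Hypothetical five-core application with eta = 4 (N_max = 3).
cores = ["a", "b", "c", "d", "e"]
traces = {("a", "b"): 10, ("b", "c"): 8, ("c", "d"): 6, ("d", "e"): 4}
clusters = cluster_cores(cores, traces, eta=4)
print([sorted(c) for c in clusters])  # [['a', 'b', 'c'], ['d', 'e']]
```

Here the trace of weight 6 between c and d is left as inter-cluster traffic because merging would exceed N_max. The sketch recomputes the inter-cluster amount from scratch for every edge, which keeps the code short at the cost of efficiency.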

For example, consider the CCG in Fig. 11.3a, in which the edge labels denote the bandwidth requirements. Assuming the number of router ports η is 4, each cluster contains at most N_max = 4 − 1 = 3 cores. Applying the above clustering algorithm, the CCG is divided into four clusters C_1, C_2, C_3, C_4, as shown in Fig. 11.3b.

Fig. 11.3 Core clustering example: (a) core communication graph, (b) clustering result

11.4.2 Cluster and Router Mapping

In the second phase, we map each cluster to a router. The number of routers used in the generated topology is equal to the number of clusters. Every IP core in a cluster is mapped to a randomly chosen port of the corresponding router.

For the core clustering results shown in Fig. 11.3b, the clusters are mapped to four routers, denoted by r_1, r_2, r_3, r_4, respectively. As shown in Fig. 11.4, the core c_1 in the cluster C_1 is mapped to port 0 of the router r_1, and the cores in the cluster C_2 are mapped to three ports of the router r_2.

Fig. 11.4 Cluster and router mapping

11.4.3 Router Connection Construction

In the third phase, the routers mapped with IP cores are connected to form the initial topology. We sort the clusters in ascending order according to their number of cores. For clusters with the same number of cores, we sort them in descending order according to their communication amount. Then, we use a recursion based link construction algorithm to generate router connections.

Before describing the recursion-based link construction algorithm, it is worth pointing out that the communication amount of a cluster is calculated as the sum of the inter-cluster communication amounts between this cluster and all other clusters. With this ordering, the communication traces with high communication weight are allocated the shortest communication paths first, which in turn minimizes the communication power consumption.
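
The cluster ordering can be illustrated with a short Python sketch. The function names and the example data are hypothetical; clusters are represented as sets of core names and traces as a mapping from (source, destination) pairs to bandwidth.

```python
def communication_amount(cluster, traces):
    """Sum of the bandwidth of all traces that cross the cluster boundary."""
    return sum(b for (s, t), b in traces.items()
               if (s in cluster) != (t in cluster))


def order_clusters(clusters, traces):
    """Ascending number of cores; ties broken by descending communication amount."""
    return sorted(clusters,
                  key=lambda c: (len(c), -communication_amount(c, traces)))


# Hypothetical example: two single-core clusters and one three-core cluster.
clusters = [{"a", "b", "c"}, {"e"}, {"d"}]
traces = {("d", "a"): 5, ("e", "b"): 3, ("a", "b"): 9}
print([sorted(c) for c in order_clusters(clusters, traces)])
# [['d'], ['e'], ['a', 'b', 'c']]
```

The trace between a and b does not count towards any communication amount because both cores lie in the same cluster.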

The idea of our proposed recursion based link construction algorithm is as follows. First, the source and destination routers for each communication trace are obtained according to current router selection and port mapping results; then, under the bandwidth and latency constraints, the following three ways are attempted to recursively search the path from the source router to the destination router:

  (1) Use the existing links between the source and destination routers;

  (2) Use empty ports of the source and destination routers, i.e., ports on which no IP core has been placed, to build a new link between them;

  (3) Use the links built for previous communication traces from the source or destination router to other routers.

Through the above recursive search process, we construct the router connections by allocating a routing path for each communication trace.

The pseudo code of the recursion-based link construction algorithm is shown in Fig. 11.5. The routine get_next_rtr(r_i) returns r_next, a router that is connected to the router r_i. The constructed link between routers r_i and r_next should satisfy the bandwidth and latency constraints. The adjacency matrix RAdj[MR][MR] represents the interconnection relation among routers, where MR is the number of routers used in the topology generation. The initial value of each matrix element is 0, and the value is between 0 and ∞ if there exists a link between the corresponding routers. After paths have been allocated for all the communication traces, each element in RAdj[MR][MR] is checked to ensure that its value does not exceed the supported bandwidth Ω. The port information list PortList records the status of each router port, i.e., whether the port is empty or connected to an IP core or to another router.
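
As a rough illustration of the recursive search described above, and not a transcription of the pseudo code in Fig. 11.5, the following Python sketch allocates one path per communication trace. Several simplifications are assumptions made here: the names `build_links` and `search` are hypothetical, the port list is reduced to a count of free ports per router, `radj` stores the bandwidth allocated on each directed router pair, and the bandwidth limit Ω (`omega`) is checked during the search rather than after all paths have been allocated.

```python
def build_links(num_routers, free_ports, demands, omega):
    """Allocate a path for each (src, dst, bandwidth, max_hops) demand, in order."""
    radj = [[0] * num_routers for _ in range(num_routers)]  # allocated bandwidth
    links = set()             # undirected router-to-router links built so far
    free = list(free_ports)   # free ports left on each router

    def linked(a, b):
        return frozenset((a, b)) in links

    def search(cur, dst, bw, hops_left, visited):
        if cur == dst:
            return [cur]
        if hops_left == 0:
            return None       # latency constraint (in hops) would be violated
        # Try the destination first, then every other router.
        order = [dst] + [r for r in range(num_routers) if r != dst]
        # Reuse existing links that still have enough spare bandwidth.
        for nxt in order:
            if nxt in visited or not linked(cur, nxt):
                continue
            if radj[cur][nxt] + bw > omega:
                continue
            rest = search(nxt, dst, bw, hops_left - 1, visited | {nxt})
            if rest is not None:
                return [cur] + rest
        # Otherwise build a new link through empty ports, undoing it on failure.
        for nxt in order:
            if nxt in visited or linked(cur, nxt):
                continue
            if free[cur] == 0 or free[nxt] == 0:
                continue
            links.add(frozenset((cur, nxt)))
            free[cur] -= 1
            free[nxt] -= 1
            rest = search(nxt, dst, bw, hops_left - 1, visited | {nxt})
            if rest is not None:
                return [cur] + rest
            links.discard(frozenset((cur, nxt)))
            free[cur] += 1
            free[nxt] += 1
        return None

    paths = {}
    for src, dst, bw, max_hops in demands:
        path = search(src, dst, bw, max_hops, {src})
        if path is None:
            raise ValueError(f"no feasible path from router {src} to {dst}")
        for a, b in zip(path, path[1:]):
            radj[a][b] += bw  # commit the bandwidth along the chosen path
        paths[(src, dst)] = path
    return radj, links, paths


# Hypothetical example: three routers with 1, 2 and 1 free ports.
demands = [(0, 1, 4, 2), (2, 1, 3, 2), (0, 2, 2, 2)]
radj, links, paths = build_links(3, [1, 2, 1], demands, omega=10)
print(paths[(0, 2)])  # [0, 1, 2]
```

In the example, the third demand finds no free port for a direct link between routers 0 and 2, so it is routed over the two links that earlier demands created. The search is depth-first and returns the first feasible path, which is not necessarily the shortest one.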

Fig. 11.5 The pseudo code of the link construction algorithm

As an example, clusters C_1 and C_4 contain the same number of cores, as shown in Fig. 11.3b, and the communication amount of cluster C_1 is 5, which is larger than that of cluster C_4. As a result, the routing path for the communication trace between clusters C_1 and C_2 is allocated first, and port 3 is connected to port 5 to construct a routing path. Then, the routing paths for the other two communication traces, between C_4 and C_2 and between C_3 and C_2, are allocated. Once paths have been allocated for all the communication traces, the connections among routers are complete. The initial topology for the mapping results in Fig. 11.4 is shown in Fig. 11.6.

Fig. 11.6 Initial topology

11.4.4 Topology Optimization

The last phase merges adjacent routers that have empty ports until no more adjacent routers can be merged. This further reduces communication power consumption and resource costs. In the example shown in Fig. 11.6, there are empty ports in routers r_1 and r_4; thus router r_1 can be merged with router r_4, leading to the final NoC topology shown in Fig. 11.7.

Fig. 11.7 Final topology

To evaluate the time complexity of our proposed approach, let n be the number of vertices and a be the number of edges in the core communication graph CCG. Since each cluster contains at most n elements and there are at most n clusters, the complexity of the inter-cluster communication amount calculation is O(n^2). All the edges must be traversed, so the time complexity of cluster partitioning is O(a × n^2). Consequently, the overall time complexity of the algorithm is estimated to be O(a × n^2).

11.5 Experimental Results

In this section, we present the experimental results obtained by executing the proposed approach on various multimedia benchmark applications. We generated custom irregular NoC topologies for seven combinations of four multimedia benchmarks: MP3 audio encoder, MP3 audio decoder, H.263 video encoder, and H.263 video decoder [5]. In addition, we obtained results for three other benchmarks: MPEG4 decoder, video object plane decoder (VOPD), and multi-window display (MWD) [2]. Table 11.1 lists the graph IDs and sizes of the CCG of the various benchmarks.

Table 11.1 Graph Characteristics

To evaluate the efficiency of the proposed approach, we compared the results produced by our clustering-based topology generation approach (Cluster-TG) against the solution of mapping the benchmark applications onto a regular Mesh topology. The Mesh topology is selected for comparison because it has been shown to outperform other regular NoC topologies with respect to power consumption and area cost, and it can be easily implemented on chip. The number of router ports η is set to 4, and the supported bandwidth Ω is set to 1 GB/s.

Figure 11.8 compares the communication power consumption of the NoC topologies generated by Random-Mesh, Optimal-Mesh and Cluster-TG. ‘Random-Mesh’ denotes the solution that maps the IP cores of the benchmark applications onto a regular Mesh topology randomly. ‘Optimal-Mesh’ denotes the solution that maps the IP cores onto a regular Mesh topology optimized by the genetic algorithm based approach in [13]. Figure 11.9 compares the numbers of routers and links used. As seen from the figures, our approach achieves considerably lower communication power consumption and resource costs than the regular Mesh topology. On average, our approach saves about 61.5% of communication power consumption compared to Optimal-Mesh.

Fig. 11.8 Communication power consumption comparison

Fig. 11.9 Resource costs comparison: (a) the number of routers, (b) the number of links

Another experiment is conducted to compare the results of two multimedia applications, VOPD and MWD, generated by Cluster-TG, TopGen [12] and PATC [11] respectively. The resource costs of the applications using different approaches turn out to be about the same, and the power consumptions are compared in Fig. 11.10. It can be seen that our proposed approach achieves results that are better than PATC, and commensurate with TopGen. As an example, the CCGs and the generated irregular topologies of the VOPD and MWD benchmarks are illustrated in Figs. 11.11 and 11.12 respectively.

Fig. 11.10 Power consumption comparison for different approaches

Fig. 11.11 The CCG of applications VOPD and MWD: (a) the CCG of VOPD, (b) the CCG of MWD

Fig. 11.12 The final irregular topology of applications VOPD and MWD

11.6 Possible Extension

The advent and increasing viability of 3D silicon integration technology make it possible to scale NoC over the third dimension [14]. As a result, 3D NoC is attracting growing research interest. Our proposed approach can be extended to application-specific 3D topology generation by taking the metrics of 3D NoC into consideration.

In 3D NoC, IP cores are distributed over different 2D layers, and multiple device layers are stacked on top of each other with direct vertical interconnects tunneling through them using through-silicon vias (TSVs). Every IP core is still connected to a router in its own 2D layer. A router connects to other routers in the same layer using horizontal links, and to routers in the adjacent layers using up/down ports and vertical links.

The approach for application-specific 3D NoC topology generation should also consist of four phases: core clustering, cluster and router mapping, router connection construction, and topology optimization. However, the problem introduces new issues, such as the technology constraint on the number of TSVs that can be supported and the need for accurate power models for 3D interconnects.

In the core clustering phase, we first partition the IP core set of a given application (an example CCG is shown in Fig. 11.13a) into several clusters under the constraint on the number of TSVs, and distribute the IP cores of different clusters over different 2D layers, as shown in Fig. 11.13b. Then the IP cores in the same layer are further partitioned into clusters according to the algorithm in Sect. 11.4.1. In the router connection construction phase, the routing paths allocated for communication traces may use the vertical links among routers in adjacent layers, as shown in Fig. 11.13c. The construction of vertical links should meet the constraint on the number of TSVs.

Fig. 11.13 Illustration of the 3D NoC topology generation problem: (a) core communication graph, (b) clustering result, (c) the construction of vertical links

Additionally, the power model in 2D NoC should be extended to 3D NoC by including the power consumed on vertical links.

11.7 Conclusion and Future Work

This chapter presents a four-phase clustering-based topology generation approach for application-specific NoC. The aim is to reduce the network communication power consumption. Under bandwidth and latency constraints, the approach designs custom irregular NoC topologies according to the communication requirements of the given application and the characteristics of the router architectures. Specifically, a recursion-based link construction algorithm embedded in the topology generation is proposed to construct links between routers. Experimental results on various multimedia benchmark applications show significantly lower communication power consumption and resource costs than the regular Mesh topology, and results that are better than or comparable to those of existing algorithms. A detailed analysis of 3D NoC topology generation using our approach is left as future work.