1 Introduction

Today’s embedded systems are becoming more complex, with many cores, accelerators, and Intellectual Property (IP) blocks, offering more modularity, scalability, and processing power than ever before [1, 2]. Nevertheless, these advantages come at the expense of performance for two reasons: (1) activity management, such as task mapping, task movement, Quality of Service (QoS) processing, and system monitoring; and (2) complexity: as the embedded system’s functionality expands, numerous constraints must be met, such as area, throughput, memory, power consumption, and time-to-market. Therefore, interconnections between the various IPs are limited by particular constraints. Hence, inter-IP communication has emerged as one of the significant challenges confronting the performance of modern embedded systems. Traditionally, bus architectures were the solution. However, as the scalability, heterogeneity, and constraints of present-day applications increase, bus architectures fail to meet the requirements, particularly regarding throughput and bandwidth [3,4,5,6]. Networks-on-Chip (NoCs) were developed to address these interconnection challenges and the pitfalls of traditional bus architectures [7,8,9].

The assessment of the correctness and performance of NoC-based architectures involves extensive simulation and hardware emulation techniques [10]. The simulation approach describes the architectural designs in software routines to shorten development time. However, this technique degrades as designs scale up: it slows down as the number of IPs per system design increases, due to the complexity of synchronization and of inter- and intra-IP communication [11,12,13]. Hardware emulation, on the other hand, models fine-grained parallelism effectively and runs much faster than simulators. It defines architectural designs in Hardware Description Languages (HDLs) [14, 15]. It provides cycle-level accuracy and detects design issues early on. It supports exploration and validation of various parameters by re-configuring the corresponding FPGAs without re-synthesizing the whole architecture. Emulation systems, like Veloce [16], combine co-simulation with the transaction-level methodology (co-emulation) [17], where the transactor, interfaced with the DUT, runs on the emulator through the testbench.

This paper proposes a framework to evaluate and verify NoC-based architectures through hardware emulation, where run-time errors are captured, utilizing the Universal Verification Methodology (UVM). UVM is a standard, portable, open-source verification library for evaluating and verifying advanced digital architectures [18]. UVM verification environments can be reused for NoC-based designs with different configurations, network dimensions, and topologies [19, 20]. We propose a framework that auto-generates a scalable NoC-based MPSoC design and its UVM verification environment. It can run in simulation and emulation, and it supports a wide range of NoC configurations, testbench acceleration, and power analysis. We use a RISC-V core as the processor tile to inject real traffic patterns into the NoC through a portable Core Network Interface (CNI). To the best of our knowledge, no previous work has explored using hardware emulation for NoC-based architecture verification through UVM. Below, we list the key contributions of this work.

  1. The NoC-based emulation and verification framework enables functional and timing verification of several NoC-based architectures at different levels of abstraction and configuration [21].

  2. Design-space exploration automation, where the user defines the design space and target performance. The corresponding UVM verification parameters are auto-generated and updated. Then, the framework compiles, emulates, and evaluates the whole NoC-based architecture, and reports the best-performing configuration according to the user’s criteria.

  3. Evaluation and performance analysis for various configurations and parameters, supporting 2D and 3D NoC-based architectures and utilizing real traffic patterns injected by RISC-V PEs (Processing Elements).

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 details the proposed framework and the hardware emulation flow. Section 4 presents the test-cases and experimental results. Finally, Sect. 5 concludes the paper and presents the future work.

2 Related work

In recent years, research has focused on developing simulation and emulation frameworks to verify and evaluate NoC-based designs. Connect [22] and MAIA [23] are open-source NoC verification frameworks centered around Verilog test-cases supporting re-usability and scalability. ATLAS [24] provides simulation and hardware emulation utilizing SystemC for prototyping and verifying NoC architectures. Xu et al. evaluate bus-based and switch-based on-chip networks and analyze system performance at cycle-level accuracy [25]. Cheung et al. present the INSIDE framework, which quickly scans the design space for extensible processors and considers the area and performance constraints of an embedded application [26]. INSIDE estimates the performance of an application, focusing only on the processor behavior context.

Monemi et al. propose the ProNoC framework, an automated tool for rapid prototyping and validation of NoC-based platforms targeting FPGAs [27]. They present an evaluation comparison against other NoC simulators. Busseuil et al. present the Open-Scale platform, a scalable open-source framework that can be used for design space exploration of NoC-based memory MPSoCs [28]. Zhang et al. [29] present an NoC-based homogeneous multi-core framework based on the x86 processor architecture and shared distributed memory. The framework is based on a high-speed Network Interface (NI) using the GEMS and Booksim NoC simulators. However, they do not provide a debugging tool for platform verification or emulation, and do not support the RTL model level.

Balkind et al. present the OpenPiton system, an automated open-source framework based on a general-purpose multi-core processor [30]. OpenPiton supports a complete ad-hoc Verilog verification infrastructure synthesizable to Xilinx FPGAs. It supports homogeneous architectures. Skalicky et al. present a hardware and software co-design structure for MPSoC frameworks, focusing on FPGA prototyping [31]. The configurable platform automatically compiles, synthesizes, and generates heterogeneous systems. Prabhu et al. present an NoC simulator framework based on FPGA acceleration [32]. They evaluate their framework on limited 2D mesh network configurations of \(6\times 6\) and \(8\times 8\) utilizing a five-port router architecture. Ruaro et al. propose the Memphis framework, which supports many-core tiles with a hierarchical organization and connection to the peripherals at the chip borders to feed applications [33]. The framework is cycle-accurate, with a SystemC prototype to accelerate simulation time and a VHDL prototype to synthesize on FPGA.

Table 1 Comparison between the proposed framework and other related MPSoCs design frameworks

In summary, several cycle-accurate simulation frameworks in VHDL and SystemC have been proposed for NoC design space exploration [34,35,36]. However, they cannot inject real-life traces at high speed. Hence, FPGA-based emulation frameworks have been proposed to reduce validation time [37, 38]. However, these frameworks cannot emulate complete, large-scale NoC-based architectures because they are limited by FPGA resources. Besides, they support limited NoC configurations and applications, such as multimedia [39,40,41,42], and none of them uses a RISC-V core. Table 1 compares related works with our proposed framework to clarify the differences. As shown, our framework supports emulation utilizing UVM, multiple router implementations, and application traffic patterns driven by RISC-V.

3 Proposed framework

In this section, we present our hardware-software emulation framework. Our framework: (1) implements a generic, configurable, and portable UVM verification environment that provides accurate simultaneous performance analysis; (2) supports auto-generation for simulation and hardware emulation with different configurations. In brief, the framework automatically generates the Hardware Description Language (HDL) and Hardware Verification Language (HVL) models, then compiles, simulates, synthesizes, emulates, and reports the final performance results; (3) enables evaluation and verification of large-scale MPSoCs, including 2D and 3D NoC-based architectures. It supports real traffic patterns utilizing RISC-V as a processor tile and connects external peripherals through the AXI bus interface; (4) evaluates the performance of different parameters, configurations, and real-life applications. Fig. 1 illustrates the three main layers: (a) the hardware NoC-based architecture layer, (b) the UVM and emulation layer, and (c) the software layer. Below, we describe each layer in detail.

Fig. 1 Proposed hardware emulation framework

3.1 Hardware NoC-based architecture layer

Figure 2 details the hardware NoC-based MPSoC layer. It is divided into three main components: (1) the RISC-V core PE, which injects and collects packets according to the real traffic patterns generated by applications; (2) the CNI, which interfaces the RISC-V PE with the NoC router; (3) the NoC architecture, an auto-generated network-on-chip with different configurations and parameters (2D and 3D, topology, buffer size, virtual channels, etc.).

Fig. 2 Detailed architecture of hardware NoC-based MPSoC building block

3.1.1 RISC-V processor tile

The application or user-configured traffic pattern is written in C and compiled for the RI5CY core with a custom GCC tool-chain. Accelerators, co-processors, and different I/O peripherals communicate with the RISC-V core through the AXI4 bus interface, as depicted in Fig. 2 [43, 44]. Although this work principally centers on the RI5CY core, other IP cores can easily be used, provided the application is recompiled. Notably, our framework is the only one that supports the AXI bus interface with processor tiles, which facilitates plug-and-play replacement of different processor tiles or IPs. Other frameworks use a custom direct-memory interface; e.g., ProNoC [27] implements a Wishbone interface. Besides, a Debug Access Port (DAP) is implemented to provide debugger access to system components through the Joint Test Action Group (JTAG) interface.

3.1.2 Core network interface (CNI)

The Core Network Interface (CNI) [45] is modified to support 2D and 3D NoC architectures. The CNI connects the processor tile to the auto-generated NoC-based architecture. Its functions are to: (1) fetch the packets from the read-write register-bank; (2) format the packets with designated control information, such as the source, destination, packet length, packet ID, interval cycle, and others according to the NoC configuration; (3) divide the packets into an appropriate number of flits and compose head, body, and tail flits; (4) inject the flits into the input port at the source router; (5) collect the flits from the output port at the destination router; (6) reassemble the flits into packets; (7) store the collected packets in the read-only register-bank until the processor tile reads them through the AXI4 bus interface; (8) handle acknowledgments and synchronization between the router ports and the RISC-V PE. As shown in Fig. 2, the CNI is divided into five main modules: register-bank, source controller, sink controller, NoC injector, and collector.

  • CNI registers. CNI registers are divided into: (1) read-write registers, (2) source data and control registers for packet injection, (3) read-only sink data and control registers for packet collection.

  • CNI controller. The CNI controller is connected to the router’s local input and output ports, and interfaces the CNI registers with the NoC injector and collector. It is responsible for synchronizing, formatting, dividing the packets into flits, sourcing, and sinking the packets between the processor tile and the NoC injector and collector.

  • Network injector and collector. The NoC injector and collector manage the transmission and reception of flits and the communication with the NoC local ports.
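As an illustration, CNI functions (2), (3), and (6) can be sketched in Python as follows. This is a minimal behavioral sketch under stated assumptions, not the CNI's RTL: the `Flit` record, the header layout `(src, dst, length)`, and the `words_per_flit` parameter are hypothetical names introduced for the example, and the tail flit is assumed to carry no data.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    kind: str        # "head", "body", or "tail"
    packet_id: int
    payload: tuple   # control fields for the head flit, data words for body flits


def packetize(packet_id, src, dst, data, words_per_flit=2):
    """Split one packet into a head flit, body flits, and a tail flit."""
    # Assumed layout: the head flit carries (source, destination, length).
    flits = [Flit("head", packet_id, (src, dst, len(data)))]
    # Body flits carry the packet data, words_per_flit words at a time.
    for i in range(0, len(data), words_per_flit):
        flits.append(Flit("body", packet_id, tuple(data[i:i + words_per_flit])))
    # The tail flit only marks the end of the packet.
    flits.append(Flit("tail", packet_id, ()))
    return flits


def depacketize(flits):
    """Rebuild (src, dst, data) from the flits of a single packet."""
    if len(flits) < 2 or flits[0].kind != "head" or flits[-1].kind != "tail":
        raise ValueError("malformed packet")
    src, dst, length = flits[0].payload
    data = [word for flit in flits[1:-1] for word in flit.payload]
    if len(data) != length:
        raise ValueError("packet length does not match head flit")
    return src, dst, data
```

A round trip through `packetize` and `depacketize` should return the original source, destination, and data, which is the property a sink-side check would rely on.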

3.1.3 2D and 3D NoC router architectures

Our flexible framework supports different NoC-based configurations utilizing various router architectures to facilitate design space exploration. Below, we present four router architectures: (1) Daniel’s router, (2) distributed/centralized networks-based Conventional Buffer (CB) router, (3) Round-Robin Flexible Buffer (RRFB) router, and (4) distributed/centralized NoC-based Virtual Channel Conventional Buffer (VCCB) router. A brief description of each router is as follows:

  1. Daniel’s router [46]. It is an open-source, multi-functional, flit-based, fully synthesizable RTL router. It supports different NoC configurations, which are presented in Table 2. We modified the router architecture to support 3D mesh and torus topologies: two input and output ports for the up and down directions are added, and the control unit is modified, as shown in Fig. 3.

  2. Conventional buffer (CB) router [47]. It represents the base router. It has input and output ports connected by an intermediate crossbar, with five ports in 2D NoCs and seven in 3D NoCs, as shown in Fig. 4. The input port has three main blocks: (1) FIFO buffer: stores the incoming packets from the upstream router; (2) Input controller: handles handshaking control from/to the upstream router and communicates with output ports to transfer the received packets; and (3) Routing logic module: applies the routing algorithm to determine the packet destination. The output port has: (1) Arbiter: handles all requests received from the input ports connected to the crossbar and grants bus allocation; (2) Output controller: communicates with the downstream router. The crossbar switches the packets from the upstream router to the downstream router based on the arbiter decision.

  3. Round-Robin flexible buffer (RRFB) router [48]. It has the same architecture as the CB router with an additional unit, the FIFO Flexibility Controller (FFC), at each input port. It operates like the CB router until congestion occurs. At that point, the flexible router does not wait for free slots in the input-port FIFO; instead, the FFC unit searches for a free slot at other ports. Once it finds one, it grants the request back to the upstream router, and the packet is transferred to the FIFO of the selected port. Afterward, the RRFB router operates normally like the CB router. Figure 4 illustrates the difference between the CB and RRFB architectures.

  4. Virtual channel conventional buffering (VCCB) router [9]. It adopts virtual channel flow control to improve NoC performance and resolve congestion: packets travel on a flit basis, and each virtual channel stores flits on a per-packet basis. The packet is divided into several flits: (1) the head flit contains the source, destination address, and selected flow control, (2) the body flit carries all packet data, and (3) the tail flit indicates the end of the packet.
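The flexible-buffer behavior described for the RRFB router can be sketched as follows. This is a simplified behavioral model under stated assumptions, not the router's RTL: the class name, the port labels, and the exact round-robin visiting order are hypothetical, and the routing-dependent restrictions on which buffers may be borrowed are deliberately omitted.

```python
from collections import deque


class FlexibleBufferRouter:
    """Toy model of input FIFOs that can borrow free slots from other ports."""

    def __init__(self, ports, depth):
        self.ports = list(ports)
        self.depth = depth
        self.fifos = {port: deque() for port in self.ports}

    def accept(self, in_port, flit):
        """Store a flit arriving on in_port.

        Returns the port whose FIFO took the flit, or None if every FIFO is
        full (the request to the upstream router is then not granted).
        """
        start = self.ports.index(in_port)
        # Try the flit's own input FIFO first, then the other ports in
        # round-robin order (assumed order: next port in the list, wrapping).
        for offset in range(len(self.ports)):
            port = self.ports[(start + offset) % len(self.ports)]
            if len(self.fifos[port]) < self.depth:
                self.fifos[port].append(flit)
                return port
        return None
```

With one-slot FIFOs, a second flit arriving on a congested port lands in a neighboring port's FIFO instead of stalling, which is the case where the flexible router departs from the CB router.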

Fig. 3 3D Daniel router basic micro-architecture

Fig. 4 2D and 3D micro-architectures for CB and RRFB routers. The 2D CB router architecture is shown in black, the 2D RRFB in black and green, the 3D CB in black and red, and the 3D RRFB in black, green, red, and blue. (Color figure online)

Table 2 Daniel’s router parameters and configurations

3.1.4 RRFB architecture and deadlock analysis

The RRFB router supports dynamic buffering, where any FIFO can store any incoming flit. Such dynamism comes at the expense of deadlocks. Several models of deadlock situations are studied; two of them are presented in Fig. 5. In every deadlock model, all buffers are full, and only the head flit is shown. For example, in Fig. 5a, all buffers are full in routers \(R_1\) and \(R_2\); all head flits of each queue buffer in \(R_1\) are moving east to \(R_2\), while all head flits in \(R_2\) are moving west to \(R_1\) at the same moment. A more complex example arises in the 3D NoC, where east, south, and down denote movement toward positive X, Y, and Z coordinates, respectively. In Fig. 5b, the deadlock cycles are formed in the XY plane; all routers are in the same Z plane. As shown, all flits head toward either E, N, S, or W. Two cycles are formed, one clockwise (\(\hbox {E}\rightarrow \hbox {S}\rightarrow \hbox {W}\rightarrow \hbox {N}\rightarrow \hbox {E}\)) and another counter-clockwise (\(\hbox {N}\rightarrow \hbox {W}\rightarrow \hbox {S}\rightarrow \hbox {E}\rightarrow \hbox {N}\)). No flit can advance in any direction because of the deadlock.
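A deadlock of this kind corresponds to a cycle in a wait-for graph, in which an edge from one router to another means that a full buffer in the first is waiting for space in the second. The sketch below detects such a cycle with a depth-first search; it is an illustrative abstraction (routers as graph nodes, hypothetical function name), not the checking logic used in the framework.

```python
def find_wait_cycle(waits_for):
    """Return one cycle of a wait-for graph as a list of nodes, or None.

    waits_for maps each router to the routers whose buffers it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {node: WHITE for node in waits_for}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:             # back edge: nxt is already on the path
                return path[path.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(waits_for):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For the two-router case of Fig. 5a, the graph `{"R1": ["R2"], "R2": ["R1"]}` contains the cycle R1, R2, whereas removing either dependency leaves the graph acyclic.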

Fig. 5 RRFB architecture deadlock analysis: (a) the simplest deadlock example for RRFB; (b) more complex examples

Fig. 6 XYZ-based restrictions for RRFB router architecture

Table 3 XYZ-based CB and XYZ-based RRFB restrictions; allowed and forbidden next hop directions

The RRFB architectures are developed to resolve all such deadlock models. Under certain conditions, they refrain from storing incoming flits in the associated input-port buffer and instead store them in free buffers of other ports. The deadlock problem is carefully analyzed under (1) XYZ, (2) West-First, and (3) Negative-First routing algorithms, as follows:

  1. XYZ-based flexible scheme. The XY turn-model is extended to the new 3D-NoC paths, i.e., up and down. For example, for a set of routers located in the XZ plane with the same Y coordinate, it is forbidden for a flit to move down, then east or west. Also, it is forbidden for a flit to move up, then east or west. The same rules are applied to the routers in the YZ plane. The extended XYZ turn-model is shown in Fig. 6a–c. As shown, there is one forbidden turn to break the deadlock possibility in each cycle. Table 3 presents the restrictions of the XYZ-based RRFB router. By applying these restrictions, the XYZ-based RRFB router prevents deadlocks under XYZ routing.

  2. West First-based flexible scheme. Similarly, the West-First (WF) turn-model [49] is extended to 3D NoCs: turns from any direction followed by a move toward the west are forbidden. For example, in an XY plane, moving north/south then west is forbidden. In the XZ plane, north and south are mapped to up and down, respectively, and the same rule applies: moving from up/down to west is forbidden. For the YZ plane, moving from up/down to the south is forbidden, since the south plays the role of the west there. The extended WF turn-model is shown in Fig. 7a–c. There is one forbidden turn to break the deadlock possibility in each cycle. The restrictions of the WF-based RRFB router are presented in Table 4.

  3. Negative First-based flexible scheme. In the Negative-First (NegF) routing algorithm, turns from negative to positive directions are forbidden. In order to determine the direction sign, the location of the origin (router (0, 0, 0)) must be defined first. East, south, and down are all positive directions, while west, north, and up are negative. Therefore, the following turns are all prohibited: \(\overset{+}{\text {E}} \rightarrow \overset{-}{\text {N}}\), \(\overset{+}{\text {E}}\rightarrow \overset{-}{\text {U}}\), \(\overset{+}{\text {S}}\rightarrow \overset{-}{\text {W}}\), \(\overset{+}{\text {S}}\rightarrow \overset{-}{\text {U}}\), \(\overset{+}{\text {D}}\rightarrow \overset{-}{\text {W}}\), and \(\overset{+}{\text {D}}\rightarrow \overset{-}{\text {N}}\). Table 5 lists the restrictions of NegF-based CB and NegF-based RRFB, which are illustrated in Fig. 8a–c.
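The direction conventions above can be made concrete with a short Python sketch. It encodes east, south, and down as the positive X, Y, and Z directions, enumerates the six prohibited turns exactly as listed for the Negative-First scheme, and computes the next hop under dimension-ordered XYZ routing. The function and table names are hypothetical, and the sketch covers only these two rules, not the full restriction tables.

```python
# Unit steps: east, south, and down are the positive X, Y, and Z directions.
STEP = {
    "E": (1, 0, 0), "W": (-1, 0, 0),
    "S": (0, 1, 0), "N": (0, -1, 0),
    "D": (0, 0, 1), "U": (0, 0, -1),
}


def sign(direction):
    """+1 for a positive direction (E, S, D), -1 for a negative one (W, N, U)."""
    return sum(STEP[direction])


def axis(direction):
    """Index of the axis (0 = X, 1 = Y, 2 = Z) along which the direction moves."""
    return next(i for i, v in enumerate(STEP[direction]) if v != 0)


def listed_prohibited_turns():
    """Turns from a positive direction to a negative one on a different axis."""
    return {(a, b) for a in STEP for b in STEP
            if sign(a) > 0 and sign(b) < 0 and axis(a) != axis(b)}


def xyz_next_hop(cur, dst):
    """Dimension-ordered routing: resolve X, then Y, then Z; 'L' means local."""
    for ax, (pos, neg) in enumerate((("E", "W"), ("S", "N"), ("D", "U"))):
        if dst[ax] > cur[ax]:
            return pos
        if dst[ax] < cur[ax]:
            return neg
    return "L"
```

Because `xyz_next_hop` exhausts the X offset before Y and the Y offset before Z, a route produced by it never moves up or down and then east or west, which matches the forbidden turns described for the XYZ-based scheme.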

Fig. 7 WF-based restrictions for RRFB router architecture

Table 4 WF-based CB and WF-based RRFB restrictions; allowed and forbidden next hop directions

Fig. 8 NegF-based restrictions for RRFB router architecture

Table 5 NegF-based CB and NegF-based RRFB restrictions; allowed and forbidden next hop directions

Fig. 9 Centralized network-based CB and RRFB router architectures

3.1.5 Centralized and distributed NoC-based architectures

Our framework supports centralized and distributed configurations. We modified the routers as follows:

  1. CB and RRFB architectures. Centralized NoC-based CB and RRFB architectures are implemented. The routers have the same modules as in the distributed NoC (Fig. 4). However, all ports are connected to the PEs directly, as illustrated in Fig. 9.

  2. 2D/3D VCCB architectures. The centralized NoC-based VCCB architecture is shown in Fig. 10; each input/output port is connected directly to a PE for real traffic pattern acceleration. On the other hand, Fig. 11 illustrates the distributed NoC-based VCCB architecture. A 3D VCCB router architecture is implemented, as depicted in Fig. 11.

Fig. 10 Centralized network-based VCCB router architecture

Fig. 11 Distributed 2D/3D network-based VCCB router architecture. The 2D architecture comprises only the blocks shown in black; the 3D architecture comprises the blocks shown in both black and red. (Color figure online)

3.2 UVM and emulation layer

This work explores hardware emulation to verify NoC-based architectures via UVM. A hardware emulation platform, such as Siemens’s Veloce [16] or Synopsys’s ZeBu ASIC [50], enhances evaluation over conventional simulators. It facilitates functional verification and virtual prototyping for complex architectures. It provides accurate measurements for the functional behavior and functional coverage of each module [51]. A hardware emulator is composed of an array of FPGAs. Initially, the design behavior is described in HDL and synthesized to a gate-level netlist by the RTL compiler/synthesis tool. Then, the design is mapped to a crystal chip, an advanced FPGA with additional memories and control and debug facilities. As the design size increases, it is mapped to multiple crystal chips on the Advanced Verification Board (AVB). The number of AVBs determines the emulator capacity.

3.2.1 UVM environment for emulation

We aim to improve the accuracy and accelerate the evaluation of complex, large-scale NoC-based architectures. In this context, our framework automates design space exploration: the user sets the target performance and configuration parameters, and the framework generates the corresponding test scenarios and the related UVM and hardware emulation implementation.

Our framework facilitates integrating the UVM environment with NoC-based architectures. The UVM environment depends on HDL and Hardware Verification Language (HVL) TOP modules. The HDL TOP module describes the RTL design, and the HVL TOP module describes the UVM testbench environment. The HVL TOP is an untimed, class-based, behavioral, and dynamic architecture. The communication between both TOPs is performed through transaction model-based communication, where:

  • A physical communication link is set up between the software-based simulator and the hardware emulator; it carries data packets rather than transaction objects.

  • The amount of traffic on the physical link must be controlled to achieve the best execution time.

Any UVM verification modules added to the hardware must pass the synthesis process. However, the UVM testbench is based on Object-Oriented Programming (OOP) and utilizes SystemVerilog constructs and classes that are not synthesizable and cannot be implemented on the emulator. Thus, these constructs and classes run in software simulation, separated from the synthesizable RTL modules on the hardware emulator. Hence,

  • Bus Functional Model (BFM) is utilized to interface the transactor untimed fragment in the HVL space to the HDL space. The testbench proxy and the corresponding HDL BFM must be interfaced at the Transaction Level Model (TLM) to communicate between the software-based simulator and the hardware emulator [52].

  • Time-plan constructs, including synchronization with a clock or other delays, should be removed as they block performance evaluation of the hardware emulation. They should be implemented using a synchronous clock model in the HDL time-space [53].

  • The verification environment objects that overcome any obstruction between the RTL and the testbench—i.e., monitors and drivers—should be included in timed and untimed modules.

3.2.2 Proposed UVM architecture for NoC emulation

The proposed emulation UVM environment, shown in Fig. 12, is developed based on our previous related works [54,55,56]. The developed UVM environment performs two main tasks: (1) performance evaluation for all described router architectures with different NoC parameters and configurations, (2) functional verification and debugging for deadlock cycles and network congestion. Our UVM environment is flexible and generic and supports the AXI interface to automate design space exploration. The user only needs to define a configuration (router, PE, NoC sizes, NoC dimension, etc.). The corresponding UVM environment is then auto-generated and connected to the RTL architecture, and the framework identifies the best-performing configuration based on the user’s criteria. The main developed UVM components are:

  • Test establishes verification scenarios according to the test plan, connects the DUT to the verification environment through virtual interfaces, and generates the system clock.

  • Environment is the parent of all hierarchical verification modules. It instantiates multiple active and passive agents, including agent configuration objects, subscriber modules like ScoreBoard (SB) and Coverage Collector (CC), and sequences.

  • Agent. Each UVM environment can include multiple active/passive agents. The active agent encapsulates sequencer, driver, and monitor modules, while the passive agent has only a monitor module. The proposed UVM architecture has three types of agents:

    1. The active source agent drives and monitors the packets on the router local-port.

    2. The passive sink agent monitors the signal activities on the router local-port.

    3. The passive routing agent monitors the signal activities on the other router ports (east, west, north, and south in 2D).

  • Driver is an active module that drives the input signals of the router local-port. Based on the user configurations for application-based traffic, the driver generates the corresponding injected patterns in hex file format to be loaded to PEs’ RAMs, as shown in Fig. 12. The driver passes the packets with their details into a backdoor access function to inject/collect packets to/from the PE. Backdoor access function forces assigning values to RTL modules or software routines.

  • Monitor captures the signal activity of the DUT interface and converts it to the transaction level. It has TLM analysis ports to broadcast the captured transactions to other components such as subscribers. The proposed UVM environment provides four types of monitors implemented in different agents:

    1. The source monitor, located in the active NoC source agents, captures the signal activity on the input local-port for sent packets.

    2. The response sink monitor, located in the passive NoC sink agents, captures the signal activity on the output local-ports for the received packets.

    3. The response routing monitor, located in passive routing agents, captures the signal activity on the other output ports (east, west, north, and south) to track the packet routing path.

    4. The response AXI monitor, located in passive AXI agents, captures the signal activity related to writing/reading packets to/from the CNI.

    All captured data are forwarded to the subscribers and collected to compute the evaluation results.

  • Sequence creates the scenarios sent to the driver in the transaction format. We developed several sequences for synthetic traffic patterns, such as uniform, bit-complement, and transpose, and for application-based traffic patterns, such as Digital Video Object Plane Decoder (DVOPD) and Moving Picture Experts Group (MPEG4). These sequences randomize the delays between packets to cover additional injection and throughput rates.

  • Sequencer connects the sequence with the driver, and generates the data transactions.

  • Subscribers (SB and CC). SB checks and verifies the functionality of the DUT. SB reads the injected data, compares it with the received data, and calculates the overall performance results in terms of throughput and latency. It compares the AXI data written to and read from the CNIs with the sent packets to validate the functionality of the PE and CNI. It verifies the route of each packet by tracking it through the routing monitors. CC evaluates the testability coverage of the NoC-based MPSoCs by collecting the functional coverage for all scenarios to check whether any scenario is missing.
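The scoreboard's comparison and metric calculation can be sketched in Python as follows. This is an illustrative model only: the record format (packet ID mapped to a cycle stamp and payload), the function name, and the metric definitions (average latency in cycles, accepted throughput in packets per cycle per node) are assumptions made for the example rather than the definitions used by the UVM scoreboard.

```python
def score(sent, received, cycles, nodes):
    """Compare injected and received packets and derive simple metrics.

    sent / received: dict mapping packet_id -> (cycle, payload).
    cycles: length of the measurement window; nodes: number of PEs.
    """
    # Packets that were injected but never reached a sink monitor.
    missing = sorted(set(sent) - set(received))
    # Packets delivered with a payload that differs from what was injected.
    corrupted = sorted(pid for pid in received
                       if pid in sent and received[pid][1] != sent[pid][1])
    # Latency of every delivered packet, in cycles.
    latencies = [received[pid][0] - sent[pid][0]
                 for pid in received if pid in sent]
    avg_latency = sum(latencies) / len(latencies) if latencies else None
    # Assumed throughput definition: delivered packets per cycle per node.
    throughput = len(latencies) / (cycles * nodes)
    return {"missing": missing, "corrupted": corrupted,
            "avg_latency": avg_latency, "throughput": throughput}
```

A non-empty `missing` list after the network has drained is the kind of symptom that would prompt a closer look at the routing monitors for congestion or a deadlock cycle.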

In brief, the proposed UVM emulation environment is flexible and generic and has the following duties: (1) evaluate the performance of the MPSoC architectures with different parameters and configurations, (2) accelerate evaluation and discover corner-case bugs, (3) examine and catch various deadlock models and defined errors.

Fig. 12 Proposed generic UVM environment for NoC emulation

3.3 Software layer

The software layer is responsible for the emulation process to implement and evaluate NoC-based MPSoC architectures. Initially, it parses the user configurations, such as IPs, traffic patterns, NoC size, topologies, etc. Next, it auto-builds the corresponding UVM environment. Then, it performs the evaluation and result collection. Finally, it recommends the best configuration within the design space. The software layer consists of four modules: (1) NoC and UVM configuration and generation, (2) traffic patterns generator and controller, (3) emulation flow for design-space exploration, and (4) performance analysis, as shown in Fig. 1.

3.3.1 NoC and UVM configuration and generation

A software tool based on Perl scripts is developed to read and parse the user configurations, then auto-generate, build, and connect the implemented NoC-based architecture. Besides, it sets the design space exploration parameters, configures the corresponding UVM environment, and generates the top environment module (Env_Top) shown in Fig. 12.
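To illustrate the "connect" step, the following Python sketch derives the router-to-router links of a 2D or 3D mesh or torus from the network dimensions. It is a hypothetical stand-in for the generation logic, not a translation of the Perl scripts: the function name, the coordinate convention (a 2D network is written with a Z size of 1), and the direction labels are assumptions.

```python
from itertools import product

# Unit steps per direction; east, south, and down are the positive directions.
DIRS = {
    "E": (1, 0, 0), "W": (-1, 0, 0),
    "S": (0, 1, 0), "N": (0, -1, 0),
    "D": (0, 0, 1), "U": (0, 0, -1),
}


def build_links(dims, topology="mesh"):
    """Map each router coordinate to {direction: neighbouring coordinate}."""
    links = {}
    for node in product(*(range(d) for d in dims)):
        out = {}
        for name, step in DIRS.items():
            nxt = tuple(c + s for c, s in zip(node, step))
            if topology == "torus":
                # Wrap around at the network edges.
                nxt = tuple(c % d for c, d in zip(nxt, dims))
            inside = all(0 <= c < d for c, d in zip(nxt, dims))
            # Skip links that leave the mesh or wrap onto the router itself
            # (the latter happens along a dimension of size 1).
            if inside and nxt != node:
                out[name] = nxt
        links[node] = out
    return links
```

An interior router of a 3D mesh gets six neighbor links, which together with the local port gives the seven ports mentioned for 3D routers; an interior 2D router gets four, giving five ports.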

3.3.2 Traffic patterns generator and controller

The traffic generator injects packets according to the adopted traffic pattern. The test layer supports application-based and synthetic traffic patterns, such as uniform, hot-spot, transpose, bit-shuffle, bit-rotation, bit-reversal, tornado, and neighbor traffic patterns [7].
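For reference, several of these synthetic patterns can be expressed as functions from a source node ID to a destination node ID. The sketch below uses the commonly used bit-permutation formulations; it assumes node IDs that are `bits` wide (and an even `bits` for transpose), and the function names are chosen for the example rather than taken from the framework.

```python
import random


def bit_complement(src, bits):
    """Destination is the bitwise complement of the source ID."""
    return ~src & ((1 << bits) - 1)


def bit_reversal(src, bits):
    """Destination is the source ID with its bit order reversed."""
    return int(format(src, f"0{bits}b")[::-1], 2)


def transpose(src, bits):
    """Destination swaps the upper and lower halves of the source ID bits."""
    half = bits // 2
    mask = (1 << half) - 1
    return ((src & mask) << half) | (src >> half)


def uniform(src, bits, rng=random):
    """Destination is drawn uniformly from all nodes (may equal the source)."""
    return rng.randrange(1 << bits)
```

For a 64-node network (`bits = 6`), node 5 sends to node 58 under bit-complement, and node 7 sends to node 56 under transpose.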

3.3.3 Emulation flow and design-space exploration

This module performs emulation flow to auto-configure, run, and control the emulation, as described in Fig. 13. The emulation flow is as follows:

  1. Select the router architecture (Daniel, CB, RRFB, or VCCB) and set the user-defined configurations for the NoC parameters.

  2. Generate the configured NoC-based MPSoC architecture and its corresponding UVM environment.

  3. Compile the injector and collector RISC-V software files for the user-defined traffic pattern.

  4. Load the generated data memory of each core into the hardware emulator.

  5. Perform testbench-acceleration co-modeling by running the UVM environment with the generated NoC-based architecture on the emulator, then check for any congestion or bus routing failure.

  6. Emulate the NoC-based architecture on the hardware emulator under the pre-configured traffic patterns.

  7. Analyze the results in terms of network latency, network throughput, maximum energy, and power consumption [57, 58], as well as the power overheads consumed in: (a) PE to CNI, and (b) CNI to network input ports.

Fig. 13 Proposed emulation flow for our framework
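The analysis step of the flow reduces per-packet records to latency and throughput figures. The sketch below shows one common way to do this; the `PacketRecord` fields, the function names, and the normalization of throughput to flits per cycle per node are assumptions made for illustration, and the framework's exact metric definitions (see [57, 58]) may differ.

```python
from dataclasses import dataclass

@dataclass
class PacketRecord:
    # Hypothetical record layout: one entry per delivered packet.
    inject_cycle: int   # cycle at which the packet entered the network
    eject_cycle: int    # cycle at which the packet left the network
    flits: int          # packet length in flits

def average_latency(records):
    """Mean number of cycles between injection and ejection."""
    if not records:
        raise ValueError("no delivered packets to analyze")
    return sum(r.eject_cycle - r.inject_cycle for r in records) / len(records)

def throughput(records, num_nodes, cycles):
    """Delivered flits per cycle per node over the measured window."""
    if num_nodes <= 0 or cycles <= 0:
        raise ValueError("num_nodes and cycles must be positive")
    return sum(r.flits for r in records) / (num_nodes * cycles)
```

Sweeping the injection rate and calling both functions at each point yields the kind of throughput and latency curves reported in Sect. 4.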

3.3.4 Performance analysis

This module collects the results, coverage reports, and SB checker outputs. It plots the throughput, latency, maximum consumed energy, and power consumption of the NoC-based MPSoC. It repeats the emulation process for all user-defined design-space exploration parameters and recommends the best configuration parameters to the user based on the required criteria.
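The repeat-and-recommend behavior described above can be sketched as an exhaustive sweep over the design space. This is only an illustration: the parameter names, the `evaluate` callback (which stands in for a full emulation run), and the `score` callback are hypothetical, and the example space simply reuses the values listed for the first case study in Sect. 4.

```python
from itertools import product

# Example space taken from the first case study; key names are assumptions.
SPACE = {
    "network_size": ["2x2", "4x4", "8x8", "8x16"],
    "buffer_bits": [16, 32, 64],
    "virtual_channels": [1, 2, 3],
    "traffic_pattern": ["transpose", "bit-complement", "uniform"],
}

def enumerate_design_space(space):
    """Yield every combination of parameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def recommend(space, evaluate, score):
    """Evaluate every configuration and return the one with the highest score.

    `evaluate(cfg)` is a placeholder for one emulation run returning metrics;
    `score(metrics)` encodes the user's criterion (higher is better).
    """
    best_cfg, best_metrics = None, None
    for cfg in enumerate_design_space(space):
        metrics = evaluate(cfg)
        if best_metrics is None or score(metrics) > score(best_metrics):
            best_cfg, best_metrics = cfg, metrics
    return best_cfg, best_metrics
```

With the values above the sweep covers 4 × 3 × 3 × 3 = 108 configurations, which illustrates why automating the generate-run-collect loop matters as the space grows.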

4 Experiments and results

We developed four case studies based on the router architectures discussed in Sect. 3.1.3:

  • Daniel’s router: the emulation is performed for NoC-based architectures with network sizes \(2\times 2\), \(4\times 4\), \(8\times 8\), and \(8\times 16\); buffer sizes 16, 32, and 64; 1, 2, and 3 virtual channels; Traffic Patterns (TPs) transpose, bit-complement, and uniform; and 3D mesh and torus topologies with an \(8\times 8\times 4\) network size.

  • CB router: the emulation is performed for NoC-based architectures with mesh and torus topologies, 2D and 3D network dimensions, and centralized-based and distributed-based network architectures. All configurations have a capacity of 64 PEs.

  • RRFB router: the emulation is performed for NoC-based architectures with the XYZ, West-First (WF), and Negative-First (NegF) routing algorithms and with centralized-based and distributed-based 2D network architectures. All configurations have a capacity of 64 PEs.

  • VCCB router: the emulation is performed for NoC-based architectures with centralized-based and distributed-based 2D/3D networks, with a capacity of 64 PEs for the 2D NoC-based architectures and 256 PEs for the 3D NoC-based architectures.

These four case studies demonstrate the flexibility of our framework in supporting design space exploration for NoC-based architectures. Different configurations are applied to validate the architectures and measure their performance, thereby verifying our framework’s accuracy and scalability. We focus our discussion of the results on NoC performance evaluation under synthetic and real bench-mark traffic patterns and on emulation versus simulation speed-up.

4.1 NoC performance evaluation

NoC-based MPSoC architectures with different configurations are implemented and verified to guarantee that no bus routing failures or NoC congestion occur. Afterwards, each MPSoC architecture is emulated, with co-modeling accelerating the generation of the performance results.

4.1.1 Performance comparison

Experimental results, such as throughput and latency versus injection rate for various topologies, network sizes, network dimensions, network connection architectures, TPs, numbers of VCs, and buffering techniques, are shown below.

The performance evaluation of Daniel’s router (first case study) is as follows:

  • NoC sizes: Figure 14a, b illustrate the NoC throughput and latency for \(2\times 2\), \(4\times 4\), \(8\times 8\), and \(16\times 8\) NoC sizes under the configuration of one virtual channel, 64-bit buffer size, and uniform traffic. The figures show that small NoCs achieve better throughput and latency than larger ones.

  • Buffer sizes: Figure 14c, d present the NoC performance for buffer sizes of 16, 32, and 64 bits under the configuration of \(8\times 8\) NoC size and uniform traffic. The figures show that increasing the NoC buffer size improves performance, at the expense of area and power consumption.

  • Traffic patterns: Figure 15a, b show the NoC performance for the uniform, transpose, and bit-complement traffic patterns under the configuration of \(4\times 4\) NoC size and 16-bit buffer size. The figures show that the transpose traffic pattern performs best while the uniform pattern performs worst.

  • Number of VCs: Figure 15c, d illustrate the NoC performance for 1, 2, and 3 virtual channels per port. The figures show that increasing the number of VCs improves NoC performance. As with the buffer size, this comes at the cost of area and power consumption.

  • Topologies (3D NoC): Figure 16a, b present the NoC performance for two 3D NoC topologies: \(8\times 8\times 4\) mesh and torus. The figures show that the torus topology outperforms the mesh topology because of the wrap-around routing in the x, y, and z directions. Increasing the NoC dimensions also improves performance, at the cost of area and power consumption.
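The effect of the wrap-around links can be illustrated with a short calculation of the mean hop count between nodes. The sketch below assumes minimal routing and averages over all ordered source/destination pairs (including a node paired with itself); it is an analytical illustration, not part of the emulation framework, and the function names are hypothetical.

```python
def mean_distance_1d(k, wrap):
    """Mean hop count along one dimension with k nodes.

    wrap=False models a mesh (line); wrap=True models a torus (ring).
    The average runs over all k*k ordered pairs, self-pairs included.
    """
    total = 0
    for s in range(k):
        for d in range(k):
            delta = abs(s - d)
            total += min(delta, k - delta) if wrap else delta
    return total / (k * k)

def mean_hops(dims, wrap):
    """Mean minimal hop count; dimensions contribute independently."""
    return sum(mean_distance_1d(k, wrap) for k in dims)

mesh_hops = mean_hops((8, 8, 4), wrap=False)   # 6.5
torus_hops = mean_hops((8, 8, 4), wrap=True)   # 5.0
```

For the \(8\times 8\times 4\) network, the torus shortens the mean path from 6.5 to 5.0 hops under these assumptions, which is consistent with the lower latency observed for the torus topology.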

Fig. 14 Throughput and latency performance evaluation for different network and buffer sizes of Daniel router architecture

Fig. 15 Throughput and latency performance evaluation for different TPs and number of VCs of Daniel router architecture

Fig. 16 Throughput and latency performance evaluation for mesh and torus topologies of 3D Daniel router architecture

The performance evaluation of CB and RRFB router architectures (second and third case studies) is as follows:

  • NoC topologies and router architectures: Figure 17a, b illustrate the NoC performance for the mesh and torus topologies of the CB and RRFB 2D-router architectures. The figures show that the torus topology performs better than the mesh. The RRFB router architecture outperforms the CB architecture thanks to its ability to handle busy ports with congested buffers, as discussed in Sect. 3.1.4.

  • Topologies (3D NoC): Figure 17c, d present the NoC performance for the mesh and torus topologies of the CB and RRFB 3D-router architectures. The figures show that the 3D-router architectures provide better performance than the 2D ones, as illustrated in Sect. 3.1.4.

  • Routing algorithms: Figure 18 shows the NoC performance for different routing algorithms (XYZ, WF, NegF) implemented in the RRFB 2D-router under the configuration of \(8\times 8\) NoC size, random traffic pattern, and 64-bit buffer size. The figure shows that the XYZ routing algorithm provides the best performance, followed by WF and then NegF. The XYZ algorithm supports buffering flexibility: unlike the WF and NegF algorithms, it places no constraint on port selection when transferring a packet from source to destination.
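For readers unfamiliar with XYZ routing, the sketch below shows its textbook dimension-order form: a packet first resolves its X offset, then Y, then Z. The port names and function names are assumptions for illustration, and the RRFB router's actual implementation (including its buffering behavior) may differ from this simplified model.

```python
# Output-port names per axis; these labels are illustrative assumptions.
PORTS = (("X+", "X-"), ("Y+", "Y-"), ("Z+", "Z-"))
STEP = {"X+": (1, 0, 0), "X-": (-1, 0, 0),
        "Y+": (0, 1, 0), "Y-": (0, -1, 0),
        "Z+": (0, 0, 1), "Z-": (0, 0, -1)}

def xyz_next_port(cur, dst):
    """Return the output port for one hop of dimension-order routing."""
    for axis, (pos, neg) in enumerate(PORTS):
        if dst[axis] > cur[axis]:
            return pos
        if dst[axis] < cur[axis]:
            return neg
    return "LOCAL"  # packet has arrived; deliver to the local PE

def xyz_path(src, dst):
    """List the ports taken from src to dst on a mesh (no wrap-around)."""
    cur, path = tuple(src), []
    while True:
        port = xyz_next_port(cur, dst)
        if port == "LOCAL":
            return path
        path.append(port)
        cur = tuple(c + s for c, s in zip(cur, STEP[port]))
```

For instance, `xyz_path((0, 0, 0), (2, 1, 1))` yields two hops in X followed by one hop in Y and one in Z.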

Fig. 17 Throughput and latency performance evaluation of CB and RRFB 2D/3D router architectures (a–d)

The performance evaluation of VCCB router architecture (fourth case study) is as follows:

  • Router architectures: Figure 19 illustrates the NoC performance for centralized and distributed network-based VCCB and CB router architectures under the configuration of 64 PEs capacity, 64-bit buffer size, random traffic pattern, and three VCs per port for the VCCB routers. The figure shows that the centralized network-based VCCB provides the best performance, followed by the distributed network-based VCCB, while the centralized CB comes last. This result is due to the VC flow control, which speeds up packet transfer. The centralized network-based architecture outperforms the distributed one because there is only one hop of latency from source to destination, but this comes at the cost of area and power, which are very large in the centralized network.

  • 3D router architectures: Figure 20 presents the NoC performance for the 3D-VCCB router compared with the 3D-Daniel router under the configuration of \(8\times 8\times 4\) NoC size. We found that our proposed 3D-VCCB architecture provides better performance than the 3D-Daniel router.

Fig. 18 Throughput and latency performance evaluation for various routing algorithms (XYZ, WF, NegF) of RRFB router architecture

Fig. 19 Throughput and latency performance evaluation for centralized and distributed network-based VCCB and RRFB router architectures

Fig. 20 Throughput and latency performance evaluation for 3D distributed network-based VCCB compared with 3D-Daniel router architectures

4.1.2 Maximum energy and power per buffer

Buffers contribute more than \(55\%\) of the dynamic power and \(65\%\) of the area of NoCs [59]. Ultimately, the total power per buffer is directly proportional to the total number of flits stored in the buffers across all routers. Figure 21a, b illustrate the average maximum NoC energy and power per buffer versus the average injection rate for a NoC-based architecture configured as a 3D-NoC with 27 PEs and the following router architectures:

  • RRFB 3D-router applying the XYZ, WF, and NegF routing algorithms.

  • Distributed-network based CB 3D-router.

  • Distributed-network based VCCB 3D-router.

  • 3D-Daniel router.

As shown in Fig. 21a, b, the NoC performance in terms of power and energy is consistent with the previous throughput and latency results: as the NoC throughput and latency improve, the maximum energy and power per buffer improve as well. The figures show that the best average maximum consumed energy and power per buffer are achieved by the XYZ routing algorithm on the RRFB router architecture, since this routing algorithm utilizes the NoC resources efficiently and balances the packets traversing the NoC. On the other hand, the worst average maximum energy and power per buffer are those of the CB router architecture, which applies the basic routing algorithm.

Fig. 21 Network average maximum energy and power evaluation for 3D distributed network-based VCCB, CB, and Daniel router architectures and for the RRFB router architecture with different routing algorithms (XYZ, WF, NegF)

4.2 Evaluation under real bench-mark traffic

As mentioned, our framework supports synthetic and application-based traffic patterns. Besides the synthetic traffic evaluation, we evaluate the 3D-RRFB and 3D-CB router architectures with real bench-mark traffic patterns of two popular video applications, MPEG4 and DVOPD [60], based on Communication Task Graphs (CTGs). We model the mapping problem using the discrete optimization language MiniZinc [61]. We optimize the mapping of the MPEG4 and DVOPD applications to reduce the communication cost, defined as the number of hops between every two routers multiplied by the communication bandwidth between them in the communication task graph. The bench-mark traffic patterns are loaded into the memory of the HW emulator for each router, either CB or RRFB architecture; then the emulation flow described in Sect. 3.3.3 is followed to obtain the performance evaluation. The NoC throughput and latency are presented in Figs. 22a, b and 23a, b.
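The communication cost described above can be written down in a few lines. The sketch below is not the MiniZinc model used in this work: it swaps in a brute-force search in Python, uses a 2D mesh for brevity, and operates on a made-up four-task graph (`TOY_CTG`) rather than the MPEG4 or DVOPD graphs. All names are illustrative assumptions.

```python
from itertools import permutations

# Toy CTG: (source task, destination task) -> bandwidth. Values are made up.
TOY_CTG = {(0, 1): 100, (1, 2): 50, (2, 3): 10}

def hops(a, b, width):
    """Manhattan distance between nodes a and b on a row-major 2D mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)

def communication_cost(mapping, ctg, width):
    """Sum of bandwidth x hop count over all CTG edges.

    mapping[t] is the mesh node that hosts task t.
    """
    return sum(bw * hops(mapping[s], mapping[d], width)
               for (s, d), bw in ctg.items())

def best_mapping(num_tasks, num_nodes, ctg, width):
    """Exhaustively search task-to-node mappings; feasible only for tiny cases."""
    best = None
    for nodes in permutations(range(num_nodes), num_tasks):
        cost = communication_cost(nodes, ctg, width)
        if best is None or cost < best[0]:
            best = (cost, nodes)
    return best
```

On a \(2\times 2\) mesh, the naive mapping `(0, 1, 2, 3)` of the toy graph costs 210, while the exhaustive search finds a mapping with cost 160 by placing every communicating pair one hop apart. A constraint solver is needed for realistic graphs because the number of candidate mappings grows factorially.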

Fig. 22 DVOPD performance comparisons between 3D-RRFB and 3D-CB under real bench-mark traffic (a, b)

Fig. 23 MPEG4 performance comparisons between 3D-RRFB and 3D-CB under real bench-mark traffic (a, b)

4.3 Emulation versus simulation speed-up

We examined the emulation versus simulation speed-up and capacity to assess the performance improvement between the two environments. We found an average emulation performance gain of \(40\times \) without sacrificing cycle accuracy.

5 Conclusion

High-performance embedded systems increase the design complexity and the demand for efficient SoC architectures. The NoC-based architecture paradigm is considered a solution to deliver the required performance and meet modern embedded system constraints in power, time, and throughput. To assess the performance of NoC-based architectures, verification and emulation become a primary necessity. This paper proposed a flexible and scalable hardware-software verification framework that works in both simulation and hardware emulation utilizing UVM. Emulation and testbench acceleration are fully auto-generated and carried out through a design space exploration flow for various NoC-based architecture configurations, employing synthetic and application-based traffic patterns. Many experiments were conducted to assess the correctness and performance of 2D and 3D NoC-based MPSoCs. Results show that our framework speeds up performance evaluation by \(40\times \) with respect to software simulators. As future work, we aim to support an open-source framework and provide a graphical debugging tool.