1 Introduction

Electronic System Level (ESL) design space exploration in general and the validation of ESL designs in particular typically rely on dynamic model simulation in order to study and optimize a desired system or electronic device before it is actually built. For this simulation, an abstract model of the intended system is first specified in a System-Level Description Language (SLDL) such as SpecC [13] or SystemC [15]. This SLDL model is then compiled, executed, and evaluated on a host computer. In order to reflect causal ordering and provide abstract timing information, SLDL simulation algorithms are usually based on the classic Discrete Event (DE) semantics which drive the simulation process forward by use of events and simulation time advances.

As an IEEE standard [18], the SystemC SLDL [15] is predominantly used in both industry and academia. Under the umbrella of the Accellera Systems Initiative [1], the SystemC Language Working Group [31] not only maintains the official SystemC language definition but also provides an open-source proof-of-concept library [16] that can be used free of charge to simulate SystemC design models within a standard C++ programming environment. However, this reference implementation follows the traditional scheme of sequential Discrete Event Simulation (DES), which is much easier to implement than truly parallel approaches. Here, the discrete time events generated during the simulation are processed strictly in sequential order. As such, the SystemC reference simulator runs fully sequentially and cannot utilize any parallel computing resources available on today’s multi- and many-core processor hosts. This severely limits the simulation speed since a single processor core has to perform all the work.

1.1 Exploiting Parallelism for Higher Simulation Speed

In order to provide faster simulation performance, parallel execution is highly desirable, especially since the SLDL model itself already contains clearly exposed parallelism that is explicitly specified by the language constructs, such as SC_METHOD, SC_THREAD, and SC_CTHREAD in the case of the SystemC SLDL. Here, the topic of Parallel Discrete Event Simulation (PDES) [12] has recently attracted renewed interest. Generally, the PDES approach issues multiple threads concurrently and runs these threads in parallel on the multiple available processor cores. In turn, the simulation speed increases significantly. The naive assumption of a linear speedup by a factor of n for n available processor cores rarely holds, however, due to the usually quite limited amount of parallelism exposed in the model and due to Amdahl’s law [2]. Typically, a speedup of one order of magnitude can be expected for embedded system applications with parallel modules [6].

With respect to PDES performance, it is very important to understand the dependencies among the discrete time events and also to distinguish between the dependencies present in the model and the (additional) dependencies imposed by the simulation algorithm. In general, SLDL models assume a partial order of events where certain events depend on others due to a causal relationship or occurrence in sequence along the time line, and other events are independent and may be performed in any order or in parallel. On top of this required partial ordering of events in the model, a sequential simulator (described in Sect. 17.2) imposes a much stronger total order on the event processing. This is acceptable as it observes the weaker SLDL execution semantics but results in slow performance due to the overly restrictive sequential execution.

In contrast, PDES follows a partial ordering of events. Here, we can distinguish between regular synchronous PDES, which advances the (global) simulation time in order and executes threads in lockstep parallel fashion, and aggressive asynchronous PDES, where simulation time is effectively localized and independent threads can execute out of order and run ahead of others. As we will explain in detail in Sect. 17.3, the strict in-order execution imposed by synchronous PDES still limits the opportunities for parallel execution. When a thread finishes its evaluation phase, it has to wait until all other threads have completed their evaluation phases as well. Threads finishing early must stop at the simulation cycle barrier, and available processor cores are left idle until all runnable threads reach the cycle barrier.

This problem is overcome by the latest state-of-the-art approaches which implement a novel technique called Out-of-Order Parallel Discrete Event Simulation (OOO PDES) [6, 8]. By internally localizing the simulation time to individual threads and carefully processing dependent events only at the required times, an OOO PDES simulator can issue threads in parallel and ahead of time, exploiting the maximum available parallelism without loss of accuracy. Thus, the OOO PDES presented in Sect. 17.4 minimizes the idle time of the available parallel processor cores and results in the highest simulation speed while fully maintaining the standard SLDL and unmodified model semantics.

We should note that parallel simulation in general, and synchronous PDES and OOO PDES in particular, can be implemented in different environments and are not language dependent. In fact, the parallel simulation techniques discussed in Sects. 17.3 and 17.4 were originally designed using the SpecC language [10, 13, 35] and integrated into the system-on-chip environment (see Chap. 31, “SCE: System-on-Chip Environment” and in particular Sect. 4.1, “Simulation”) but can be equally well applied to other SLDLs with explicitly exposed parallelism. Without loss of generality, and in order to present the state of the art, we describe in this chapter the evolution from sequential DES over synchronous PDES to asynchronous OOO PDES using the SystemC SLDL [15, 18] which is both the de facto and the official standard for ESL design today. In particular, we describe in Sect. 17.5 our Recoding Infrastructure for SystemC (RISC) [21], which is open source [20] and consists of a dedicated SystemC compiler and corresponding out-of-order parallel simulator, and thus provides a reference implementation of OOO PDES for SystemC.

Finally, we conclude this chapter in Sect. 17.6 with experimental results for embedded application examples that demonstrate the significantly reduced simulator run times due to parallel simulation.

1.2 Related Work on Accelerated Simulation

Parallel simulation is a well-studied subject in the literature [4, 12, 24]. Two major synchronization paradigms can be distinguished, namely, conservative and optimistic [12]. Conservative PDES typically involves dependency analysis and ensures in-order execution for dependent threads. In contrast, the optimistic paradigm assumes that threads are safe to execute and rolls back when this assumption proves incorrect. Often, the temporal barriers in the model prevent effective parallelism in conservative PDES, while rollbacks in optimistic PDES are expensive at run time.

The OOO PDES presented in Sect. 17.4 is conservative and can be seen as an improvement over synchronous PDES (Sect. 17.3) approaches, such as [11, 27] for SystemC and [7] for SpecC.

Distributed parallel simulation, including [4, 17], is a natural extension of PDES where the system model is partitioned into sets of modules and distributed onto a set of networked host computers. Here, the necessary model partitioning may create extra work, and often the network speed becomes a bottleneck for synchronization and communication among the multiple host simulators.

In a wider perspective, simulation performance can also be improved by clever software modeling and utilizing specialized hardware. Software approaches include the overall concept of transaction-level modeling (TLM) [3] and temporal decoupling [32, 33], which abstract communication from a pin-accurate level to entire transactions with only approximate timing, and source-level [30] or host-compiled simulation [14], which model computation and scheduling at a higher abstraction level. Typically, these modeling techniques trade higher simulation speed for reduced timing accuracy.

Simulation acceleration can also be achieved by special purpose hardware, such as field-programmable gate arrays (FPGA) and graphics processing units (GPU). For example, [29] emulates SystemC code on FPGA boards and [23] proposes a SystemC multi-threading model on GPUs. As a hybrid approach, [28] parallelizes SystemC simulation across multiple CPUs and GPUs. Usually, special hardware approaches pose limitations on the model partitioning across the heterogeneous simulator units.

Parallel simulation can also be organized as multiple simulators which run independently in parallel and synchronize regularly or when needed (i.e., when timing differences grow beyond a given threshold). For instance, the Wisconsin Wind Tunnel [22] uses a conservative time bucket synchronization scheme to coordinate multiple simulators at predefined intervals. Another example [34] introduces a simulation backplane to handle the synchronization between wrapped simulators and analyzes the system to optimize the period of synchronization message transfers.

From this discussion, we can see that parallel simulation is an important topic that has been addressed and optimized with a multitude of approaches. For the remainder of this book chapter, we will focus on the SystemC language and present two general parallel simulation techniques, namely, synchronous PDES in Sect. 17.3 and an advanced out-of-order PDES algorithm in Sect. 17.4. Both approaches are general in the sense that they do not require any special setup or hardware, pose no limitations on the simulation model, and do not sacrifice any timing accuracy in the result.

2 Discrete Event Simulation (DES)

Before we address the details of modern parallel simulation, we review in this section the classic sequential simulation of models with discrete time. After the description of this traditional DES algorithm, we then extend it incrementally to synchronous PDES in Sect. 17.3 and finally to out-of-order PDES in Sect. 17.4.

2.1 Discrete Time and Discrete Event Model

SLDL simulation is driven by discrete events and simulation time advances. We will use the term simulation time for the simulated time maintained by the simulator. This must not be confused with simulator run time, which is the actual wall-clock time that measures how long it takes the simulator to perform the simulation on the host. Simulation time consists of a tuple (t, δ) where t represents an integral amount of simulated time since the simulation start t = 0 and δ is a non-negative integer that counts iterations at the same simulated time t due to event notifications.

Formally, time tuples, often referred to as time stamps, form a partial order and can be compared as follows [6]:

  • equal: \((t_1, \delta_1) = (t_2, \delta_2)\) iff \(t_1 = t_2\) and \(\delta_1 = \delta_2\)

  • before: \((t_1, \delta_1) < (t_2, \delta_2)\) iff \(t_1 < t_2\), or \(t_1 = t_2\) and \(\delta_1 < \delta_2\)

  • after: \((t_1, \delta_1) > (t_2, \delta_2)\) iff \(t_1 > t_2\), or \(t_1 = t_2\) and \(\delta_1 > \delta_2\)
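To illustrate, the time stamp tuple and its comparison rules can be captured in a small C++ struct as sketched below (this is our own illustrative code, not part of the SystemC or RISC API):

```cpp
#include <cstdint>

// Simulation time stamp (t, delta): t counts simulated time units since the
// start of simulation, delta counts iterations at the same simulated time t.
struct TimeStamp {
    uint64_t t;      // simulated time
    uint64_t delta;  // delta-cycle count at time t

    bool operator==(const TimeStamp &o) const { return t == o.t && delta == o.delta; }
    bool operator< (const TimeStamp &o) const { return t < o.t || (t == o.t && delta < o.delta); }
    bool operator> (const TimeStamp &o) const { return o < *this; }
};
```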

Events serve the purpose of synchronization among communicating or dependent threads. Threads can notify events or wait for events. An event notification at time (t, δ) reaches a thread waiting for the event at the same time, or it expires without any effect if no thread is waiting for the event. If a thread wakes up due to a notified event, it resumes its execution at the next delta increment (t, δ + 1) or immediately at (t, δ) in case of a SystemC immediate notification.

In other words, the SLDL semantics use an outer cycle in the simulation process to model time advances so that thread execution can reflect estimated duration or delays by adding to time t. In addition, there is an inner cycle in the simulator, called delta cycle, that is used for event notifications which may in turn wake waiting threads, in which case time is incremented by one δ count. In the case of SystemC, yet another innermost cycle is available where so-called immediate event notifications may take place without any time advance. Since immediate notifications can easily lead to nondeterministic models with potential deadlocks or lost events, these should generally be avoided.
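For concreteness, the three notification variants map to the standard sc_core API as follows (the calls are shown together only for illustration; in a real model each would be used on its own):

```cpp
#include <systemc>
using namespace sc_core;

sc_event e;

void notification_examples() {
    e.notify();                    // immediate notification: wakes waiters at (t, delta)
    e.notify(SC_ZERO_TIME);        // delta notification: wakes waiters at (t, delta + 1)
    e.notify(sc_time(10, SC_NS));  // timed notification: wakes waiters 10 ns later
}
```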

2.2 Scheduling Queues

SLDLs use a set of parallel threads to execute the functionality in the model. In SystemC, such threads are explicitly specified as SC_METHOD, SC_THREAD, or SC_CTHREAD. These threads are then managed by a scheduler in the simulator kernel which decides when threads are actually dispatched to run.

To coordinate the execution of the threads according to the SLDL semantics, the simulation scheduler typically maintains sorted lists or queues of threads where each thread is a member of one queue at any time. For ease of understanding, we use a simplified formal model here and ignore special cases in the SystemC language, such as suspended processes. Formally we define the following scheduling sets [9]:

  • THREADS = READY ∪ RUN ∪ WAIT ∪ WAITTIME

  • READY = { th | th is ready to run }

  • RUN = { th | th is currently running }

  • WAIT = { th | th is waiting for one or more events }

  • WAITTIME = { th | th is waiting for time advance }

At the beginning of simulation at time (0, 0), all threads are placed into the READY queue and the other sets are empty:

  • THREADS = READY

  • \(\textsf{RUN} = \textsf{WAIT} = \textsf{WAITTIME} = \varnothing\)

During simulation, the scheduler moves threads between the queues and suspends or resumes their execution according to their state. In order to describe the discrete event scheduling algorithms below, we formally define the following scheduling operations on threads th maintained in queues A and B:

  • Run(th): thread th begins or resumes its execution

  • Stop(): the current thread stops running or suspends its execution

  • Yield(th): the current thread stops running and yields execution to thread th

  • th = Pick(A): pick one thread th out of set A

  • Move(th, A, B): move thread th from set A to set B

For inter-thread synchronization and communication, the SystemC language provides events and channel primitives with special update semantics. Without going into the details of event notifications and channel updates, we denote the set of instantiated primitive channels in the model as CHANNEL.

2.3 Sequential Discrete Event Scheduler

Figure 17.1 shows the traditional DES scheduling algorithm as it is implemented by the Accellera SystemC reference simulator [16]. Most notably, this algorithm is fully sequential in the sense that only a single thread is made runnable at any time. In SystemC specifically, the choice of the next thread to run is nondeterministic by definition, so one thread is randomly picked from the READY queue and placed into the RUN queue for execution. When the thread reaches a wait statement, it yields control back to the scheduler, which in turn picks the next thread to run.

Fig. 17.1
figure 1

Traditional discrete event simulation (DES) scheduler for SystemC

When the READY queue is empty, the scheduler performs requested channel updates and event notifications which typically fills the READY queue again with threads that wake up due to events they were waiting for. These are taken out of the WAIT queue and a new delta cycle begins.

If no threads become ready after the update and notification phase, the current time cycle is complete. Then the scheduler advances the simulation time, moves all threads with the earliest next time stamp from the WAITTIME queue into the READY queue, and resumes execution with the next time cycle.

Finally, when both the READY and WAITTIME queues are empty, the simulation terminates.
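The control flow described above can be summarized in the following structural sketch; it uses the queue names and operations from Sect. 17.2.2, while update_channels, notify_delta_events, and advance_time_to_earliest are placeholder helpers of our own, so this is not the actual Accellera kernel code:

```cpp
// Structural sketch of the sequential DES scheduler loop (cf. Fig. 17.1).
void des_scheduler() {
    while (true) {
        // Evaluation phase: run one thread at a time until READY is empty.
        while (!READY.empty()) {
            Thread *th = Pick(READY);   // nondeterministic choice per the LRM
            Move(th, READY, RUN);
            Yield(th);                  // th runs until its next wait(), which moves
        }                               // it from RUN into WAIT or WAITTIME
        update_channels();              // update phase for primitive channels
        notify_delta_events();          // may move threads from WAIT back into READY
        if (!READY.empty()) continue;   // begin the next delta cycle

        if (WAITTIME.empty()) break;    // READY and WAITTIME empty: simulation ends
        advance_time_to_earliest();     // move the earliest-time threads from WAITTIME
    }                                   // into READY and begin the next time cycle
}
```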

3 Parallel Discrete Event Simulation (PDES)

The sequential DES algorithm can be easily extended to support synchronous parallel simulation. Instead of a single thread in DES, regular PDES manages multiple threads at the same time in the RUN queue. These threads can then execute truly in parallel on the parallel processors of the host.

3.1 Parallel Discrete Event Scheduler

The PDES scheduling algorithm, as shown in Fig. 17.2, operates the same way as the traditional scheduler in Fig. 17.1, but with one exception: the synchronous parallel scheduler picks multiple threads from the READY queue and runs them in parallel on the available processor cores.

Fig. 17.2
figure 2

Synchronous parallel discrete event simulation (PDES) scheduler for SystemC

In the evaluation phase, as long as the READY queue is not empty and an idle core is available, the PDES scheduler issues a new thread from the READY queue in a loop. When a thread finishes earlier than the other threads in the same cycle, a new ready thread is picked and assigned to the freed processor core. Thus, within the same delta cycle, PDES keeps as many processor cores busy as possible.

However, we notice that only threads at the same time (t, δ) run in parallel. Synchronous PDES implies an absolute barrier at the end of each delta and time cycle. All threads need to wait at the barrier until all other runnable threads finish their current evaluation phase. Only then does the synchronous scheduler perform the channel update and event notification phase for the next delta cycle or advance the simulation time for the next time cycle.
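Expressed with the same operations, the evaluation phase of the synchronous PDES scheduler and its cycle barrier look roughly as follows (again a structural sketch; busy_cores, num_cores, and wait_until_some_thread_blocks are placeholders of our own):

```cpp
// Evaluation phase of synchronous PDES (cf. Fig. 17.2), structural sketch only.
// All threads issued here share the same simulation time (t, delta).
int busy_cores = 0;
while (!READY.empty() || busy_cores > 0) {
    // Issue ready threads onto idle processor cores.
    while (!READY.empty() && busy_cores < num_cores) {
        Thread *th = Pick(READY);
        Move(th, READY, RUN);
        Run(th);                          // dispatch th on a free core
        ++busy_cores;
    }
    wait_until_some_thread_blocks();      // a running thread reaches wait() and
    --busy_cores;                         // frees its core for the next ready thread
}
// Cycle barrier reached: every runnable thread of this delta cycle has finished.
// Only now follow the update/notification phase and the simulation time advance.
```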

3.2 Protection of the Parallel Simulation Kernel

The benefit of PDES running more than a single thread at the same time comes at a price. Explicit synchronization among the parallel threads becomes necessary in critical sections. In particular, shared data structures in the simulation kernel, including the global simulation time, event lists, and thread queues, need to be properly protected for mutually exclusive access by the concurrent threads.

In order to protect the central scheduling resources, locks (binary semaphores) and condition variables need to be introduced for proper thread synchronization. For example, our RISC prototype implementation (see Sect. 17.5 below) uses one dedicated kernel lock to protect the scheduling resources. This lock is acquired by the threads at every kernel entry and released upon kernel exit. Each thread also owns a condition variable c that is used in combination with the lock to put threads to sleep (Stop() calls wait(c)) or wake them up (Run(th) calls signal(c)), as determined by the scheduling kernel.
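A minimal sketch of this protection scheme, using C++ standard library primitives instead of raw Posix calls, is shown below; the names central_lock, Stop, and Run mirror the description above but are illustrative only and not the actual RISC kernel interfaces:

```cpp
#include <condition_variable>
#include <mutex>

std::mutex central_lock;  // one kernel lock protecting all scheduling resources

struct Thread {
    std::condition_variable c;  // per-thread condition variable
    bool runnable = false;
};

// Called by a thread inside the kernel; lk holds central_lock (acquired at kernel entry).
void Stop(Thread *self, std::unique_lock<std::mutex> &lk) {
    self->runnable = false;
    self->c.wait(lk, [self] { return self->runnable; });  // sleep, releasing the kernel lock
}

// Called by the scheduler (holding the kernel lock) to resume thread th.
void Run(Thread *th) {
    th->runnable = true;
    th->c.notify_one();  // wake th; it re-acquires the kernel lock before returning from wait
}
```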

3.3 Preserving SystemC Execution Semantics in PDES

In contrast to the SpecC language, which allows preemptive parallel execution of the threads in the model [10], the SystemC language poses strict rules on standard-compliant simulation. This is a very important aspect to consider when applying PDES to SystemC. For semantics-compliant SystemC simulation, complex inter-dependency analysis over all threads and variables in the model is a prerequisite to parallel execution [9]. The IEEE standard SystemC Language Reference Manual (LRM) [18] clearly states that “process instances execute without interruption”, a rule presumably meant to simplify the writing of SystemC models. Here, the need to prevent parallel access conflicts to shared variables and to avoid potential race conditions among the parallel threads becomes a burden for the simulation environment (rather than for the model designer).

This requirement is also known as cooperative (or coroutine) multitasking which is explicitly assumed by the SystemC execution semantics. As detailed in [9], the particular problem of parallel simulation is also explicitly addressed in the SystemC LRM [18]:

An implementation running on a machine that provides hardware support for concurrent processes may permit two or more processes to run concurrently, provided that the behavior appears identical to the coroutine semantics defined in this subclause. In other words, the implementation would be obliged to analyze any dependencies between processes and constrain their execution to match the coroutine semantics.

Consequently, a standard-compliant PDES environment for SystemC must identify and resolve any dependencies among the threads in the model. We will describe this required dependency analysis in detail in Sect. 17.5.3 because it is needed for both synchronous and out-of-order PDES.

4 Out-of-Order Parallel Discrete Event Simulation (OOO PDES)

In OOO PDES [6], we break the synchronous barrier in the simulator so that independent threads can also run in parallel when they are at different simulation times (different t or different δ). In other words, threads are allowed to run ahead in time and thus can execute out of order, unless a causal relationship prohibits it.

4.1 Thread-Local Simulation Time

For OOO PDES, we replace the global simulation time (t, δ) with local time stamps for each thread. Thus, each thread th maintains its own time \((t_{th}, \delta_{th})\).

Events get assigned their own time, too. Since events in the simulation model can occur multiple times and at different simulation times, we denote an event e notified at time (t, δ) as a triple \((id_e, t_e, \delta_e)\). Thus, every event is managed with its own ID and notification time attached.

Finally, we distinguish the sets of events that have been notified at a given time. Formally, we define:

  • \(\textsf{EVENTS} = \bigcup \textsf{EVENTS}_{t,\delta}\)

  • \(\textsf{EVENTS}_{t,\delta} = \{(id_e, t_e, \delta_e) \mid t_e = t, \delta_e = \delta\}\)

4.2 Dynamically Evolving Scheduling Queues

Rather than the static DES queues which exist at all times, we define for OOO PDES multiple sets ordered by their local time stamps and dynamically create and delete these sets as needed. For efficiency reasons, these sets are typically implemented as true queues where the threads are ordered by increasing time stamps. Formally, we define the following queues:

  • QUEUES = {READY, RUN, WAIT, WAITTIME}

  • \(\textsf{READY} = \bigcup \textsf{READY}_{t,\delta}\), where \(\textsf{READY}_{t,\delta} = \{th \mid th \text{ is ready to run at } (t,\delta)\}\)

  • \(\textsf{RUN} = \bigcup \textsf{RUN}_{t,\delta}\), where \(\textsf{RUN}_{t,\delta} = \{th \mid th \text{ is running at } (t,\delta)\}\)

  • \(\textsf{WAIT} = \bigcup \textsf{WAIT}_{t,\delta}\), where \(\textsf{WAIT}_{t,\delta} = \{th \mid th \text{ is waiting since } (t,\delta) \text{ for events } (id_e, t_e, \delta_e), (t_e, \delta_e) \geq (t,\delta)\}\)

  • \(\textsf{WAITTIME} = \bigcup \textsf{WAITTIME}_{t,\delta}\), where \(\delta = 0\) and \(\textsf{WAITTIME}_{t,\delta} = \{th \mid th \text{ is waiting for simulation time advance to } (t, 0)\}\)

As in the regular DES case, the simulation starts at time (0, 0) with all threads in the \(\textsf{READY}_{0,0}\) queue. Then again the threads change state by transitioning between the queues, as determined by the scheduler:

  • \(\mathbf{Move}(\mathit{th}, \textsf{READY}_{t,\delta}, \textsf{RUN}_{t,\delta})\): thread th is issued for execution

  • \(\mathbf{Move}(\mathit{th}, \textsf{RUN}_{t,\delta}, \textsf{WAIT}_{t,\delta})\): thread th calls wait(e) for an event e

  • \(\mathbf{Move}(\mathit{th},\textsf{RUN}_{t,\delta },\textsf{WAITTIME}_{t^{{\prime}},0})\), where \(t < t' = t + d\): thread th calls wait(d) to wait for a time delay d

  • \(\mathbf{Move}(\mathit{th},\textsf{WAIT}_{t,\delta },\textsf{READY}_{t^{{\prime}},\delta ^{{\prime\prime}}})\), where \((t, \delta) \leq (t', \delta')\): thread th is waiting since time (t, δ) for event \(e = (id_e, t_e, \delta_e)\) which is notified at time \((t', \delta')\); in turn, thread th becomes ready to run at \((t', \delta'')\) where \(\delta'' = \delta'\) (immediate notification) or \(\delta'' = \delta' + 1\) (regular delta cycle notification)

  • \(\mathbf{Move}(\mathit{th}, \textsf{WAITTIME}_{t,\delta}, \textsf{READY}_{t,\delta})\), where \(\delta = 0\): simulation time advances to time (t, 0), making one or more threads th ready to run; the local time for these threads th is set to \((t_{th}, \delta_{th})\) where \(t_{th} = t\) and \(\delta_{th} = 0\)

Whenever the sets \(\textsf{READY}_{t,\delta}\) and \(\textsf{RUN}_{t,\delta}\) become empty and there are no \(\textsf{WAIT}_{t^{{\prime}},\delta ^{{\prime}}}\) or \(\textsf{WAITTIME}_{t^{{\prime}},\delta ^{{\prime}}}\) queues with earlier time stamps \((t', \delta') \leq (t, \delta)\), then the scheduler can delete these sets as well as any expired events \(\textsf{EVENTS}_{t,\delta}\).
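One straightforward way to implement these dynamically evolving, time-indexed queues is a map from time stamps to thread lists, as sketched below (the container choices and helper function are ours, not prescribed by RISC):

```cpp
#include <cstdint>
#include <list>
#include <map>
#include <utility>

struct Thread;  // simulator thread; carries its local time (t_th, delta_th)

using TimeStamp = std::pair<uint64_t, uint64_t>;  // (t, delta), ordered lexicographically

// Each scheduling set is a family of queues, one per time stamp, kept sorted by the map.
using TimedQueues = std::map<TimeStamp, std::list<Thread*>>;

TimedQueues READY, RUN, WAIT, WAITTIME;

// Example transition: thread th in RUN at local time ts calls wait(d);
// it will resume at simulated time (ts.first + d, 0).
void move_to_waittime(Thread *th, const TimeStamp &ts, uint64_t d) {
    RUN[ts].remove(th);
    if (RUN[ts].empty()) RUN.erase(ts);        // drop the now-empty queue
    WAITTIME[{ts.first + d, 0}].push_back(th);
}
```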

4.3 Out-of-Order Parallel Discrete Event Scheduler

Figure 17.3 shows the OOO PDES scheduling algorithm. Since each thread maintains its own local time, the scheduler can relax the nested loops structure of synchronous PDES and deliver events and update simulation times individually, providing more flexibility for threads to run in parallel. Overall, this results in a higher degree of parallelism and thus higher simulation speed.

Fig. 17.3
figure 3

Out-of-order parallel discrete event simulation (OOO PDES) scheduler for SystemC

Note that the prior loops for explicit delta cycles and time cycles in the scheduler control flow do not exist any more for OOO PDES. Instead, we only have one main loop where both the notification phase and time updates are processed. The READY queue is consequently filled with more threads which, however, are now subject to possible conflicts. These conflicts are then taken into account when threads are picked from the READY queue and issued for execution into the RUN set.

Note also that the WAITTIME queue gets cleared in every scheduling step and all the threads move into the timed READY queue. Then, when the scheduler picks ready threads to run, it prefers earlier ones over threads with later time stamps. This order prevents threads from starving in the READY queue and also minimizes conflicts among the ready threads.

Potential conflicts are strictly averted by the NoConflicts(th) condition in Fig. 17.3 when runnable threads are picked. Here, detailed dependency analysis is used to avoid potential data, event, and time advance hazards among the set of threads in RUN that are executing in parallel. Only if NoConflicts(th) holds can a thread th be issued for parallel execution.

4.4 OOO PDES Scheduling Algorithm

Algorithm 3 formally defines the scheduling algorithm of OOO PDES. At each scheduling step, the scheduler first evaluates notified events and wakes up corresponding threads from WAIT. If a thread receives its event e with time stamp \((t_e, \delta_e)\), it becomes ready to run and its local time advances to either \((t_e, \delta_e)\) for an immediate notification or \((t_e, \delta_e + 1)\) for a regular delta notification.

Algorithm 3 OOO PDES scheduling algorithm

After event notifications, the scheduler processes local time advances and moves any threads in WAITTIME to the \(\textsf{READY}_{t,0}\) queue corresponding to their wait time.

Then the scheduler issues threads for parallel execution as long as idle CPU cores and threads without conflicts are available. Finally, if no threads can run, that is, when RUN = READY = ∅, the simulator terminates.

Note that Algorithm 3 allows enabling or disabling the parallel out-of-order execution at any time by setting the Cores parameter. For example, when in-order execution is needed for debugging purposes, we can set Cores to 1, and the algorithm will behave the same way as the traditional DES where only one thread is running in order at all times.

Algorithm 4 Conflict detection in OOO PDES scheduler

OOO PDES relies heavily on efficient conflict detection. At run time, the scheduler calls the function NoConflicts(th) listed in Algorithm 4. NoConflicts(th) checks for potential conflicts with all concurrent threads in the RUN and READY queues that run at an earlier time than the candidate thread th. For each concurrent thread, function \(\mathbf{Conflict}(th_1, th_2)\) checks for any data, time, and event hazards. We will explain these hazards and their analysis in detail in Sect. 17.5.3 below, because we can rely on the compiler to carry the heavy burden of this complex analysis and pass prepared conflict tables to the simulator. At run time, the scheduler can then perform these checks in constant time (O(1)) by use of table lookups.
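Conceptually, the run-time check reduces to table lookups, roughly as sketched below; the table names follow Sect. 17.5.3, while the function signatures are simplified placeholders of our own (timing hazards are omitted here for brevity):

```cpp
#include <vector>

// Tables prepared by the RISC compiler and passed to the simulator:
// CT[i][j]: potential data conflict between segments i and j (symmetric)
// NT[i][j]: segment i notifies an event that segment j is waiting for
extern std::vector<std::vector<bool>> CT, NT;

// Pairwise check between the current segments of two concurrent threads, O(1) per pair.
bool Conflict(int seg1, int seg2) {
    return CT[seg1][seg2]                    // data hazard
        || NT[seg1][seg2] || NT[seg2][seg1]; // event hazard in either direction
}

// A candidate thread (in segment seg_candidate) may only be issued if it has no
// conflict with any concurrent thread in RUN or READY that runs at an earlier time.
bool NoConflicts(int seg_candidate, const std::vector<int> &earlier_segments) {
    for (int s : earlier_segments)
        if (Conflict(seg_candidate, s)) return false;
    return true;
}
```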

5 Recoding Infrastructure for SystemC (RISC)

We have realized the OOO PDES approach for the SystemC language with our RISC. This proof-of-concept prototype environment consists of a compiler and simulator with examples and documentation. The RISC software package is available as open source on our website [20] and can be installed on any regular multi-core Linux host.

To perform semantics-compliant parallel SystemC simulation with out-of-order scheduling, we introduce a dedicated SystemC compiler that works hand in hand with a new simulator. This is in contrast to the traditional SystemC simulation flow where a SystemC-agnostic C++ compiler includes the SystemC headers and links the input model directly against the reference SystemC library.

As shown in Fig. 17.4, our RISC compiler acts as a frontend that processes the input SystemC model and generates an intermediate model with special instrumentation for OOO PDES. The instrumented parallel model is then linked against the extended RISC SystemC library by the target compiler (a regular C++ compiler) to produce the final executable output model. OOO PDES is then performed simply by running the generated executable model.

Fig. 17.4
figure 4

RISC compiler and simulator for out-of-order PDES of SystemC

From the user perspective, we simply replace the regular C++ compiler with the SystemC-aware RISC compiler (which in turn calls the underlying C++ compiler). Otherwise, the overall SystemC validation flow remains the same as before; it just runs faster due to the parallel simulation.

Internally, the RISC compiler performs three major tasks, namely, segment graph construction, conflict analysis, and source code instrumentation.

5.1 Segment Graph

The first task of the RISC compiler is to parse the SystemC input model into an Abstract Syntax Tree (AST) . Since SystemC is syntactically regular C++ code, RISC relies here on the ROSE compiler infrastructure [25]. The ROSE internal representation (IR) provides RISC with a powerful C/C++ compiler foundation that supports AST generation, traversal, analysis, and transformation.

As illustrated with the RISC software stack shown in Fig. 17.5a, the RISC compiler then builds on top of the ROSE IR a SystemC internal representation which accurately reflects the SystemC structures, including the module and channel hierarchy, port connectivity, and other SystemC-specific constructs. Up until this layer, the RISC software stack is very similar to the SystemC-clang framework [19].

Fig. 17.5
figure 5

Recoding infrastructure for SystemC (RISC) and a segment graph (SG) example [26]. (a) RISC software stack. (b) Example source code. (c) Segment graph [wait]

On top of this, the RISC compiler then builds a segment graph data structure. A Segment Graph (SG) [8] is a directed graph that represents the code segments executed by the threads in the model. With respect to SystemC simulation, these segments are separated by the scheduler entry points, i.e., the wait statements in the SystemC code. In other words, the discrete events in the SystemC execution semantics are explicitly reflected in the SG as segment boundaries.

Note that a general segment graph may be defined with different segment boundaries. In fact, the RISC infrastructure takes the segment boundary as a flexible parameter that may be set to any construct found in the code, including function calls or control flow statements, such as if, while, or return. Here, we use wait statements as boundary (specified as “segment graph [wait]” in Fig. 17.5c) since for the purpose of parallel simulation we are interested in the segments of code executed between two scheduling points.

Formally, a segment graph consists of nodes and edges [6]. While the nodes are defined by the specified boundaries, the edges in the SG are defined by the possible control flow transitions. A transition exists between two segment nodes \(S_1\) and \(S_2\) if the flow of control starting from segment \(S_1\) can reach segment \(S_2\).

For example, the SystemC source code in Fig. 17.5b results in the SG shown in Fig. 17.5c where the segment boundary is chosen as SystemC wait. Here, the control flow from the start can reach the two wait statements in lines 2 and 6, resulting in the two edges to segment “wait(line2)” and “wait(line6).” Note also that source code lines may become part of multiple segments. Here, the assignment z=z*z is part of both segments “wait(line2)” and “wait(line6)” because it can be reached from both nodes.
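As a data structure, a segment graph node can be represented quite directly; the following minimal sketch (our own layout, not the RISC internals) stores the statements belonging to a segment and its outgoing control flow edges:

```cpp
#include <set>
#include <string>
#include <vector>

// One node per segment, delimited by the chosen boundary (here: wait statements).
struct Segment {
    int id = 0;                          // unique segment ID
    std::string boundary;                // e.g. "wait(line2)", or "start" for the entry segment
    std::vector<std::string> statements; // statements/expressions executed in this segment
    std::set<Segment*> successors;       // edges: segments reachable via control flow
};
```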

5.2 Segment Graph Construction

The automatic construction of a segment graph is a complex process for the compiler. In this section, we first outline the main aspects and then provide a formal algorithm for the SG generation.

In contrast to the initial SpecC-based implementation [6, 8] (which is part of Chap. 31, “SCE: System-on-Chip Environment”), which had several limitations, the RISC SG generator can build a graph from any given scope in C/C++-based code, and the user can freely choose the segment boundaries (as stated above for a general SG). There are also no control flow limitations. The RISC compiler fully supports recursive functions, the jump statements break and continue, as well as multiple return statements from functions. Finally, expressions with an undefined evaluation order are properly ordered to avoid ambiguity.

Algorithm 5 Segment graph generation

Algorithm 5 formally defines the central function BuildSG used in the RISC SG generator. Function BuildSG is recursive and builds the graph of segments by traversing the AST of the design model. Here, the first parameter CurrStmt is the current statement which is processed next. The set CurrSegs contains the current set of segments that lead to CurrStmt and thus will incorporate the current statement. For instance, while processing the assignment z=z*z in Fig. 17.5, CurrSegs is the set {wait(line2), wait(line6)}, so the expression will be added to both segments.

If CurrStmt is a boundary statement (e.g., wait), a new segment is added to CurrSegs with corresponding transition edges (lines 2–7 in Algorithm 5). Compound statements are processed by recursively iterating over the enclosed statements (lines 8–12), and conditional statements are processed recursively for each possible flow of control (from line 13). For example, the break and continue statements represent an unconditional jump in the program. For handling these keywords, the segments in CurrSegs move into the associated set BreakSegs or ContSegs, respectively. After completing the corresponding loop or switch statement, the segments in BreakSegs or ContSegs will be moved back to the CurrSegs set.

For brevity, we illustrate the processing of function calls and loops in Fig. 17.6. The analysis of function calls is shown in Fig. 17.6a. In step 1 the dashed circle represents the segment set CurrSegs. The RISC algorithm detects the function call expression and checks whether the function has already been analyzed. If the function is encountered for the first time, it is analyzed separately, as shown in step 2. Otherwise, the algorithm reuses the cached SG for the particular function. Then in step 3, each expression in segment 1 of the function is joined with each individual segment in CurrSegs (set 0). Finally, segments 4 and 5 represent the new set CurrSegs.

Fig. 17.6
figure 6

Segment graph generation for functions and loops. (a) Function call processing. (b) Loop processing

Correspondingly, Fig. 17.6b illustrates the SG analysis for a while loop. Again the dashed circle in step 1 represents the incoming set CurrSegs. The algorithm detects the while statement and analyzes the loop body separately. The graph for the body of the loop is shown in step 2. Then each expression in segment 1 is joined into the segment set 0, and the new set CurrSegs becomes the joined set of 0+1, 4, and 5. Note that we have to consider set 0+1 for the case that the loop is not taken.

5.3 Static Conflict Analysis

The segment graph data structure serves as the foundation for static (compile time) conflict analysis. As outlined earlier, the OOO PDES scheduler must ensure that every new running thread is conflict-free with respect to any other threads in the READY and RUN queues. For this, we utilize the RISC compiler to detect any possible conflicts already at compile time.

Potential conflicts in SystemC include data hazards, event hazards, and timing hazards, all of which may exist among the current segments executed by the threads considered for parallel execution [6].

5.3.1 Data Hazards

Data hazards are caused by parallel or out-of-order accesses to shared variables. Three cases exist, namely, read after write (RAW), write after read (WAR), and write after write (WAW).

In the example in Fig. 17.7, if the simulator issued the threads \(th_1\) and \(th_2\) in parallel, this would create a race condition, making the final value of s nondeterministic. Thus, the scheduler must not run \(th_1\) and \(th_2\) out of order. Note, however, that \(th_1\) and \(th_2\) can run in parallel in the segment after their second wait statement if the functions f() and g() are independent.

Fig. 17.7
figure 7

Example of WAW conflict: Two parallel threads \(th_1\) and \(th_2\) start at the same time but write to the shared variable s at different times. Simulation semantics require that \(th_1\) executes first and sets s to 0 at time (5, 0), followed by \(th_2\) setting s to its final value 1 at time (10, 0)

Since data hazards stem from the code in specific segments, RISC analyzes data conflicts statically at compile time and creates a table in which the scheduler can then quickly look up, at run time, any potential conflicts between active segments.

Formally, we define a data conflict table CT[N, N] where N is the total number of segments in the application model: CT[i, j] = true iff there is a potential data conflict between the segments \(seg_i\) and \(seg_j\); otherwise, CT[i, j] = false.

To build the conflict table, the compiler generates for each segment a variable access list which contains all variables accessed in the segment. Each entry is a tuple (Symbol, AccessType) where Symbol is the variable and AccessType specifies read only (R), write only (W), read write (RW), or pointer access (Ptr).

Finally, the compiler produces the conflict table CT[N, N] by comparing the access lists for each segment pair. If two segments \(seg_i\) and \(seg_j\) share any variable with access type (W) or (RW), or there is any pointer access by \(seg_i\) or \(seg_j\), then this is marked as a potential conflict.
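In code, this table construction amounts to a pairwise comparison of the per-segment access lists, roughly as sketched below (the AccessType encoding and helper names are our own; the rule follows the conservative policy described above):

```cpp
#include <string>
#include <vector>

enum class AccessType { R, W, RW, Ptr };

struct Access { std::string symbol; AccessType type; };
using AccessList = std::vector<Access>;  // one list per segment, built by the compiler

static bool writes(AccessType a) { return a == AccessType::W || a == AccessType::RW; }

static bool has_ptr(const AccessList &L) {
    for (const Access &a : L)
        if (a.type == AccessType::Ptr) return true;
    return false;
}

// Two segments conflict if either contains a pointer access (conservative), or if
// they share a variable for which at least one of the two accesses is a write.
static bool conflict(const AccessList &A, const AccessList &B) {
    if (has_ptr(A) || has_ptr(B)) return true;
    for (const Access &a : A)
        for (const Access &b : B)
            if (a.symbol == b.symbol && (writes(a.type) || writes(b.type))) return true;
    return false;
}

// CT[i][j] = true iff segments i and j have a potential data conflict.
std::vector<std::vector<bool>> build_conflict_table(const std::vector<AccessList> &segs) {
    const size_t N = segs.size();
    std::vector<std::vector<bool>> CT(N, std::vector<bool>(N, false));
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            CT[i][j] = conflict(segs[i], segs[j]);
    return CT;
}
```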

Figure 17.8 shows an example SystemC model where two threads \(th_1\) and \(th_2\) run in parallel in modules M1 and M2, respectively. Both threads write to the global variable x, \(th_1\) in lines 14 and 27, and \(th_2\) in line 35 since reference p is mapped to x. Before we can mark these WAW conflicts in the data conflict table, we need to generate the segment graph for this example. The SG with corresponding source code lines is shown in Fig. 17.9a, whereas Fig. 17.9b shows the variable accesses by the segments. Note that segments 3 and 4 of thread \(th_1\) write to x, as well as segment 8 of \(th_2\) which writes to x via the reference p. Thus, segments 3, 4, and 8 have a WAW conflict. This is marked properly in the corresponding data conflict table shown in Fig. 17.10a.

Fig. 17.8
figure 8

SystemC example with two parallel threads in modules M1 and M2

Fig. 17.9
figure 9

Segment graphs generated by RISC for the example in Fig. 17.8. (a) Source code SG for Fig. 17.8. (b) Variable access SG for Fig. 17.8

Fig. 17.10
figure 10

Data conflict and event notification tables for the example in Fig. 17.8. (a) Data conflict table. (b) Event notification table

In general, not all variables are straightforward to analyze statically. SystemC models can contain variables at different scopes, as well as ports which are connected by port maps. The RISC compiler distinguishes and supports the following cases for the static variable access analysis.

  1. Global variables, e.g., x, y in lines 2 and 3 of Fig. 17.8: This case is discussed above and is handled directly as tuple (Symbol, AccessType).

  2. Local variables, e.g., temp in line 10 for Module M1: Local variables are stored on the stack and cannot be shared between different threads. Thus, they can be ignored in the variable access analysis.

  3. Instance member variables, e.g., i in line 2 for Module M2: Since classes can be instantiated multiple times and their variables are then distinct, we need to distinguish them by their complete instance path added to the variable name. For example, the actual symbol used for the instance variable i in module M2 is m.m2.i.

  4. References, e.g., p in line 23 in module M2: RISC follows references through the module hierarchy and determines the actual mapped target variable. For instance, p is mapped to the global variable x via the mapping in line 43.

  5. Pointers: RISC currently does not perform pointer analysis. This is planned as future work. For now, RISC conservatively marks all segments with pointer accesses as potential conflicts with all other segments.

5.3.2 Event Hazards

Thread synchronization through event notification also poses hazards to out-of-order execution. Specifically, thread segments are dependent when one is waiting for an event notified by another.

We define an event notification table NT[N, N] where N is the total number of segments: NT[i, j] = true iff segment \(seg_i\) notifies an event that \(seg_j\) is waiting for; otherwise, NT[i, j] = false. Note that, in contrast to the data conflict table, the event notification table is not symmetric.

Figure 17.10b shows the event notification table for the SystemC example in Fig. 17.8. For instance, NT[7, 3] = true since segment 7 notifies the event e in line 32, which segment 3 is waiting for in line 13.

Note that in order to identify event instances properly, RISC uses the same scope and port map handling for events as described above for data variables.

5.3.3 Timing Hazards

The local time for an individual thread in OOO PDES can pose a timing hazard when the thread runs too far ahead of others. To prevent this, we analyze the minimum time advances of threads at segment boundaries. For SystemC, there are three cases with different time increments, as listed in Table 17.1.

Table 17.1 Time advances at wait segment boundaries

In order for the scheduler to avoid timing hazards, we let the compiler prepare two time advance tables, one for the segment a thread is currently in and one for the next segment(s) that a thread can reach in the following scheduling step.

Fig. 17.11
figure 11

Current and next time advance tables for the example in Fig. 17.8. (a) Current time advance table. (b) Next time advance table

The current time advance table CTime[N] lists the time increment that a thread will experience when it enters the given segment. For the SystemC example in Fig. 17.8, Fig. 17.11a shows the corresponding current time advance table. Here, for instance, the wait(2,SC_MS) in line 30 at the beginning of segment 7 defines CTime[7] = (2, 0).

On the other hand, the next time advance table NTime[N] lists the time increment that a thread will incur when it leaves the given segment and enters the next one. Since there may be more than one next segment, we list in the table the minimum of the time advances, which is the earliest time the thread can become active again. Formally: \(\mathit{NTime}[i] = min\{\mathit{CTime}[j],\forall \mathit{seg}_{j}\text{ which follow }\mathit{seg}_{i}\}\).

For example, Fig. 17.11b lists NTime[0] = (1, 0) since segment 0 is followed by segment 2 with increment (1, 0) and segment 4 with increment .
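Computing the next time advance table from the current time advance table and the segment graph successors is then a simple minimum over successors, for example (a sketch under the assumption that time stamps are compared lexicographically as in Sect. 17.2.1):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using TimeStamp = std::pair<uint64_t, uint64_t>;  // (t, delta), compared lexicographically

// NTime[i] = minimum CTime[j] over all segments j that follow segment i.
std::vector<TimeStamp> build_next_time_table(
        const std::vector<TimeStamp> &CTime,
        const std::vector<std::vector<int>> &successors) {
    const TimeStamp infinity{UINT64_MAX, UINT64_MAX};  // segment without successors never resumes
    std::vector<TimeStamp> NTime(CTime.size(), infinity);
    for (size_t i = 0; i < CTime.size(); ++i)
        for (int j : successors[i])
            NTime[i] = std::min(NTime[i], CTime[j]);
    return NTime;
}
```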

There are two types of timing hazards, namely, direct and indirect ones. For a candidate thread \(th_1\) to be issued, a direct timing hazard exists when another thread \(th_2\) that is safe to run resumes its execution at a time earlier than the local time of \(th_1\). In this case, the future of \(th_2\) is unknown and could potentially affect \(th_1\). Thus, it is not safe to issue \(th_1\).

Table 17.2 shows an example where a thread \(th_1\) is considered for execution at time (10:2). If there is a thread \(th_2\) with local time (10:0) whose next segment runs at time (10:1), i.e., before \(th_1\), then the execution of \(th_1\) is not safe. However, if we know from the time advance tables that \(th_2\) will resume its execution later at (12:0), no timing hazard exists with respect to \(th_2\).

Table 17.2 Examples for direct and indirect timing hazards

An indirect timing hazard exists if a third thread \(th_3\) can wake up earlier than \(th_1\) due to an event notified by \(th_2\). Again, Table 17.2 shows an example. If \(th_2\) at time (10:0) potentially wakes a thread \(th_3\) so that \(th_3\) runs in the next delta cycle (10:1), i.e., earlier than \(th_1\), then it is not safe to issue \(th_1\).

5.4 Source Code Instrumentation

As shown above in Fig. 17.4 on page 31, the RISC compiler and parallel simulator work closely together. The compiler performs the complex conservative static analysis and passes the analysis results to the simulator which then can make safe scheduling decisions quickly.

More specifically, the RISC compiler passes all the generated conflict tables to the simulator, namely, the data conflict table, the event notification table, as well as the current and next time advance tables. In addition, the compiler instruments the model source code so that the simulator can properly identify each thread and each segment by unique numeric IDs.

To pass information from the compiler to the simulator, RISC uses automatic model instrumentation. That is, the intermediate model generated by the compiler contains instrumented (automatically generated) source code which the simulator then can rely on. At the same time, the RISC compiler also instruments user-defined SystemC channels with automatic protection against race conditions among communicating threads. The source code instrumentation with segment IDs, conflict tables, and automatic channel protection is a part of model “recoding” (i.e., the “R” in RISC). For the future, we envision additional recoding tasks performed by RISC, such as model transformation, optimization, and refinement.

In total, the RISC source code instrumentation includes four major components:

  1. Segment and instance IDs: Individual threads are uniquely identified by a creator instance ID and their current code location (segment ID). Both IDs are passed into the simulator kernel as additional arguments to all scheduler entry functions, including wait calls and thread creation.

  2. Data and event conflict tables: Segment concurrency hazards due to potential data conflicts, event conflicts, or timing conflicts are provided to the simulator as two-dimensional tables indexed by a segment ID and instance ID pair. For efficiency, these table entries are filtered for scope, instance path, and reference and port mappings.

  3. Current and next time advance tables: The simulator can make better scheduling decisions by looking ahead in time if it can predict the possible future thread states. This possible optimization is discussed in detail in [5] but remains as a future work item for the current RISC prototype.

  4. User-defined channel protection: SystemC allows the user to design channels for custom inter-thread communication. To ensure that such user-defined communication remains safe also in the OOO PDES situation, where threads execute truly in parallel and out of order, the RISC compiler automatically inserts locks (binary semaphores) into these user-defined channel instances (which are acquired at entry and released upon leaving) so that mutually exclusive execution of the channel methods is guaranteed. Otherwise, race conditions could exist when communicating threads exchange data.

After this automatic source code instrumentation, the RISC compiler passes the generated intermediate model to the underlying regular C++ compiler which produces the final simulator executable by linking the instrumented code against the RISC extended SystemC library.
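To illustrate the user-defined channel protection from item 4 above, the effect of the automatic instrumentation is roughly equivalent to the following manual pattern (a hedged illustration with our own channel and member names; the code actually generated by RISC may differ):

```cpp
#include <mutex>
#include <systemc>

// Simplified user-defined channel; a real one would also implement an sc_interface.
class my_channel : public sc_core::sc_module {
  public:
    SC_CTOR(my_channel) {}

    void write(int v) {
        std::lock_guard<std::mutex> guard(lock_);  // acquired at entry, released on return
        data_ = v;
    }

    int read() {
        std::lock_guard<std::mutex> guard(lock_);  // mutually exclusive with write()
        return data_;
    }

  private:
    std::mutex lock_;  // channel lock, conceptually inserted by the RISC instrumentation
    int data_ = 0;
};
```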

6 Experimental Evaluation

We now present two SystemC application models as examples and evaluate the performance of the DES, PDES, and OOO PDES algorithms on modern multi-core hosts. As the DES representative and baseline reference, we will use the open-source proof-of-concept library [16] provided by the SystemC Language Working Group [31] of the Accellera Systems Initiative [1]. As the OOO PDES representative, we will use RISC [21], which is also available as open source [20]. For synchronous PDES, we will use an in-house version of RISC where the out-of-order scheduling features are disabled. To ensure a fair comparison, all simulator packages are based on Posix threads, are compiled with the same optimization settings, and of course run on the same host environment.

6.1 Conceptual DVD Player Example

Our first example is an abstract model of a DVD player, as shown in Fig. 17.12. While this SystemC model is conceptual only, it is well-motivated and very educational as it clearly demonstrates the differences between the DES, PDES, and OOO PDES algorithms.

Fig. 17.12
figure 12

Conceptual DVD player example with a video and two audio stream decoders

As listed in Fig. 17.12, the SystemC modules representing the video and audio decoders operate in an infinite loop, reading a frame from the input stream, decoding it, and sending it out to the corresponding monitor modules. Since the video and audio frames are data independent, the decoders run in parallel and output the decoded frames according to their channel rate, 30 frames per second for video (delay 33.33 ms) and 38.28 frames per second for audio (delay 26.12 ms), respectively.

Figure 17.13 depicts the time lines of simulating the DVD player according to DES, PDES, and OOO PDES semantics. As expected, the DES schedule in Fig. 17.13a executes only a single task at all times. Synchronous PDES in Fig. 17.13b is able to parallelize the decoders for the left and right audio channels but cannot exploit parallelism for decoding the video channel due to its different frame rate. Only the OOO PDES schedule in Fig. 17.13c shows the fully parallel execution that we also expect in reality. Note that the artificially discretized timing in the model prevents PDES from full parallelization. In contrast, OOO PDES with thread-local timing achieves the goal.

Fig. 17.13
figure 13

Time lines for the DVD player example during simulation. (a) DES schedule: sequential execution, only one task at any time. (b) Synchronous PDES schedule: parallel execution only at the same simulation times. (c) Out-of-order PDES schedule: fully parallel execution (the same as in reality!)

Our experimental measurements listed in Table 17.3 confirm the analysis of Fig. 17.13. For both experiments, the synchronous PDES gains about 50% simulation speed over the reference DES. However, the out-of-order PDES beats the synchronous approach by another 100% improvement.

Table 17.3 Experimental results for the DVD player example. RISC V0.2.1 simulator performance (Posix-thread based) on an 8-core IntelⓇ XeonⓇ host PC (E3-1240 CPU, 4 cores, 2 hyper-threads) at 3.4 GHz

6.2 Mandelbrot Renderer Example

As a representative example of very computation-intensive and highly parallel applications, we have evaluated the three simulation algorithms also on a graphics pipeline model that renders a sequence of images of the Mandelbrot set. Figure 17.14 shows the block diagram of our Mandelbrot renderer model in SystemC.

Fig. 17.14
figure 14

Mandelbrot renderer example: Stimulus generates target coordinates in the complex plane for which the Design Under Test (DUT) renders the corresponding Mandelbrot image and sends it to the Monitor module. In the DUT, a Coordinator distributes slices of the target coordinates to parallel Mandelbrot worker threads and synchronizes their progress

The number of Mandelbrot worker threads is user configurable as a power of 2, for example, 4 as illustrated in Fig. 17.14. Each worker thread computes a different horizontal slice of the image and works independently and in parallel with the others. If enabled in the SystemC model, the progress of the workers’ computation is displayed in a window, as shown in Fig. 17.15.

Fig. 17.15
figure 15

Screenshot of the Mandelbrot set renderer in action: The progress of the parallel threads, which collaboratively compute a Mandelbrot set image, can be viewed live in a window on screen. Here, eight SystemC threads compute the eight horizontal slices of the image in parallel. Note that this visualization clearly shows the difference between the sequential DES simulation (which computes only one slice at any time) and the parallel PDES algorithms (which are shown here)

Table 17.4 shows the measured experimental results for the Mandelbrot renderer with different numbers of worker threads, one per image slice, as indicated in the first column. Since the example is “embarrassingly parallel” in the DUT and otherwise contains only comparatively little sequential computation, both parallel simulators show impressive performance speedups, more than 20× compared to the Accellera reference simulator. With growing parallelism, the simulation speed increases almost linearly with the number of parallel workers, up to the point where the number of software threads reaches the number of available hardware threads (2 CPUs, 8 cores with 2 hyper-threads each, so 32 in total).

Table 17.4 Experimental results for the Mandelbrot renderer example. RISC V0.2.1 simulator performance (Posix-thread based) on a 32-core IntelⓇ XeonⓇ host PC (2 E5-2680 CPUs, 8 cores, 2 hyper-threads) at 2.7 GHz

We can also observe that, for this example, synchronous PDES and OOO PDES perform the same since the differences are within the noise range of the measurement accuracy. This matches the expectation, because the Mandelbrot workers are all synchronized (in lockstep) due to their communication with the coordinator thread, and thus out-of-order execution cannot be exploited here.

As for scalability, the PDES and OOO PDES approaches scale very well if we base our expectation on the host hardware capabilities and the amount of parallelism exposed in the application model (which is the fundamental limitation of any PDES). Overall, we observe that parallel simulation has the potential to improve simulation speed by an order of magnitude or more. In this book chapter we do not evaluate the overhead of the static analysis incurred at compile time because our current compiler implementation does not produce meaningful results due to its unoptimized ROSE foundation. Generally, compile time for OOO PDES increases moderately, but is amortized by the typically longer and more frequent simulations [6].

7 Conclusion

In the era of stagnant processor clock frequencies and the growing availability of inexpensive multi- and many-core architectures, parallel simulation is a must-have for the efficient validation and exploration of embedded system-level design models. Consequently, the traditional purely sequential DES approach, as provided by the open-source SystemC reference simulator, is inadequate for the future, as system complexity keeps growing at its current exponential pace.

In this chapter, we have reviewed the classic discrete event-based simulation techniques with a focus on state-of-the-art parallel solutions (synchronous and out-of-order PDES) that are particularly suited for the de facto and official IEEE standard SystemC. While many approaches have been proposed in the research community, we have detailed the OOO PDES approach [6] being developed in the Recoding Infrastructure for SystemC (RISC) [21]. The open-source RISC project provides a dedicated SystemC compiler and corresponding out-of-order parallel simulator as a proof-of-concept implementation to the research community and industry.

The OOO PDES technology stands out from other approaches as an aggressive yet conservative modern simulation approach beyond traditional PDES, because it can exploit parallel computing resources to the maximum extent and thus achieves the fastest simulation speed. At the same time, it preserves compliance with traditional SystemC semantics and supports legacy models without modification or loss of accuracy.