1 Overview

High-level synthesis (HLS) is the process of automatically compiling a high-level software program (e.g., in C or C++) into a hardware design [1, 2]. HLS aims to increase designer productivity by allowing a higher abstraction level that eases and shortens the hardware design process. Furthermore, it intends to make hardware design available to programmers without hardware design expertise (e.g., software developers who wish to benefit from hardware parallelism) [1].

A standard HLS software-to-hardware flow is outlined in Fig. 8.1. The HLS frontend is a typical software compiler that parses the input code and transforms it into an optimized intermediate compiler representation. The remainder of the flow is hardware-specific: the HLS tool schedules operations of the intermediate representation into clock cycles and determines the resources required to implement the complete circuit; the end result is a register-transfer level (RTL) description of the circuit (e.g., in VHDL or Verilog). The remainder of this section elaborates on this process.

1.1 From Software Program to Intermediate Representation

Like a software compiler, an HLS tool parses the input software code and performs syntax and type checks. It then transforms it into an intermediate representation (IR) that typically describes the program in a graph or assembly form.

Fig. 8.1 High-level synthesis software-to-hardware flow

Fig. 8.2 An example of a compiler intermediate representation, organized into a control/dataflow graph

A standard way to represent producer-consumer relations among IR operations is a dataflow or data dependence graph (DFG). In a DFG, all program operations are represented as nodes and the data dependencies among them as edges. Conversely, a control flow graph (CFG) captures the control flow (i.e., conditional execution) of a program; it consists of basic blocks (BBs), connected by edges that represent the transfer of control from one BB to another. Internally, BBs are straight code sequences without any conditionals; all operations of a BB form a straight DFG that executes only when the condition to enter the BB has been determined [3].

Figure 8.2 shows an example of a program organized in a control/data flow graph (CDFG), which combines the concepts above in a hierarchical manner to describe all control and data dependencies of the original program. The control flow edges, connecting individual BBs, are shown as dashed lines. A portion of the datapath implementing the loop body is shown in the figure; edges between operations indicate data dependencies (i.e., producer-consumer relations). These concepts specify the execution order of particular operations: a producer operation must execute before its consumer, and a BB of operations executes only after the condition to enter it through the appropriate control flow edge has been determined. Thus, they have a key role in scheduling, as we will discuss in the following sections.
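As a concrete illustration, the producer-consumer relations of a single BB can be captured as a small graph structure; any topological order of the DFG is then a legal sequential execution order. The following is a minimal sketch with hypothetical operation names, not the exact graph of Fig. 8.2:

```python
# A minimal DFG sketch for one basic block: each key lists the
# producers (predecessors) whose results the operation consumes.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

dfg = {
    "mul1": [],                 # e.g., x[i] * a (no in-block producers)
    "mul2": [],                 # e.g., y[i] * b
    "add1": ["mul1", "mul2"],   # consumes both products
    "store": ["add1"],          # writes the sum back to memory
}

# Any topological order respects all producer-consumer relations:
order = list(TopologicalSorter(dfg).static_order())
print(order)  # producers always appear before their consumers
```

In an HLS frontend, this ordering constraint is exactly what scheduling must preserve when assigning operations to clock cycles.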

The compiler frontend performs a variety of optimizations to make the IR as efficient as possible, thus enabling parallelism opportunities in the later stages of the HLS flow. For instance, it performs code motion to move computations from one CFG portion to another, redundancy elimination to remove the computation of values that have already been computed and can be used later in an unmodified form, and tree balancing to reduce long computational chains into compact structures. Additionally, it analyzes the code to support and enable later optimizations; for instance, liveness analysis determines the liveness of each variable and enables register allocation, memory dependence analysis enables the optimization of memory accesses and construction of efficient memory interfaces, and loop unrolling replicates the loop body for spatial hardware parallelism [3].

Fig. 8.3 An HLS-produced circuit is organized into a datapath of operations implementing the functionality of the input program, memory and steering logic to send data to and from the datapath, and a controller that implements the schedule

1.2 From Intermediate Representation to Hardware Design

Until this point, the program representation was untimed. A central task of HLS is to transform it into a timed representation that specifies the execution time of each event in the resulting hardware implementation. Therefore, the HLS tool schedules the operations of the IR into clock cycles while extracting as much parallelism as possible from the code; simultaneously, it decides on the position of registers to meet the desired clock period target, maps operations onto the available FPGA resources, and defines the circuit interfaces that maximize the memory bandwidth [2].

The resulting circuit is organized as shown in Fig. 8.3:

  1.

    The datapath contains functional units implementing the operations of the original code.

  2.

    The memory elements (i.e., registers) store data items, and the steering and multiplexing logic moves the data into the datapath and memories.

  3.

    The controller, typically implemented as a finite state machine, dictates the operation schedule by producing enable signals for the registers and select signals for the multiplexers; it orchestrates the steering of data to and from the circuit (e.g., memory, input, and output ports) at appropriate times.

Ultimately, the HLS compiler produces an RTL description of the circuit that can then be passed down to FPGA vendor tools for synthesis, placement, and routing [1].

2 Datapath Scheduling

Scheduling is the process of converting an untimed program representation into a timed one by assigning each operation of a program to a time slot, typically expressed in discrete time units such as clock cycles. The duration of the clock cycle directly determines the operating frequency of the circuit, and the total number of clock cycles determines the execution latency.

The HLS tool devises the operation schedule according to some optimization objective (e.g., minimizing latency to achieve high performance); as mentioned in Sect. 8.1.2, it subsequently devises a controller that enforces this schedule by triggering operations at appropriate times. The scheduling process can be unconstrained or subject to a variety of resource, timing, and latency constraints, which complicate the scheduling problem.

2.1 Unconstrained Scheduling

The simplest form of scheduling is without any constraints; we here describe two complementary approaches.

As soon as possible (ASAP). ASAP scheduling aims to schedule operations in the earliest possible time slot, i.e., as soon as all predecessors have been scheduled in some preceding time step, with the goal of minimizing latency [4].

Fig. 8.4 A non-scheduled DFG of operations

Fig. 8.5 ASAP and ALAP schedules of the DFG in Fig. 8.4

Figure 8.5 shows an example of an ASAP schedule for the DFG of Fig. 8.4. Although not explicit in the figure, in the circuit implementation, each edge between operations will require a register whenever it crosses from one time step to another, to store the data that will be read on the following cycle. The resources (i.e., the number of functional units) that the circuit implementation requires are determined by the maximal number of concurrent operations (i.e., operations scheduled in the same time step) of the same type, as they must execute on different functional units; operations executing at different times can reuse the same functional unit, as we will discuss in Sect. 8.2.2.

As late as possible (ALAP). ALAP scheduling is complementary to ASAP: operations are scheduled as late as possible, starting from the sink of the graph and moving toward the earlier time steps; an operation is scheduled as soon as all of its successors have been scheduled [4].

Figure 8.5 contrasts the ALAP schedule with the ASAP schedule of the same graph. Both schedules achieve the best possible (i.e., minimal) latency; some operations are scheduled in the same time step in both schedules, whereas others are scheduled in a later step. The difference in operation start times between the ASAP and the ALAP schedule is referred to as slack. If the slack of an operation is greater than zero, it can be scheduled to another time slot without compromising the latency; a slack of zero indicates that the operation is on a latency-defining path and, thus, its movement would increase the latency. In this example, \({a_2}\) can be moved freely between time slots 2 and 3; however, a movement of \({a_1}\) would shift all succeeding operations and increase the latency to 5. The notion of slack is important when minimizing the resources under a latency constraint: it can be exploited to minimize the number of concurrent operations of the same type without a latency penalty.
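Both schedules and the slack between them are straightforward to compute by forward and backward passes over the DFG. The sketch below assumes unit-latency operations; the node names echo Fig. 8.4, but the edges are hypothetical, chosen so that \(a_2\) has a slack of 1 as in the text:

```python
# ASAP and ALAP schedules plus slack on a unit-latency DFG (sketch).
preds = {
    "m1": [], "m2": [], "m3": [], "m4": [],
    "a1": ["m1", "m2"],
    "c1": ["m3"],
    "a2": ["m4"],
    "s1": ["a1"],
    "m5": ["c1"],
    "a3": ["s1", "m5", "a2"],
}
succs = {n: [] for n in preds}
for n, ps in preds.items():
    for p in ps:
        succs[p].append(n)

def asap(preds):
    t = {}
    while len(t) < len(preds):  # place a node once all producers are placed
        for n, ps in preds.items():
            if n not in t and all(p in t for p in ps):
                t[n] = 1 + max((t[p] for p in ps), default=0)
    return t

def alap(succs, latency):
    t = {}
    while len(t) < len(succs):  # place a node once all consumers are placed
        for n, ss in succs.items():
            if n not in t and all(s in t for s in ss):
                t[n] = min((t[s] for s in ss), default=latency + 1) - 1
    return t

early = asap(preds)
late = alap(succs, max(early.values()))
slack = {n: late[n] - early[n] for n in preds}
print(max(early.values()))       # minimal latency: 4 time steps
print(slack["a2"], slack["a1"])  # a2 can slide by one step; a1 cannot
```

Operations with zero slack form the latency-defining paths of the graph.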

Fig. 8.6 A portion of an integer linear programming scheduling formulation for the example from Fig. 8.5

2.2 Constrained Scheduling

In real-life situations, scheduling can be constrained due to a variety of factors that impact the resulting schedule and its achievable latency. A common scheduling formulation accounts for a fixed number of available resources, thus requiring the latency and area to be traded off in different ways.

Integer linear programming (ILP). An exact scheduling problem is typically formulated as an ILP problem. The constraints are formulated as a system of linear constraints; the objective function minimizes latency under these constraints and the resulting integer values represent the clock cycle in which each operation needs to be scheduled [4].

Figure 8.6 shows examples of ILP constraints, formulated for the graph of Fig. 8.4 and assuming a latency bound of 5. In the equations, \( x_{\textrm{op},t} \) is a binary variable indicating whether operation op starts in time t. The constraints on the operation start times specify that each operation can start only once (in the first equation, \(m_3\) can start in time 1 or in time 2). The sequencing constraints indicate the timing relations between different operations (in the second equation, \(m_3\) must start before its successor \(c_1\)). The resource constraints specify the maximal number of units of the same type in every time step (in the third equation, 2 multipliers). These constraints can be used with different ILP objective functions—for instance, to minimize latency, the objective function minimizes the start times of all operations.
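The three constraint families can be written out as follows; this is a sketch consistent with the formulation described above (the latency bound of 5 and the limit of 2 multipliers follow the example), not a verbatim copy of Fig. 8.6:

```latex
% Binary variables: x_{op,t} = 1 iff operation op starts at time t.
% Operation start: every operation starts exactly once, e.g., for m_3:
x_{m_3,1} + x_{m_3,2} = 1
% Sequencing: a producer must start before its consumer, e.g., m_3 before c_1:
\sum_{t=1}^{5} t \cdot x_{c_1,t} \;\ge\; \sum_{t=1}^{5} t \cdot x_{m_3,t} + 1
% Resources: at most 2 multipliers may start in any time step t:
\sum_{\mathrm{op} \in \{m_1,\dots,m_5\}} x_{\mathrm{op},t} \le 2 \qquad \forall t \in \{1,\dots,5\}
% A latency-minimizing objective minimizes the sum of all start times:
\min \sum_{\mathrm{op}} \sum_{t=1}^{5} t \cdot x_{\mathrm{op},t}
```

Because all variables are binary, \(\sum_t t \cdot x_{\mathrm{op},t}\) evaluates exactly to the start time of op, which is what the sequencing constraints and the objective manipulate.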

Scheduling under resource constraints is an NP-hard problem; thus, in addition to exact algorithms, there are many approximate ways to identify an acceptable solution efficiently for complex graphs.

Fig. 8.7 List scheduling given a resource constraint of 2 multipliers (left) and 1 multiplier (right)

List scheduling. The idea of list scheduling is to prioritize the scheduling of certain operations based on an urgency metric. Typical examples include the length of the path from the operation to the sink of the graph (where a longer path corresponds to higher urgency) or slack (where a lower slack corresponds to higher urgency).

Figure 8.7 shows a schedule obtained through list scheduling, prioritizing operations on the longest path to the sink under a resource constraint of 2 multipliers (the first two scheduling steps are indicated below the schedule), contrasted with the same scheduling strategy under a constraint of 1 multiplier. Tightening the resource constraint comes at a latency penalty, a typical area-performance trade-off of constrained scheduling.

List scheduling is a heuristic approach: the decisions in each particular step are made without knowledge of further steps and of the conflicts that high-priority operations will encounter there. Thus, minimal latency is not guaranteed, but its low complexity of O(n) makes it an attractive solution for complex applications [4].
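The greedy procedure can be sketched as follows; the DFG below is a small hypothetical unit-latency example (not the graph of Fig. 8.7), and urgency is the length of the longest path to the sink:

```python
# List scheduling under per-type resource constraints (sketch).
from functools import lru_cache

preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["m3", "a1"]}
optype = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}

succs = {n: [] for n in preds}
for n, ps in preds.items():
    for p in ps:
        succs[p].append(n)

@lru_cache(maxsize=None)
def path_to_sink(n):  # urgency: longer path to the sink = higher priority
    return 1 + max((path_to_sink(s) for s in succs[n]), default=0)

def list_schedule(limits):
    sched, t = {}, 1
    while len(sched) < len(preds):
        # Ready: unscheduled ops whose producers finished in earlier cycles.
        ready = sorted((n for n in preds if n not in sched
                        and all(p in sched and sched[p] < t for p in preds[n])),
                       key=lambda n: -path_to_sink(n))
        used = {}
        for n in ready:  # greedily fill the available units by priority
            k = optype[n]
            if used.get(k, 0) < limits[k]:
                sched[n] = t
                used[k] = used.get(k, 0) + 1
        t += 1
    return sched

print(max(list_schedule({"mul": 2, "add": 1}).values()))  # latency with 2 multipliers
print(max(list_schedule({"mul": 1, "add": 1}).values()))  # latency with 1 multiplier
```

Tightening the multiplier constraint from 2 to 1 lengthens the schedule, mirroring the trade-off shown in Fig. 8.7.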

Fig. 8.8 Optimizing timing through operation pipelining (left) and chaining (right)

2.3 Timing Optimizations

Another timing aspect that HLS scheduling typically exploits is the ability to adjust the clock period of the circuit by adding or removing registers and trading off latency and the clock period in different ways. As shown in Fig. 8.8, pipelining inserts registers to break operations or paths into multiple time slots to reduce the circuit’s critical path, at a possible latency expense. Conversely, operation chaining fits multiple operations into a single clock cycle to execute combinationally; it saves registers and latency on combinational paths that are not the critical path of the circuit.

These optimizations are, thus, also included in modern HLS scheduling formulations as timing constraints, on top of the resource constraints discussed in the previous section.
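The chaining decision reduces to a simple timing check. The sketch below uses hypothetical integer delays in picoseconds and a hypothetical 2000 ps clock period, echoing the 1.4t/0.6t flavor of Fig. 8.8:

```python
# Chain two dependent operations combinationally only if their summed
# delay fits in the clock period; otherwise, pipeline them across cycles.
CLOCK_PERIOD_PS = 2000  # hypothetical clock period target

def can_chain(delays_ps):
    return sum(delays_ps) <= CLOCK_PERIOD_PS

print(can_chain([1400, 600]))   # fits: one cycle, saves a register and a cycle
print(can_chain([1400, 1000]))  # exceeds the period: the pair must be pipelined
```

A scheduler applies this check to every combinational path so that chaining never pushes the critical path beyond the clock period constraint.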

Fig. 8.9 Left-edge algorithm for resource sharing

2.4 Resource Binding and Sharing

Resource binding is the process of mapping operations of the program to physical resources. It can be accompanied by resource sharing to assign a single resource to multiple non-concurrent operations [4]. Binding and sharing can be applied on a scheduled or non-scheduled graph; we here illustrate the problem on a scheduled graph.

To identify which operations are compatible, i.e., can be implemented on the same resource, operations can be represented using compatibility and conflict graphs. In a compatibility graph, edges denote compatible (i.e., non-concurrent) operation pairs that can share a resource; the sharing problem can then be solved by clique partitioning, where each resulting clique corresponds to a resource instance, and optimal sharing is achieved by partitioning into a minimal number of cliques. The dual approach reasons about the conflict graph: edges denote conflicting operations, and the sharing problem can be solved by graph coloring. Optimal sharing is achieved by coloring with a minimal number of colors, where each color represents a resource instance (see right of Fig. 8.9).

In general, vertex coloring is an intractable problem. Yet, if the graph is an interval graph, the coloring can be achieved in polynomial time. This is the intuition behind the left-edge algorithm [4], which formulates the sharing problem on an interval graph. The input to the algorithm is the execution interval of each operation. The rationale is to sort the intervals in a list by their left edge (i.e., by their earliest possible start times) and assign non-overlapping intervals to a single color; when no remaining interval can be added to the current color, a new color is introduced and the procedure is repeated.

An example is shown in Fig. 8.9. Vertex \(v_1\) is assigned the first color; it is followed by \(v_6\) and \(v_4\), which overlap with the interval of \(v_1\) and, thus, require new colors. All other vertices can be assigned to existing colors, resulting in a total of three colors (i.e., three functional units, each executing the operations of a single color).
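The procedure can be sketched compactly; the intervals below are hypothetical, chosen so that the run matches the narrative above (\(v_1\) first, \(v_6\) and \(v_4\) opening new colors, three colors in total):

```python
# Left-edge algorithm for interval coloring (sketch).
def left_edge(intervals):
    """intervals: {name: (left, right)}; returns {name: color index}."""
    order = sorted(intervals, key=lambda v: intervals[v][0])  # sort by left edge
    colors, last_right = {}, []  # last_right[c]: right edge last placed in color c
    for v in order:
        l, r = intervals[v]
        for c, end in enumerate(last_right):
            if l > end:                # no overlap: reuse this functional unit
                colors[v], last_right[c] = c, r
                break
        else:                          # overlaps every color: open a new one
            colors[v] = len(last_right)
            last_right.append(r)
    return colors

intervals = {"v1": (1, 4), "v6": (1, 2), "v4": (2, 3),
             "v2": (5, 6), "v3": (3, 5), "v5": (4, 6)}
colors = left_edge(intervals)
print(max(colors.values()) + 1)  # number of functional units needed
```

Because the intervals are processed in left-edge order, each vertex is greedily packed into the first unit that is free at its start time.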

Although we here discussed sharing and binding of functional units, the same applies to registers, memories, buses, and other resource types; state-of-the-art binding and scheduling HLS formulations consider all these aspects.

3 Extracting Parallelism Through HLS Scheduling

In this section, we discuss the state-of-the-art HLS scheduling algorithms for FPGAs. We present the concept of system of difference constraints (SDC) modulo scheduling that HLS tools today rely on; we outline polyhedral techniques for memory and loop optimizations. We then discuss the inability of these techniques to handle irregular behavior and more recent solutions to overcome these limitations.

Fig. 8.10 A non-pipelined (left) and a perfectly pipelined (right) schedule with an initiation interval of 1

3.1 SDC-Based Modulo Scheduling

The techniques of Sect. 8.2 minimize the latency of a single datapath, but they are not sufficient to extract parallelism when datapaths repeat (e.g., in a loop execution). Simply executing one iteration after another in a sequential way would result in underutilized datapath resources and low performance, as shown on the left of Fig. 8.10: the total latency corresponds to the sum of the latencies of the individual datapaths, \(N\cdot Lat\), where N is the number of iterations and Lat is the latency of a single datapath.

Loop pipelining is one of the main performance optimization techniques in HLS—it allows the overlapping of loop iterations such that the datapath is used in the best possible way while honoring all data, control, and memory dependencies of the program. Pipelining originates from modulo scheduling techniques for Very Long Instruction Word (VLIW) processors [5], which aim to restructure the code to exploit instruction-level parallelism among loop iterations. As in the case of VLIWs and as discussed before, it is up to the HLS compiler to devise the schedule and create a controller (i.e., a finite state machine) that triggers operations according to this schedule.

A pipeline is characterized by its initiation interval (II), defined as the number of clock cycles between consecutive loop iterations. The best possible II is equal to 1 and indicates that a new iteration starts on every consecutive clock cycle (this is the case for the schedule on the right of Fig. 8.10). The total pipeline latency is now \((N-1) \cdot \textrm{II} + Lat\). The II increases in the presence of data, memory, or control dependencies between iterations, which postpone the start time of the next iteration and thus lower performance. Similarly, if a dependency is undeterminable at compile time, the HLS tool must assume its presence and increase the II accordingly; we will discuss this scenario in Sect. 8.3.3.
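The latency formula and the effect of loop-carried dependencies on the II can be illustrated numerically; the dependence cycles below are hypothetical, and each is given as a (delay, distance) pair:

```python
# Pipeline latency and the recurrence-constrained minimum II (sketch).
import math

def pipeline_latency(n_iters, ii, lat):
    """Total cycles for a pipelined loop: (N - 1) * II + Lat."""
    return (n_iters - 1) * ii + lat

def rec_min_ii(cycles):
    """Each dependence cycle (delay, distance) forces II >= delay/distance."""
    return max(math.ceil(d / k) for d, k in cycles)

print(pipeline_latency(4, 1, 3))     # 6 cycles for 4 iterations at II = 1, Lat = 3
print(rec_min_ii([(3, 1), (4, 2)]))  # a 3-cycle, distance-1 recurrence forces II = 3
```

With II = 1 the pipeline throughput approaches one iteration per cycle; a loop-carried dependence with a long delay and short distance raises the achievable II and thus lowers throughput.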

State-of-the-art commercial and academic HLS solutions today rely on SDC modulo scheduling [6,7,8] to achieve high-throughput pipelines. The idea is to describe all scheduling constraints as a system of difference constraints in the form \(x - y \le b\), where x and y are integer variables and b is a constant, and formulate a linear programming problem that minimizes the II under these constraints. Such a formulation supports a wide variety of constraints, such as data dependencies among operations, control dependencies between BBs, frequency, latency, and various resource constraints. The SDC scheduling problem is typically solved iteratively: the HLS tool attempts to find a solution for the desired II; in case it is not found, the II is incremented and the scheduling procedure is repeated [7].
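One reason the SDC form is attractive is that the feasibility of a system of difference constraints can be checked efficiently with Bellman-Ford-style relaxation on the constraint graph. The following is an illustrative sketch, not the formulation of any specific tool:

```python
# Feasibility check for difference constraints x - y <= b (sketch).
def sdc_feasible(variables, constraints):
    """constraints: list of (x, y, b) meaning x - y <= b.
    Returns a satisfying assignment, or None if infeasible."""
    # Each constraint is an edge y -> x with weight b; starting all
    # distances at 0 acts like a virtual source reaching every variable.
    dist = {v: 0 for v in variables}
    for _ in range(len(variables) + 1):
        changed = False
        for x, y, b in constraints:
            if dist[y] + b < dist[x]:
                dist[x] = dist[y] + b
                changed = True
        if not changed:
            return dist  # shortest-path distances satisfy all constraints
    return None  # a negative cycle: the constraint system is infeasible

# t_m3 - t_c1 <= -1 encodes "m3 starts at least one cycle before c1".
print(sdc_feasible(["m3", "c1"], [("m3", "c1", -1)]) is not None)
# Contradictory constraints (each before the other) are infeasible.
print(sdc_feasible(["a", "b"], [("a", "b", -1), ("b", "a", -1)]) is None)
```

In modulo scheduling, cross-iteration constraints additionally involve the candidate II, which is why the solve is repeated with incremented II values until a feasible schedule is found.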

3.2 Polyhedral Analysis and Optimization

HLS tools must handle the scheduling of complex programs; thus, they need to describe program features in a compact, parametric, and general way.

Polyhedral analysis is a powerful compiler technique for describing program features such as loop properties (e.g., loop bounds, iterations, and strides) and memory accesses (e.g., memory access patterns and dependencies). It is used to reason about Static Control Parts (SCoPs) of the program, i.e., regions in which all control flow decisions and memory accesses are known at compile time. Within a SCoP, all loops and memory accesses can be described using integer polyhedra [9, 10].

Figure 8.11 shows examples of loops that are SCoPs: they have affine expressions in induction variables and parameters for loop bounds, control flow decisions, and memory accesses. The loops at the bottom of the figure cannot be described as SCoPs nor optimized with polyhedral analysis. In addition to loop properties, polyhedral techniques describe the memory access pattern of each load and store instruction, as illustrated in Fig. 8.12. Together with the schedule, this information is key to identify all read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) dependencies of the program.
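Within a SCoP, dependence questions reduce to integer problems on affine index expressions. A brute-force toy version of such a test is shown below for illustration only; real polyhedral libraries solve these systems symbolically and parametrically:

```python
# Toy dependence test for two affine accesses to the same array:
# write A[a*i + b] and read A[c*j + d], for iterations i, j in [0, n).
def affine_conflict(a, b, c, d, n):
    # A dependence exists iff a*i + b == c*j + d has an integer solution
    # inside the iteration domain (here checked by enumeration).
    return any(a * i + b == c * j + d
               for i in range(n) for j in range(n))

print(affine_conflict(2, 0, 2, 1, 10))  # A[2i] vs A[2j+1]: never alias
print(affine_conflict(1, 0, 1, 1, 10))  # A[i] vs A[j+1]: loop-carried dependence
```

When the test proves independence, the corresponding iterations can be reordered or executed in parallel without changing the program semantics.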

Fig. 8.11 Loops that can be described as SCoPs and optimized with polyhedral analysis (top), and loops that are not SCoPs (bottom)

Fig. 8.12 A parametric description of the iterations and memory accesses of a SCoP

Program transformations. The goal of polyhedral optimization is to simplify the program and expose parallelism by reordering loops and loop iterations and changing the program’s memory access patterns. The optimizations are performed by applying linear transformations on SCoPs such that the program semantics are preserved: independent loop iterations can be reordered and restructured, but those containing dependent references (e.g., memory dependencies) must be executed in the same order as in the original program [9, 10].

Polyhedral techniques are useful for optimizing HLS programs with statically determinable properties, as they can uncover new parallelism opportunities [10] that scheduling algorithms such as SDC scheduling (see Sect. 8.3.1) can exploit. However, they cannot be applied to general-purpose programs with irregular behaviors that cannot be captured by SCoPs.

Fig. 8.13 A static schedule (top), achieved by a standard HLS tool, and a dynamic schedule (bottom), which achieves higher parallelism by resolving memory accesses dynamically at circuit runtime. The schedules are realized by the circuits shown in Fig. 8.14

Fig. 8.14 A portion of a statically scheduled circuit (left) and a dynamically scheduled circuit (right), implementing the behavior of the code in Fig. 8.13

3.3 Dynamic Scheduling

The techniques from the previous sections rely on the HLS compiler to devise the best possible schedule; while this approach is effective when critical information on program execution and behavior is available at compile time, it fails in cases with statically undeterminable memory accesses, variable operation latencies, and unpredictable control flow. In such situations, the HLS tool must assume the worst-case scenario and devise the schedule accordingly, which often results in suboptimal throughput and low performance.

An example of one such situation is shown in Fig. 8.13. The code in the figure has indirect memory accesses to array hist; depending on the values of x, there may or may not be a RAW dependency between a load and a store from a previous iteration. A standard HLS tool must assume that a dependency is always present, devise the appropriate scheduling constraint (similar to the sequencing constraint in Fig. 8.6), and create a conservative schedule (top of the figure).

The most general way to avoid the limitations of static scheduling is to forgo triggering operations through a pre-planned, statically scheduled controller (as shown on the left of Fig. 8.14), which dictates the exact execution time of each operation, and instead make scheduling decisions as the circuit runs: an operation starts as soon as all conditions for its execution are satisfied (e.g., the operands are available or critical control decisions are resolved). Dataflow circuits [11,12,13] are a natural method to realize such behavior. They are built out of units that implement latency-insensitivity by communicating with their predecessors and successors through pairs of handshake control signals, which indicate the availability of a new piece of data from the source unit and the readiness of the target unit to accept it (as illustrated on the right of Fig. 8.14). The data is propagated from unit to unit as soon as memory and control dependencies allow it and is stalled by the handshaking mechanism otherwise, thus effectively devising a dynamic schedule at circuit runtime. Such a schedule is shown at the bottom of Fig. 8.13: the pipeline is stalled only when a dependency actually exists; otherwise, the loads and stores to array hist may execute out of order for performance benefits.
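The benefit of deciding at runtime can be quantified with a toy cycle-count model of a histogram-style loop. The latencies and input values below are hypothetical, and for simplicity the model only checks the immediately preceding iteration (a real dataflow circuit tracks all in-flight accesses):

```python
# Toy cycle-count comparison: conservative static vs dynamic scheduling.
def static_cycles(xs, dep_ii, lat):
    # Static HLS assumes a loop-carried RAW dependence on every iteration.
    return (len(xs) - 1) * dep_ii + lat

def dynamic_cycles(xs, dep_ii, lat):
    # A new iteration starts on the next cycle unless its hist index
    # actually matches the previous one (a real RAW dependence stalls it).
    start = 0
    for prev, cur in zip(xs, xs[1:]):
        start += dep_ii if cur == prev else 1
    return start + lat

xs = [3, 1, 4, 4, 2]             # hypothetical runtime values of x
print(static_cycles(xs, 4, 4))   # every iteration serialized
print(dynamic_cycles(xs, 4, 4))  # only the repeated index stalls
```

Only one pair of iterations actually conflicts, so the dynamic schedule finishes in roughly half the cycles of the conservative one for this input.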

Several works generate dataflow circuits from functional and imperative software program representations [14,15,16]. The most recent effort in the context of HLS for FPGAs is Dynamatic [16, 17], a complete and open-source HLS compiler that produces high-throughput dataflow circuits from C/C++ code. It incorporates features and compiler transformations to make dataflow circuits truly competitive in the context of modern HLS. The ability to adapt the schedule at runtime offers completely new optimization opportunities: memory dependencies can be resolved at runtime and key control decisions can be speculated on, just like in superscalar processors. Thus, dynamic HLS shows significant speedups when contrasted to state-of-the-art HLS tools [18, 19].

3.3.1 Pipelining and Resource Sharing in the Absence of a Static Schedule

Dataflow circuits should benefit from the same performance and area optimization opportunities as their statically scheduled counterparts; yet, the classic scheduling and sharing algorithms described in Sect. 8.2 are not applicable in this context.

In contrast to devising a predetermined pipeline with a fixed II (e.g., by SDC modulo scheduling, as described in Sect. 8.3.1), the performance of dataflow circuits can be optimized via slack matching: inserting pipeline buffers (i.e., FIFOs) of appropriate sizes to prevent stalls and increase parallelism [20, 21], as shown on the left of Fig. 8.15. Slack matching can be combined with frequency optimization to achieve high-throughput synchronous dataflow circuits that honor the required clock period constraint [22, 23]. Similarly, in dataflow circuits, the sharing suitability of operations depends on runtime decisions and schedule adaptations—classic resource sharing techniques that rely on compile time concurrency information (see Sect. 8.2.4) are therefore not applicable. Thus, instead of reasoning about the exact execution times of each operator, dynamic HLS relies on the information on average unit utilization in the steady state of the system and determines what to share accordingly [24]; a sharing implementation of a dataflow unit is shown on the right of Fig. 8.15. These optimizations are key to making dynamic HLS performance- and resource-competitive with static HLS designs.

Fig. 8.15 Pipelining dataflow circuits by inserting FIFOs (left) and a mechanism for sharing dataflow units (right)

Fig. 8.16 A load-store queue for dynamic memory dependency resolution (left) and a distributed speculation mechanism for dataflow circuits (right)

3.3.2 Dynamic Scheduling and Irregular Memory Accesses

When memory dependencies are statically unknown, standard HLS must assume the presence of the dependency; in terms of the scheduling algorithms above, a data dependency constraint conservatively dictates that the two accesses must be sequentialized. In contrast, dataflow circuits determine the presence or absence of a dependency at runtime using load-store queues (LSQs) [25,26,27], such as the one on the left of Fig. 8.16. They compare the possibly conflicting addresses at runtime and enforce the access order only when the accesses are dependent (e.g., in the example of Fig. 8.13, the LSQ postpones the load of the second iteration until the previous store, which targets the same address, completes); if the accesses are independent, the LSQ allows them to execute out of order (this is the case for the load of the third iteration, which executes before the preceding store). This scheduling flexibility and its performance benefits are impossible in static HLS.
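The core LSQ decision can be sketched as a simple address-disambiguation predicate; this is a toy model (real LSQs also allocate entries in program order and handle partial-address overlaps):

```python
# Toy LSQ policy: a load may bypass older, still-pending stores only
# when its address is known to differ from all of theirs.
def can_bypass(load_addr, pending_store_addrs):
    return all(addr != load_addr for addr in pending_store_addrs)

print(can_bypass(0x10, [0x20, 0x30]))  # independent: the load may go out of order
print(can_bypass(0x10, [0x10]))        # aliasing store in flight: the load must wait
```

Because the predicate is evaluated on runtime addresses, the ordering is enforced only when a dependence actually exists, unlike the compile-time worst-case assumption of static HLS.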

3.3.3 Dynamic Scheduling and Speculative Execution

Speculation is a classic superscalar processor feature that can significantly improve the performance of loops where the loop condition takes a long time to compute by tentatively starting a new loop iteration before the loop condition is known. In static HLS, this optimization is limited to only trivial cases and otherwise hindered by the inability of the static schedule to revert the execution to a prior state in case of a misspeculation. Instead, dataflow circuits support generic forms of speculation [28]: speculative data travels through the circuit and dedicated components implement a distributed squash-and-replay mechanism (as shown on the right of Fig. 8.16), conceptually similar to that of superscalar processors, which achieves high parallelism in control-dominated applications.

3.3.4 VLIWs Versus Superscalars

Dynamic scheduling is in strong contrast to the strategy of Sect. 8.3.1 and in direct analogy to the contrast of VLIW and superscalar processor scheduling: In VLIWs, it is up to the compiler to devise the fixed schedule, which avoids the need to perform dependency checks at runtime (as the schedule guarantees that they are honored) and results in simpler hardware implementations [5, 29]. In contrast, superscalar processors [30] rely on more complex hardware mechanisms to resolve memory and control dependencies at runtime as well as to speculate on critical decisions; this flexibility makes them applicable to a wider variety of situations, which is why they are the generally accepted solution for general-purpose software applications. The situation is the same with static and dynamic scheduling in HLS: static HLS achieves great parallelism in particular application classes, where techniques such as polyhedral analysis and SDC modulo scheduling are successful, but irregular software programs require the flexibility of dynamic scheduling [18].

4 Current Status and Outlook

In this section, we provide an overview of current trends in HLS for FPGAs. We discuss typical uses of HLS and active open-source HLS frameworks, and outline some of the challenges that modern HLS faces.

4.1 HLS Frameworks

In this section, we provide an overview of recent HLS frameworks targeting FPGAs.

C and OpenCL-based HLS frameworks. Apart from commercial HLS flows for FPGAs, such as AMD (formerly Xilinx) Vitis HLS [31] and Intel HLS [32], numerous open-source HLS projects are under active development. LegUp HLS [33] was originally developed as a complete open-source HLS flow, supporting C++ as well as task-oriented language constructs (e.g., OpenMP and Pthreads) [34]; it has recently been acquired by Microchip and is now closed-source. Bambu [35] is an open-source HLS research framework that supports a variety of FPGA backends and provides support for verification and debugging. Dynamatic [17] is an open-source HLS tool that produces dynamically scheduled dataflow circuits from C/C++ and supports features such as dynamic memory dependency resolution and speculation. DASS [36] has been developed on top of Dynamatic and Vitis HLS to combine the benefits of static and dynamic scheduling.

Compiler infrastructures. A majority of recent HLS flows rely on the well-established LLVM [37] compiler as a frontend to obtain an optimized intermediate representation from C/C++. LLVM provides a single IR describing the program as a CDFG, as described in Sect. 8.1.1; this serves as a starting point to implement either static or dynamic scheduling. Recently, MLIR [38] emerged as an alternative to LLVM; its compiler infrastructure allows the definition and composition of multiple IRs (also referred to as dialects), thus providing modularity and extensibility at different levels of abstraction. The CIRCT project [39] leverages MLIR to provide a variety of hardware-oriented features and abstractions; it incorporates some of the main transformations of Dynamatic, new IRs for hardware (e.g., Calyx [40]), and supports HLS code transformations (e.g., ScaleHLS [41]). All this, together with a C-based frontend (i.e., Polygeist [42]), will likely serve as a basis for the development of future open-source HLS flows.

Domain-Specific Languages for HLS. Many HLS efforts explore domain-specific languages as an HLS frontend to raise the level of abstraction, increase productivity, and ease the expression of particular domain-specific constructs [43]. Popular DSLs target domains where HLS is successful, such as image processing (e.g., Halide [44,45,46]), machine learning (e.g., HeteroCL [47]), and streaming applications (e.g., Spatial [48]). Most DSLs use an existing C-based HLS flow as a backend and thus ultimately rely on the HLS techniques described in this chapter.

Fig. 8.17
A snippet of code indicates the statements for memory port, maximizing data width from memory, local buffer and burst access, local variable for accumulation, and directives for parallelism.

HLS code restructuring to achieve high parallelism. Reproduced from George et al. [43]

4.2 HLS Code Restructuring and Annotations

Thanks to the HLS frameworks and languages above, HLS is gaining popularity in domains such as machine learning, image processing, graph processing, video transcoding, and networking [49]. However, HLS is still facing critical adoption challenges due to the difficulties of extracting the desired levels of performance: despite the raised programming abstractions, HLS programmers still need to restructure the code and annotate it with pragmas to guide the HLS tool in achieving good parallelism and the desired hardware characteristics. This typically requires significant hardware design expertise and makes HLS unavailable to non-expert users [49, 50].

Consider the example in Fig. 8.17, which contrasts naively written code that adds a set of integers held in external memory and stores back the result with restructured code that achieves significantly better performance: the restructured code accounts for data widths, memory interface communication, and other architectural aspects, thus allowing the HLS scheduler to exploit the parallelism available in the computation. This coding style is not necessarily accessible to software programmers without knowledge of the underlying architecture. Although several works attempt to automate the pragma insertion process [51, 52], and despite the ability of DSLs to hide many hardware-oriented details, the challenges of HLS programming remain one of the main factors preventing its broad usage.

4.3 Design Space Exploration

All the scheduling possibilities and constraints discussed in the previous sections, as well as the large design space achievable by different restructurings and annotations, create a complex design space with a variety of non-trivial design options: loops can be pipelined with different initiation intervals and unrolled with different factors, and resource constraints can be formulated for different FPGA resource types (e.g., DSPs, BRAMs); the design can be optimized for throughput or latency, as well as tuned to different frequencies. This forms a multi-objective optimization problem that aims to minimize a set of possibly conflicting design objectives; the result is a set of points forming a Pareto frontier [53].
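As a sketch, the Pareto frontier of a set of already-evaluated design points can be extracted as follows; the two-objective `(area, latency)` encoding is an illustrative choice, and real DSE flows handle more objectives and far larger spaces:

```python
def pareto_frontier(points):
    """Return the design points not dominated by any other point.

    Each point is a tuple of objective values to be minimized, e.g.
    (area, latency). Point a dominates point b if a is no worse in
    every objective and differs from b (hence strictly better in at
    least one).
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and a != b

    return [p for p in points
            if not any(dominates(q, p) for q in points)]
```

For the hypothetical points (4, 10), (5, 9), (6, 8), (5, 11), and (7, 8), the frontier keeps the first three: (5, 11) is dominated by (4, 10), and (7, 8) by (6, 8).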

Due to the large search space, it is difficult to evaluate design quality and to understand whether the Pareto-optimal points have actually been found or approached. Furthermore, the exploration itself requires the evaluation of particular points to continue in the appropriate direction: some approaches synthesize each point with HLS and evaluate it on the fly (at the expense of long runtimes), whereas others build analytical models to estimate area and performance (which may be less accurate, but faster) [53, 54]. Finally, different HLS tools have different implementation strategies and design spaces, so it is challenging to directly apply the notions of one tool’s DSE to another [54]. With a plethora of new HLS techniques and relevant metrics arising, DSE will certainly increase in complexity; the ability to handle it efficiently will be key to navigating this increasingly complex design space.

4.4 Functional and Formal Verification in HLS

HLS tools typically rely on functional verification of a particular circuit through hardware simulation and software/hardware co-simulation [55]. However, exhaustive hardware simulation may become infeasible or extremely time-consuming as designs increase in complexity. Furthermore, the lack of formal proofs of the correctness of particular compilation steps and of the resulting hardware modules prevents the adoption of HLS in domains where design iterations are significantly more expensive [56].

Recent efforts aim to formally verify the HLS process [57, 58]. Others employ formal methods to optimize HLS-produced circuits: Cheng et al. use an SMT-based solver to improve memory arbitration in HLS [59]. Geilen et al. employ model checking for buffering coarse-grain dataflow graphs [60]. Xu et al. use BDD-based reachability analysis [61] and induction [62] to prove particular behavioral properties of HLS-produced circuits and use them to improve their hardware implementation. Such formal methods are key to reasoning comprehensively about HLS transformations and the resulting circuits.

4.5 Frequency Estimates in HLS

A key task of HLS scheduling is to break combinational paths with registers and ensure that the circuit meets the target clock frequency, as discussed in Sect. 8.2.3. Yet, when placing registers, HLS typically relies on pre-characterized timing information [63], which fails to account for the effects of FPGA synthesis, placement, and routing, causing several undesired effects:

  1. Overestimated unit latencies may cause overly conservative pipelining and unneeded resource overheads (due to redundant register placements and the prevention of logic optimizations across register-separated pipeline stages).

  2. The same conservative pipelining may unnecessarily decrease parallelism, and thus performance, if a register is redundantly placed on a throughput-critical cycle.

  3. During placement and routing, the backend of the FPGA flow can introduce interconnect delays that are difficult to estimate in advance and cause the achieved frequency to deviate from the target [49].
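The first effect above can be reproduced with a toy model of delay-driven register insertion. The greedy chain partitioning and the delay values are illustrative assumptions; real HLS schedulers operate on dataflow graphs, not simple chains:

```python
def insert_registers(unit_delays, clock_period):
    """Greedily break a combinational chain of operation delays (ns)
    into pipeline stages so that each stage fits in the clock period.

    The pre-characterized `unit_delays` mirror the per-operation
    timing libraries HLS tools consult; overestimated entries
    translate directly into extra register stages.
    """
    stages, current, acc = [], [], 0.0
    for d in unit_delays:
        if acc + d > clock_period and current:
            stages.append(current)     # register boundary here
            current, acc = [], 0.0
        current.append(d)
        acc += d
    stages.append(current)
    return stages
```

With a 5 ns clock, the delays [2.0, 2.5, 1.0, 3.0] fit in two stages, but overestimating each unit by 0.5 ns pushes the same chain to three stages, i.e., an extra level of redundant registers.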

Recent works extend HLS scheduling formulations, such as the ones in Sect. 8.3, with physical design objectives; they aim to make HLS optimizations aware of the physical layout of the FPGA by including LUT mapping information into the scheduling problem [63,64,65] and by estimating routing congestion and interconnect delays [66, 67]. This information is critical to advance the quality of HLS designs and make them comparable to hand-optimized RTL [49, 50].