Abstract
High-level synthesis (HLS) is the process of compiling a software program into a digital circuit. This chapter provides a view into the HLS design flow and presents algorithms, tools, and methods to generate digital circuits from software descriptions. It details FPGA-oriented HLS techniques, discusses recent HLS advancements, and outlines the current challenges of HLS for FPGAs.
1 Overview
High-level synthesis (HLS) is the process of automatically compiling a high-level software program (e.g., in C or C++) into a hardware design [1, 2]. HLS aims to increase designer productivity by allowing a higher abstraction level that eases and shortens the hardware design process. Furthermore, it intends to make hardware design available to programmers without hardware design expertise (e.g., software developers who wish to benefit from hardware parallelism) [1].
A standard HLS software-to-hardware flow is outlined in Fig. 8.1. The HLS frontend is a typical software compiler that parses the input code and transforms it into an optimized intermediate compiler representation. The remainder of the flow is hardware-specific: the HLS tool schedules operations of the intermediate representation into clock cycles and determines the resources required to implement the complete circuit; the end result is a register-transfer level (RTL) description of the circuit (e.g., in VHDL or Verilog). The remainder of this section elaborates on this process.
1.1 From Software Program to Intermediate Representation
Like a software compiler, an HLS tool parses the input software code and performs syntax and type checks. It then transforms it into an intermediate representation (IR) that typically describes the program in a graph or assembly form.
A standard way to represent producer-consumer relations among IR operations is a dataflow or data dependence graph (DFG). In a DFG, all program operations are represented as nodes and the data dependencies among them as edges. Conversely, a control flow graph (CFG) captures the control flow (i.e., conditional execution) of a program; it consists of basic blocks (BBs), connected by edges that represent the transfer of control from one BB to another. Internally, BBs are straight-line code sequences without any conditionals; all operations of a BB form a DFG that executes only when the condition to enter the BB has been determined [3].
Figure 8.2 shows an example of a program organized in a control/data flow graph (CDFG) which combines the concepts above in a hierarchical manner to describe all control and data dependencies of the original program. The control flow edges, connecting independent BBs, are shown as dashed lines. A portion of the datapath implementing the loop body is shown in the figure; edges between operations indicate data dependencies (i.e., producer-consumer relations). These concepts specify the execution order of particular operations: a producer operation must execute before its consumer; a BB of operations executes only after the condition to enter it through the appropriate control flow edge has been determined. Thus, they have a key role in scheduling, as we will discuss in the following sections.
The compiler frontend performs a variety of optimizations to make the IR as efficient as possible, thus enabling parallelism opportunities in the later stages of the HLS flow. For instance, it performs code motion to move computations from one CFG portion to another, redundancy elimination to remove the computation of values that have already been computed and can be used later in an unmodified form, and tree balancing to reduce long computational chains into compact structures. Additionally, it analyzes the code to support and enable later optimizations; for instance, liveness analysis determines the liveness of each variable and enables register allocation, memory dependence analysis enables the optimization of memory accesses and construction of efficient memory interfaces, and loop unrolling replicates the loop body for spatial hardware parallelism [3].
1.2 From Intermediate Representation to Hardware Design
Until this point, the program representation was untimed. A central task of HLS is to transform it into a timed representation that specifies the execution time of each event in the resulting hardware implementation. Therefore, the HLS tool schedules the operations of the IR into clock cycles while extracting as much parallelism as possible from the code; simultaneously, it decides on the position of registers to meet the desired clock period target, maps operations onto the available FPGA resources, and defines the circuit interfaces that maximize the memory bandwidth [2].
The resulting circuit is organized as shown in Fig. 8.3:
1. The datapath contains functional units implementing the operations of the original code.

2. The memory elements (i.e., registers) store data items, and the steering and multiplexing logic moves the data into the datapath and memories.

3. The controller, typically implemented as a finite state machine, dictates the operation schedule by producing enable signals for the registers and select signals for the multiplexers; it orchestrates the steering of data to and from the circuit (e.g., memory, input, and output ports) at appropriate times.

Ultimately, the HLS compiler produces an RTL description of the circuit that can then be passed down to FPGA vendor tools for synthesis, placement, and routing [1].
2 Datapath Scheduling
Scheduling is the process of converting an untimed program representation into a timed representation by assigning each operation of a program to a time slot—typically, described in discrete time units, such as clock cycles. The duration of the clock cycle directly determines the operating frequency of the circuit and the total number of clock cycles determines the execution latency.
The HLS tool devises the operation schedule according to some optimization objective (e.g., minimizing the latency to achieve high performance); as mentioned in Sect. 8.1.2, it subsequently devises a controller that enforces this schedule by triggering operations at appropriate times. The scheduling process can be unconstrained or constrained by a variety of resource, timing, and latency constraints, which complicate the scheduling problem.
2.1 Unconstrained Scheduling
The simplest form of scheduling is without any constraints; we here describe two complementary approaches.
As soon as possible (ASAP). ASAP scheduling aims to schedule operations in the earliest possible time slot, i.e., as soon as all predecessors have been scheduled in some preceding time step, with the goal of minimizing latency [4].
Figure 8.5 shows an example of an ASAP schedule for the DFG of Fig. 8.4. Although not explicit in the figure, in the circuit implementation, each edge between operations will require a register whenever it crosses from one time step to another, to store the data that will be read on the following cycle. The resources (i.e., number of functional units) that the circuit implementation will require are bound by the maximal number of concurrent operations (i.e., operations scheduled in the same time step) of the same type, as they must execute on different functional units; operations executing in different times can reuse the same functional unit, as we will discuss in Sect. 8.2.2.
As late as possible (ALAP). ALAP scheduling is complementary to ASAP: operations are scheduled as late as possible, starting from the sink of the graph and moving toward the earlier time steps; an operation is scheduled as soon as all of its successors have been scheduled [4].
Figure 8.5 contrasts the ALAP schedule with the ASAP schedule of the same graph. Both schedules achieve the best possible (i.e., minimal) latency; some operations are scheduled in the same time step in both schedules, whereas others are scheduled in a later step. The difference in operation start times between the ASAP and the ALAP schedule is referred to as slack. If the slack of an operation is greater than zero, it can be scheduled to another time slot without compromising the latency; a slack of zero indicates that the operation is on a latency-defining path and, thus, its movement would increase the latency. In this example, \({a_2}\) can be moved freely between time slots 2 and 3; however, a movement of \({a_1}\) would shift all succeeding operations and increase the latency to 5. The notion of slack is important when minimizing the resources under a latency constraint: it can be exploited to minimize the number of concurrent operations of the same type without a latency penalty.
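The two procedures and the resulting slack can be sketched in a few lines of Python. The DFG below is hypothetical (unit-latency operations, names not taken from Fig. 8.4); it serves only to illustrate the mechanics:

```python
# ASAP/ALAP scheduling and slack on a small, hypothetical unit-latency DFG.
# `ops` must be in topological order; `deps` holds (producer, consumer) edges.

def asap(ops, deps):
    """Earliest start: one step after the latest predecessor."""
    sched = {}
    for op in ops:
        preds = [p for p, s in deps if s == op]
        sched[op] = max((sched[p] + 1 for p in preds), default=0)
    return sched

def alap(ops, deps, latency):
    """Latest start: one step before the earliest successor."""
    sched = {}
    for op in reversed(ops):
        succs = [s for p, s in deps if p == op]
        sched[op] = min((sched[s] - 1 for s in succs), default=latency)
    return sched

ops = ["m1", "m2", "m3", "a1", "a2"]
deps = [("m1", "a1"), ("m2", "a1"), ("a1", "a2"), ("m3", "a2")]

s_asap = asap(ops, deps)
s_alap = alap(ops, deps, max(s_asap.values()))
slack = {op: s_alap[op] - s_asap[op] for op in ops}
print(slack)  # only m3 has nonzero slack: it can slide by one time step
```

Operations with zero slack form the latency-defining (critical) path; only \(m_3\) can move without a latency penalty in this toy graph.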
2.2 Constrained Scheduling
In real-life situations, scheduling can be constrained due to a variety of factors that impact the resulting schedule and its achievable latency. A common scheduling formulation accounts for a fixed number of available resources, thus requiring the latency and area to be traded off in different ways.
Integer linear programming (ILP). An exact scheduling problem is typically formulated as an ILP problem. The constraints are formulated as a system of linear constraints; the objective function minimizes latency under these constraints and the resulting integer values represent the clock cycle in which each operation needs to be scheduled [4].
Figure 8.6 shows examples of ILP constraints, formulated for the graph of Fig. 8.4 and assuming a latency bound of 5. In the equations, \( x_{\textrm{op},t} \) is a binary variable indicating whether operation op starts in time t. The constraints on the operation start times specify that each operation can start only once (in the first equation, \(m_3\) can start in time 1 or in time 2). The sequencing constraints indicate the timing relations between different operations (in the second equation, \(m_3\) must start before its successor \(c_1\)). The resource constraints specify the maximal number of units of the same type in every time step (in the third equation, 2 multipliers). These constraints can be used with different ILP objective functions—for instance, to minimize latency, the objective function minimizes the start times of all operations.
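In generic form (a sketch of the standard textbook formulation, not the exact equations of the figure), with \(d_i\) the latency of operation \(i\), \(a_k\) the number of available units of type \(k\), and \(T\) the latency bound, the constraint families read:

```latex
% Assignment: every operation starts exactly once.
\sum_{t=1}^{T} x_{i,t} = 1 \quad \forall\, \text{operations } i
% Sequencing: a consumer j starts only after its producer i finishes.
\sum_{t} t \cdot x_{j,t} - \sum_{t} t \cdot x_{i,t} \ge d_i \quad \forall\, \text{edges } (i, j)
% Resources: at most a_k units of type k are busy in any time step.
\sum_{i \in \text{type } k} x_{i,t} \le a_k \quad \forall\, t,\ \forall\, \text{types } k
% Latency-minimizing objective: minimize the sum of start times.
\min \sum_{i} \sum_{t} t \cdot x_{i,t}
```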
Scheduling under resource constraints is an NP-hard problem; thus, in addition to exact algorithms, there are many approximate ways to identify an acceptable solution efficiently for complex graphs.
List scheduling. The idea of list scheduling is to prioritize the scheduling of certain operations based on an urgency metric. Typical examples include the length of the path from the operation to the sink of the graph (where a longer path corresponds to higher urgency) or slack (where a lower slack corresponds to higher urgency).
Figure 8.7 shows a schedule obtained through list scheduling by prioritizing operations on the longest path to the sink and a resource constraint of 2 multipliers (the first two scheduling steps are indicated below the schedule), contrasted with the same scheduling strategy with a constraint of 1 multiplier. Tightening the resource constraint comes at a latency penalty, which is a typical area-performance trade-off of constrained scheduling.
List scheduling is a heuristic approach: the decision made at each step uses no information about subsequent steps or the conflicts that high-priority operations may later encounter. Thus, minimal latency is not guaranteed, but its low complexity of O(n) makes it an attractive solution for complex applications [4].
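The strategy can be sketched as follows, assuming a hypothetical unit-latency DFG with the longest path to the sink as the urgency metric (all names, latencies, and resource limits are illustrative):

```python
# Resource-constrained list scheduling. At each time step, ready operations
# (all predecessors finished) are scheduled in urgency order until the
# per-type resource limit is reached.

def list_schedule(ops, deps, kind, limits, dist):
    sched, t = {}, 0
    while len(sched) < len(ops):
        ready = [o for o in ops if o not in sched and
                 all(p in sched and sched[p] < t for p, s in deps if s == o)]
        ready.sort(key=lambda o: -dist[o])      # most urgent first
        used = {}
        for o in ready:
            if used.get(kind[o], 0) < limits[kind[o]]:
                sched[o] = t
                used[kind[o]] = used.get(kind[o], 0) + 1
        t += 1
    return sched

ops = ["m1", "m2", "m3", "a1", "a2"]
deps = [("m1", "a1"), ("m2", "a1"), ("a1", "a2"), ("m3", "a2")]
kind = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
dist = {"m1": 2, "m2": 2, "m3": 1, "a1": 1, "a2": 0}  # longest path to sink

two = list_schedule(ops, deps, kind, {"mul": 2, "add": 1}, dist)
one = list_schedule(ops, deps, kind, {"mul": 1, "add": 1}, dist)
print(max(two.values()) + 1)  # 3 steps with two multipliers
print(max(one.values()) + 1)  # 4 steps with one multiplier
```

The one-multiplier run takes an extra step, reproducing in miniature the area-performance trade-off described above.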
2.3 Timing Optimizations
Another timing aspect that HLS scheduling typically exploits is the ability to adjust the clock period of the circuit by adding or removing registers and trading off latency and the clock period in different ways. As shown in Fig. 8.8, pipelining inserts registers to break operations or paths into multiple time slots to reduce the circuit’s critical path, at a possible latency expense. Conversely, operation chaining fits multiple operations into a single clock cycle to execute combinationally; it saves registers and latency on combinational paths that are not the critical path of the circuit.
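As a toy illustration of this trade-off, the sketch below greedily chains consecutive operators on a single combinational path until the clock period would be violated, implying a pipeline register at each cycle boundary. The operator delays and clock targets are hypothetical, and each single operator is assumed to fit within one cycle:

```python
# Chaining vs. pipelining on one combinational path: pack consecutive
# operator delays into cycles without exceeding the clock period.

def chain_into_cycles(delays, clock_period):
    cycles, acc = 1, 0.0
    for d in delays:
        if acc + d > clock_period:   # would violate timing: register here
            cycles, acc = cycles + 1, d
        else:
            acc += d                 # chain combinationally in this cycle
    return cycles

path = [1.5, 1.5, 2.0, 1.0]          # four chained operators, 6 ns total
print(chain_into_cycles(path, 8.0))  # 1 cycle: a slow clock chains everything
print(chain_into_cycles(path, 3.0))  # 2 cycles: a faster clock forces a register
```

A tighter clock period shortens the cycle time but increases latency in cycles, exactly the tension the scheduler must resolve.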
These optimizations are, thus, also included in modern HLS scheduling formulations as timing constraints, on top of the resource constraints discussed in the previous section.
2.4 Resource Binding and Sharing
Resource binding is the process of mapping operations of the program to physical resources. It can be accompanied by resource sharing to assign a single resource to multiple non-concurrent operations [4]. Binding and sharing can be applied on a scheduled or non-scheduled graph; we here illustrate the problem on a scheduled graph.
To identify which operations are compatible to be implemented on the same resource, operations can be represented using compatibility and conflict graphs. In a compatibility graph, the edges denote compatible (i.e., non-concurrent) operation pairs that can be implemented on the same resource. The sharing problem can then be solved by clique partitioning, where each resulting clique corresponds to a resource instance; optimal sharing is achieved by partitioning into a minimal clique number. The dual problem is to reason about the conflict graph. Edges denote conflicting operations and the sharing problem can be solved by graph coloring; optimal sharing is achieved by coloring with a minimal color number, where each color represents the resource instance (see right of Fig. 8.9).
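A minimal sketch of the conflict-graph view: each operation receives the lowest color not used by its already-colored conflicting neighbors. The graph is hypothetical (three concurrent multiplies that conflict pairwise, plus an add that runs later and conflicts with none of them):

```python
# Greedy coloring of a conflict graph; an edge means two operations are
# concurrent and cannot share a functional unit.

def greedy_color(nodes, conflicts):
    color = {}
    for v in nodes:
        # Colors already used by v's colored neighbors.
        taken = {color[u] for a, b in conflicts for u in (a, b)
                 if u in color and v in (a, b) and u != v}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

nodes = ["m1", "m2", "m3", "a1"]
conflicts = [("m1", "m2"), ("m1", "m3"), ("m2", "m3")]
cols = greedy_color(nodes, conflicts)
print(max(cols.values()) + 1)   # 3 units: the multiplies cannot share
```

Greedy coloring is only a heuristic; as noted below, minimal coloring of general graphs is intractable.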
In general, vertex coloring is an intractable problem. Yet, if the graph is represented as an interval graph, the coloring can be achieved in polynomial time. This is the intuition behind the left-edge algorithm [4] that formulates the sharing problem on an interval graph. The input to the algorithm is a set of execution intervals for each operation. The rationale is to sort the intervals in a list by the left edge (i.e., based on their earliest possible start times) and assign non-overlapping intervals to a single color; when all intervals of a color are exhausted, a new color is introduced and the procedure repeated.
An example is shown in Fig. 8.9. Vertex \(v_1\) is assigned the first color; it is followed by \(v_6\) and \(v_4\) that overlap with the interval of \(v_1\) and, thus, require new colors. All other vertices can be assigned to existing intervals, resulting in a total of three colors (i.e., three functional units, each of them executing the operations of a single color).
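The procedure can be sketched as follows; the intervals are hypothetical and only loosely modeled on the figure's example:

```python
# Left-edge algorithm: sort execution intervals by their left edge (start)
# and greedily pack non-overlapping intervals into the same color, i.e.,
# the same resource instance.

def left_edge(intervals):
    """intervals: {name: (start, end)}, end exclusive; returns {name: color}."""
    order = sorted(intervals, key=lambda v: intervals[v][0])
    colors, last_end = {}, []    # last_end[c]: end of the last interval in c
    for v in order:
        s, e = intervals[v]
        for c, end in enumerate(last_end):
            if s >= end:         # fits after the last interval of this color
                colors[v], last_end[c] = c, e
                break
        else:                    # overlaps every existing color: open a new one
            colors[v] = len(last_end)
            last_end.append(e)
    return colors

iv = {"v1": (0, 3), "v6": (1, 4), "v4": (2, 5),
      "v2": (3, 6), "v3": (4, 7), "v5": (5, 8)}
cols = left_edge(iv)
print(max(cols.values()) + 1)    # 3 functional units suffice
```

Here v1, v6, and v4 overlap pairwise and open the three colors; the remaining intervals reuse them, mirroring the three-unit outcome of the figure.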
Although we here discussed sharing and binding of functional units, the same applies to registers, memories, buses, and other resource types; state-of-the-art binding and scheduling HLS formulations consider all these aspects.
3 Extracting Parallelism Through HLS Scheduling
In this section, we discuss the state-of-the-art HLS scheduling algorithms for FPGAs. We present the concept of system of difference constraints (SDC) modulo scheduling that HLS tools today rely on; we outline polyhedral techniques for memory and loop optimizations. We then discuss the inability of these techniques to handle irregular behavior and more recent solutions to overcome these limitations.
3.1 SDC-Based Modulo Scheduling
The techniques of Sect. 8.2 minimize the latency of a single datapath, but they are not sufficient to extract parallelism when datapaths repeat (e.g., in a loop execution): simply executing one iteration after another in a sequential way would result in underutilized datapath resources and low performance, as shown on the left of Fig. 8.10: the total latency corresponds to the sum of latencies of individual datapaths, \(N\cdot Lat\), where N is the number of iterations and Lat the latency of a single datapath.
Loop pipelining is one of the main performance optimization techniques in HLS—it allows the overlapping of loop iterations such that the datapath is used in the best possible way while honoring all data, control, and memory dependencies of the program. Pipelining originates from modulo scheduling techniques for Very Long Instruction Word (VLIW) processors [5], that aim to restructure the code to exploit instruction-level parallelism among loop iterations. As in the case of VLIWs and as discussed before, it is up to the HLS compiler to devise the schedule and create a controller (i.e., a finite state machine) that triggers operations according to this schedule.
A pipeline is characterized by its initiation interval (II), defined as the number of clock cycles between consecutive loop iterations. The best possible II is equal to 1 and indicates that a new iteration starts on every consecutive clock cycle (this is the case for the schedule on the right of Fig. 8.10). The total pipeline latency is now \((N-1) \cdot \textrm{II} + Lat\). The II increases in the presence of data, memory, or control dependencies between iterations, which postpone the start time of the next iteration and thus lower performance. Similarly, if a dependency is undeterminable at compile time, the HLS tool must assume its presence and increase the II accordingly; we will discuss this scenario in Sect. 8.3.3.
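As a quick numeric check of the two latency formulas above (with hypothetical values for N, Lat, and II):

```python
# Sequential vs. pipelined loop latency, per the formulas in the text.

def sequential_latency(N, Lat):
    return N * Lat                  # iterations execute back to back

def pipelined_latency(N, Lat, II):
    return (N - 1) * II + Lat       # a new iteration starts every II cycles

N, Lat = 100, 5
print(sequential_latency(N, Lat))     # 500 cycles
print(pipelined_latency(N, Lat, 1))   # 104 cycles: near-perfect overlap
print(pipelined_latency(N, Lat, 2))   # 203 cycles: a dependency doubles the II
```

Even a one-cycle increase of the II roughly doubles the total latency for large N, which is why minimizing the II is the central objective of pipelining.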
State-of-the-art commercial and academic HLS solutions today rely on SDC modulo scheduling [6,7,8] to achieve high-throughput pipelines. The idea is to describe all scheduling constraints as a system of difference constraints in the form \(x - y \le b\), where x and y are integer variables and b is a constant, and formulate a linear programming problem that minimizes the II under these constraints. Such a formulation supports a wide variety of constraints, such as data dependencies among operations, control dependencies between BBs, frequency, latency, and various resource constraints. The SDC scheduling problem is typically solved iteratively: the HLS tool attempts to find a solution for the desired II; in case it is not found, the II is incremented and the scheduling procedure is repeated [7].
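The core of the approach, finding a feasible assignment for a system of difference constraints, can be sketched with Bellman-Ford shortest paths, since each constraint \(x - y \le b\) maps to a graph edge of weight b. The operations and constraints below are hypothetical; a real SDC scheduler additionally encodes II-dependent constraints for loop-carried dependencies and optimizes a linear objective:

```python
# Feasibility of a system of difference constraints (x - y <= b) via
# Bellman-Ford relaxation; a negative cycle means the system is infeasible.

def solve_sdc(variables, constraints):
    """constraints: (x, y, b) triples meaning x - y <= b.
    Returns a feasible assignment, or None if the constraints contradict."""
    dist = {v: 0 for v in variables}       # implicit source at distance 0
    for _ in range(len(variables)):        # relaxation passes
        changed = False
        for x, y, b in constraints:
            if dist[y] + b < dist[x]:
                dist[x] = dist[y] + b
                changed = True
        if not changed:
            break
    else:
        return None                        # still changing: negative cycle
    lo = min(dist.values())
    return {v: d - lo for v, d in dist.items()}  # shift earliest op to cycle 0

ops = ["m1", "m3", "a1", "a2"]
# "a1 starts at least one cycle after m1" encodes as t_m1 - t_a1 <= -1, etc.
cons = [("m1", "a1", -1), ("a1", "a2", -1), ("m3", "a2", -1)]
print(solve_sdc(ops, cons))                      # a feasible schedule
print(solve_sdc(ops, cons + [("a2", "m1", 0)]))  # None: contradictory
```

The contradictory case models what happens when a target II is too small for a loop-carried dependency: the tool detects infeasibility, increments the II, and retries.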
3.2 Polyhedral Analysis and Optimization
HLS tools must handle the scheduling of complex programs; thus, they need to describe program features in a compact, parametric, and general way.
Polyhedral analysis is a powerful compiler technique for describing program features such as loop properties (e.g., loop bounds, iterations, and strides) and memory accesses (e.g., memory access patterns and dependencies). It is used to reason about Static Control Parts (SCoPs) of the program, i.e., regions in which all control flow decisions and memory accesses are known at compile time. Within a SCoP, all loops and memory accesses can be described using integer polyhedra [9, 10].
Figure 8.11 shows examples of loops that are SCoPs: they have affine expressions in induction variables and parameters for loop bounds, control flow decisions, and memory accesses. The loops at the bottom of the figure cannot be described as SCoPs nor optimized with polyhedral analysis. In addition to loop properties, polyhedral techniques describe the memory access pattern of each load and store instruction, as illustrated in Fig. 8.12. Together with the schedule, this information is key to identify all read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) dependencies of the program.
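To give a toy flavor of such dependence reasoning, the sketch below checks whether two affine accesses \(a[c_1 i + d_1]\) and \(a[c_2 j + d_2]\) can ever touch the same element over a bounded iteration domain. It brute-forces the domain for illustration only; real polyhedral tools answer the same question symbolically, for parametric bounds, with integer programming (all coefficients are hypothetical):

```python
# Brute-force affine dependence test over a bounded iteration domain.

def affine_conflict(c1, d1, c2, d2, N):
    """Do writes a[c1*i + d1] and reads a[c2*j + d2] ever alias
    for 0 <= i, j < N?"""
    writes = {c1 * i + d1 for i in range(N)}
    reads = {c2 * j + d2 for j in range(N)}
    return bool(writes & reads)

print(affine_conflict(2, 0, 2, 1, 8))  # False: even vs. odd elements
print(affine_conflict(1, 0, 1, 4, 8))  # True: elements 4..7 alias
```

When the test proves independence (the first case), the corresponding iterations can be reordered or parallelized; when it does not, the original access order must be preserved.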
Program transformations. The goal of polyhedral optimization is to simplify the program and expose parallelism by reordering loops and loop iterations and changing the program’s memory access patterns. The optimizations are performed by applying linear transformations on SCoPs such that the program semantics are preserved: independent loop iterations can be reordered and restructured, but those containing dependent references (e.g., memory dependencies) must be executed in the same order as in the original program [9, 10].
Polyhedral techniques are useful for optimizing HLS programs with statically determinable properties, as they can uncover new parallelism opportunities [10] that scheduling algorithms such as SDC scheduling (see Sect. 8.3.1) can exploit. However, they cannot be applied in general-purpose programs with irregular behaviors that cannot be captured by SCoPs.
3.3 Dynamic Scheduling
The techniques from the previous sections rely on the HLS compiler to devise the best possible schedule; while this approach is effective when critical information on program execution and behavior is available at compile time, it fails in cases with statically undeterminable memory accesses, variable operation latencies, and unpredictable control flow. In such situations, the HLS tool must assume the worst-case scenario and devise the schedule accordingly, which often results in suboptimal throughput and low performance.
An example of one such situation is shown in Fig. 8.13. The code in the figure has indirect memory accesses to array hist; depending on the values of x, there may or may not be a RAW dependency between a load and a store from a previous iteration. A standard HLS tool must assume the presence of a dependency, devise the appropriate scheduling constraint (similar to the sequencing constraint in Fig. 8.6), and create a conservative schedule (top of the figure), assuming that a dependency is always present.
The most general way to avoid the limitations of static scheduling is to forgo operation triggering through a pre-planned, statically scheduled controller (as shown on the left of Fig. 8.14) that dictates the exact execution time of each operation, and instead to make scheduling decisions as the circuit runs: as soon as all conditions for execution are satisfied (e.g., the operands are available or critical control decisions are resolved), an operation starts. Dataflow circuits [11,12,13] are a natural method to realize such behavior. They are built out of units that implement latency-insensitivity by communicating with their predecessors and successors through pairs of handshake control signals, which indicate the availability of a new piece of data from the source unit and the readiness of the target unit to accept it (as illustrated on the right of Fig. 8.14). The data is propagated from unit to unit as soon as memory and control dependencies allow it and stalled by the handshaking mechanism otherwise, thus effectively devising a dynamic schedule at circuit runtime. Such a schedule is shown at the bottom of Fig. 8.13: the pipeline is stalled only when a dependency actually exists; otherwise, the loads and stores to array hist may execute out of order for performance benefits.
Several works generate dataflow circuits from functional and imperative software program representations [14,15,16]. The most recent effort in the context of HLS for FPGAs is Dynamatic [16, 17], a complete and open-source HLS compiler that produces high-throughput dataflow circuits from C/C++ code. It incorporates features and compiler transformations to make dataflow circuits truly competitive in the context of modern HLS. The ability to adapt the schedule at runtime offers completely new optimization opportunities: memory dependencies can be resolved at runtime and key control decisions can be speculated on, just like in superscalar processors. Thus, dynamic HLS shows significant speedups when contrasted to state-of-the-art HLS tools [18, 19].
3.3.1 Pipelining and Resource Sharing in the Absence of a Static Schedule
Dataflow circuits must benefit from the same performance and area optimization opportunities as their statically scheduled counterparts; yet, classic scheduling and sharing algorithms described in Sect. 8.2 are not applicable in this context.
In contrast to devising a predetermined pipeline with a fixed II (e.g., by SDC modulo scheduling, as described in Sect. 8.3.1), the performance of dataflow circuits can be optimized via slack matching: inserting pipeline buffers (i.e., FIFOs) of appropriate sizes to prevent stalls and increase parallelism [20, 21], as shown on the left of Fig. 8.15. Slack matching can be combined with frequency optimization to achieve high-throughput synchronous dataflow circuits that honor the required clock period constraint [22, 23]. Similarly, in dataflow circuits, the sharing suitability of operations depends on runtime decisions and schedule adaptations—classic resource sharing techniques that rely on compile time concurrency information (see Sect. 8.2.4) are therefore not applicable. Thus, instead of reasoning about the exact execution times of each operator, dynamic HLS relies on the information on average unit utilization in the steady state of the system and determines what to share accordingly [24]; a sharing implementation of a dataflow unit is shown on the right of Fig. 8.15. These optimizations are key to making dynamic HLS performance- and resource-competitive with static HLS designs.
3.3.2 Dynamic Scheduling and Irregular Memory Accesses
When memory dependencies are statically unknown, standard HLS must assume the presence of the dependency—in terms of the scheduling algorithms above, a data dependency constraint conservatively dictates that the two accesses must be sequentialized. In contrast, dataflow circuits determine the presence or absence of a dependency during runtime using load-store queues (LSQs) [25,26,27], such as the one on the left of Fig. 8.16. They compare the possibly conflicting addresses at runtime and enforce the access order only when they are dependent (e.g., in the example of Fig. 8.13, the LSQ postpones the load of the second iteration until the previous store, which targets the same address, completes); if the accesses are independent, the LSQ allows them to execute out of order (this is the case for the load of the third iteration, which executes before the preceding store). This scheduling flexibility and its performance benefits are impossible in static HLS.
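The check an LSQ performs can be reduced to a toy model: a load may bypass older in-flight stores only if their addresses differ. The function and values below are illustrative, mirroring the hist example discussed above:

```python
# Toy model of the LSQ's runtime dependency check.

def can_issue_load(load_addr, pending_store_addrs):
    """A load may execute out of order iff no older in-flight store
    targets the same address (which would be a RAW dependency)."""
    return all(addr != load_addr for addr in pending_store_addrs)

pending = [5]                        # an older store to hist[5] is in flight
print(can_issue_load(5, pending))    # False: dependent, the load must wait
print(can_issue_load(9, pending))    # True: independent, may bypass the store
```

A real LSQ additionally tracks program order among queue entries and forwards store data to matching loads; the address comparison above is the essence of its ordering decision.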
3.3.3 Dynamic Scheduling and Speculative Execution
Speculation is a classic superscalar processor feature that tentatively starts a new loop iteration before the loop condition is known; it can significantly improve the performance of loops whose exit condition takes a long time to compute. In static HLS, this optimization is limited to only trivial cases and otherwise hindered by the inability of the static schedule to revert the execution to a prior state in case of a misspeculation. Instead, dataflow circuits support generic forms of speculation [28]: speculative data travels through the circuit and dedicated components implement a distributed squash-and-replay mechanism (as shown on the right of Fig. 8.16), conceptually similar to that of superscalar processors, which achieves high parallelism in control-dominated applications.
3.3.4 VLIWs Versus Superscalars
Dynamic scheduling is in strong contrast to the strategy of Sect. 8.3.1 and in direct analogy to the contrast of VLIW and superscalar processor scheduling: In VLIWs, it is up to the compiler to devise the fixed schedule, which avoids the need to perform dependency checks at runtime (as the schedule guarantees that they are honored) and results in simpler hardware implementations [5, 29]. In contrast, superscalar processors [30] rely on more complex hardware mechanisms to resolve memory and control dependencies at runtime as well as to speculate on critical decisions; this flexibility makes them applicable to a wider variety of situations, which is why they are the generally accepted solution for general-purpose software applications. The situation is the same with static and dynamic scheduling in HLS: static HLS achieves great parallelism in particular application classes, where techniques such as polyhedral analysis and SDC modulo scheduling are successful, but irregular software programs require the flexibility of dynamic scheduling [18].
4 Current Status and Outlook
In this section, we provide an overview of current trends of HLS for FPGAs. We discuss typical usages of HLS and active open-source HLS frameworks, and outline some of the challenges that modern HLS is facing.
4.1 HLS Frameworks
In this section, we provide an overview of recent HLS frameworks targeting FPGAs.
C and OpenCL-based HLS frameworks. Apart from commercial HLS flows for FPGAs, such as AMD (formerly Xilinx) Vitis HLS [31] and Intel HLS [32], numerous open-source HLS projects are under active development. LegUp HLS [33] was originally developed as a complete open-source HLS flow, supporting C++ as well as task-oriented language constructs (e.g., OpenMP and Pthreads) [34]; it has recently been acquired by Microchip and is now closed-source. Bambu [35] is an open-source HLS research framework that supports a variety of FPGA backends and provides support for verification and debugging. Dynamatic [17] is an open-source HLS tool that produces dynamically scheduled dataflow circuits from C/C++ and supports features such as dynamic memory dependency resolution and speculation. DASS [36] has been developed on top of Dynamatic and Vitis HLS to combine the benefits of static and dynamic scheduling.
Compiler infrastructures. A majority of recent HLS flows rely on the well-established LLVM [37] compiler as a frontend to obtain an optimized intermediate representation from C/C++. LLVM provides a single IR describing the program as a CDFG, as described in Sect. 8.1.1; this serves as a starting point to implement either static or dynamic scheduling. Recently, MLIR [38] emerged as an alternative to LLVM; its compiler infrastructure allows the definition and composition of multiple IRs (also referred to as dialects), thus providing modularity and extensibility at different levels of abstraction. The CIRCT project [39] leverages MLIR to provide a variety of hardware-oriented features and abstractions; it incorporates some of the main transformations of Dynamatic, new IRs for hardware (e.g., Calyx [40]), and supports HLS code transformations (e.g., ScaleHLS [41]). All this, together with a C-based frontend (i.e., Polygeist [42]), will likely serve as a basis for the development of future open-source HLS flows.
Domain-Specific Languages for HLS. Many HLS efforts explore domain-specific languages as an HLS frontend to raise the level of abstraction, increase productivity, and ease the expression of particular domain-specific constructs [43]. Popular DSLs target domains where HLS is successful, such as image processing (e.g., Halide [44,45,46]), machine learning (e.g., HeteroCL [47]), and streaming applications (e.g., Spatial [48]). Most DSLs use an existing C-based HLS flow as a backend and thus ultimately rely on the HLS techniques described in this chapter.
4.2 HLS Code Restructuring and Annotations
Thanks to the HLS frameworks and languages above, HLS is gaining popularity in domains such as machine learning, image processing, graph processing, video transcoding, and networking [49]. However, HLS is still facing critical adoption challenges due to the difficulties of extracting the desired levels of performance: despite the raised programming abstractions, HLS programmers still need to restructure the code and annotate it with pragmas to guide the HLS tool in achieving good parallelism and the desired hardware characteristics. This typically requires significant hardware design expertise and makes HLS unavailable to non-expert users [49, 50].
Consider the example in Fig. 8.17, illustrating a naively written code to add a set of integers held in external memory and store back the result, compared to a restructured code achieving significantly better performance: the restructured code accounts for the data widths, memory interface communication, and other architectural aspects, thus allowing the HLS scheduler to exploit the parallelism available in the computation. This code writing style is not necessarily accessible to software programmers who do not have knowledge of the underlying architecture. Although several works attempt to automate the pragma insertion process [51, 52] and despite the ability of the DSLs to hide many hardware-oriented details, the challenges of HLS programming are still one of the main factors preventing its broad usage.
4.3 Design Space Exploration
All the scheduling possibilities and constraints discussed in the previous sections, as well as the large design space opened by different restructurings and annotations, create a complex design space with a variety of non-trivial design options: loops can be pipelined with different initiation intervals and unrolled with different factors; resource constraints can be formulated for different FPGA resource types (e.g., DSPs, BRAMs); the design can be optimized for throughput or latency, as well as tuned to different frequencies. This forms a multi-objective optimization problem that aims to optimize a set of possibly conflicting design objectives; the result is a set of points forming a Pareto frontier [53].
Design space exploration (DSE) faces several difficulties. First, due to the large search space, it is hard to evaluate design quality and to determine whether the Pareto-optimal points have actually been found or approached. Second, the exploration itself requires evaluating particular points to continue in the appropriate direction: some approaches synthesize each point with HLS and evaluate it on the fly (at the expense of long runtimes), whereas others build analytical models to estimate area and performance (which may be less accurate, but faster) [53, 54]. Finally, different HLS tools have different implementation strategies and design spaces, so it is challenging to directly transfer the DSE notions of one tool to another [54]. With a plethora of new HLS techniques and relevant metrics arising, DSE will certainly increase in complexity; the ability to handle it efficiently will be key to navigating this increasingly complex design space.
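The Pareto frontier mentioned above can be made concrete with a small sketch (the two objectives, `area` and `latency`, and the brute-force filtering are illustrative assumptions; real DSE engines use far more objectives and smarter search):

```cpp
#include <cassert>
#include <vector>

// A design point with two conflicting objectives, both to be minimized
// (e.g., resource usage vs. execution latency of an HLS design).
struct Point { double area; double latency; };

// q dominates p if it is no worse in both objectives and strictly
// better in at least one.
static bool dominates(const Point &q, const Point &p) {
    return q.area <= p.area && q.latency <= p.latency &&
           (q.area < p.area || q.latency < p.latency);
}

// Keep only the non-dominated points: the Pareto frontier.
std::vector<Point> pareto_frontier(const std::vector<Point> &pts) {
    std::vector<Point> front;
    for (const Point &p : pts) {
        bool is_dominated = false;
        for (const Point &q : pts)
            if (dominates(q, p)) { is_dominated = true; break; }
        if (!is_dominated) front.push_back(p);
    }
    return front;
}
```

For example, among the points (1, 10), (2, 5), (3, 6), and (4, 4), the point (3, 6) is dominated by (2, 5) and is discarded; the other three form the frontier. The cost of obtaining each point's coordinates, whether by full synthesis or by analytical estimation, is exactly the trade-off discussed above.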
4.4 Functional and Formal Verification in HLS
HLS tools typically rely on functional verification of a particular circuit through hardware simulation and software/hardware cosimulation [55]. However, exhaustive hardware simulation may become infeasible or extremely time-consuming as designs grow in complexity. Furthermore, the lack of formal proofs of the correctness of particular compilation steps and of the resulting hardware modules prevents the adoption of HLS in domains where design iterations are significantly more expensive [56].
Recent efforts aim to formally verify the HLS process itself [57, 58]. Others employ formal methods to optimize HLS-produced circuits: Cheng et al. use an SMT solver to improve memory arbitration in HLS [59]; Geilen et al. employ model checking to minimize the buffering of coarse-grain dataflow graphs [60]; Xu et al. use BDD-based reachability analysis [61] and induction [62] to prove particular behavioral properties of HLS-produced circuits and exploit them to improve the hardware implementation. Such formal methods are key to reasoning comprehensively about HLS transformations and the resulting circuits.
4.5 Frequency Estimates in HLS
A key task of HLS scheduling is to break combinational paths with registers and ensure that the circuit meets the target clock frequency, as discussed in Sect. 8.2.3. Yet, when placing registers, HLS typically relies on pre-characterized timing information [63], which fails to account for the effects of FPGA synthesis, placement, and routing, causing several undesired effects:
1. Overestimated unit latencies may cause conservative pipelining and unneeded resource overheads (due to redundant register placements and the prevention of logic optimizations across register-separated pipeline stages).
2. The same conservative pipelining may unnecessarily decrease parallelism and, thus, performance, if a register is redundantly placed on a throughput-critical cycle.
3. During placement and routing, the backend of the FPGA flow may introduce interconnect delay variations that are difficult to estimate and cause frequency discrepancies from the target [49].
Recent works extend HLS scheduling formulations, such as the ones in Sect. 8.3, with physical design objectives; they aim to make HLS optimizations aware of the physical layout of the FPGA by including LUT mapping information in the scheduling problem [63,64,65] and by estimating routing congestion and interconnect delays [66, 67]. This information is critical to advancing HLS design quality and making HLS designs comparable to hand-optimized RTL [49, 50].
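The reliance on pre-characterized timing can be illustrated with a minimal register-placement sketch (a hypothetical chain of operations with assumed per-operation delays in nanoseconds; a real scheduler works on a full dataflow graph, and each operation is assumed to fit within the clock period on its own):

```cpp
#include <cassert>
#include <vector>

// Greedy register placement along a chain of operations. Each operation
// has a pre-characterized combinational delay; a register is inserted
// whenever appending the next operation would make the current stage
// exceed the target clock period. Returns the indices of the operations
// after which registers are placed.
std::vector<int> place_registers(const std::vector<double> &delays,
                                 double clock_period) {
    std::vector<int> regs;
    double path = 0.0;  // combinational delay accumulated in this stage
    for (int i = 0; i < static_cast<int>(delays.size()); ++i) {
        if (path + delays[i] > clock_period) {
            regs.push_back(i - 1);  // register after the previous op
            path = 0.0;             // start a new pipeline stage
        }
        path += delays[i];
    }
    return regs;
}
```

For instance, four operations of 2 ns each under a 5 ns period yield one register after the second operation. The catch, as discussed above, is that the characterized delays are only estimates: if the post-route delays turn out smaller, the register was wasted (effects 1 and 2); if interconnect delays turn out larger, the stage misses the target frequency (effect 3).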
References
M. Hutton, V. Betz, J. Anderson, FPGA synthesis and physical design, in Electronic Design Automation for IC Implementation, Circuit Design, and Process Technology (CRC Press, 2017), pp. 395–436
R. Kastner, J. Matai, S. Neuendorffer, Parallel programming for FPGAs (2018). ArXiv e-prints arXiv:1805.03648
L. Torczon, K. Cooper, Engineering a Compiler, 2nd ed. (Morgan Kaufmann, 2011)
G. De Micheli, Synthesis and Optimization of Digital Circuits (McGraw-Hill, 1994)
B.R. Rau, Iterative modulo scheduling. Int. J. Parallel Programm. 24(1), 3–64 (1996)
J. Cong, Z. Zhang, An efficient and versatile scheduling algorithm based on SDC formulation, in Proceedings of the 43rd Design Automation Conference (San Francisco, CA, 2006) pp. 433–438
Z. Zhang, B. Liu, SDC-based modulo scheduling for pipeline synthesis, in Proceedings of the 32nd International Conference on Computer-Aided Design (San Jose, CA, 2013), pp. 211–218.
A. Canis, S.D. Brown, J.H. Anderson, Modulo SDC scheduling with recurrence minimization in high-level synthesis, in Proceedings of the 23rd International Conference on Field-Programmable Logic and Applications (Munich, 2014), pp. 1–8
L. Pouchet, P. Zhang, P. Sadayappan, J. Cong, Polyhedral-based data reuse optimization for configurable computing, in Proceedings of the 21st ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, 2013), pp. 29–38
W. Zuo, Y. Liang, P. Li, K. Rupnow, D. Chen, J. Cong, Improving high level synthesis optimization opportunity through polyhedral transformations, in Proceedings of the 21st ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, 2013), pp. 9–18
L.P. Carloni, K.L. McMillan, A.L. Sangiovanni-Vincentelli, Theory of latency-insensitive design. IEEE Trans. Comput.-Aided Des. Integrated Circ. Syst. 20(9), 1059–1076 (2001)
J. Cortadella, M. Kishinevsky, B. Grundmann, Synthesis of synchronous elastic architectures, in Proceedings of the 43rd Design Automation Conference (San Francisco, CA, 2006), pp. 657–662
S.A. Edwards, R. Townsend, M.A. Kim, Compositional dataflow circuits, in Proceedings of the 15th ACM-IEEE International Conference on Formal Methods and Models for System Design (Vienna, 2017), pp. 175–184
R. Townsend, M.A. Kim, S.A. Edwards, From functional programs to pipelined dataflow circuits, in Proceedings of the 26th International Conference on Compiler Construction, (Austin, TX, 2017), pp. 76–86.
M. Budiu, S.C. Goldstein, Pegasus: An Efficient Intermediate Representation (Carnegie Mellon University, Tech. Rep. CMU-CS-02-107, 2002)
L. Josipović, R. Ghosal, P. Ienne, Dynamically scheduled high-level synthesis, in Proceedings of the 26th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, 2018), pp. 127–136.
L. Josipović, A. Guerrieri, P. Ienne, Dynamatic: From C/C++ to dynamically scheduled circuits, in Proceedings of the 28th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Seaside, CA, 2020), pp. 1–10
L. Josipović, A. Guerrieri, P. Ienne, Synthesizing general-purpose code into dynamically scheduled circuits. IEEE Circ. Syst. Magaz. 21(1), 97–118 (2021)
L. Josipović, A. Guerrieri, P. Ienne, From C/C++ code to high-performance dataflow circuits. IEEE Trans. Comput.-Aided Des. Integrat. Circ. Syst. 41(7), 2142–2155 (2022)
G. Venkataramani, S.C. Goldstein, Leveraging protocol knowledge in slack matching, in Proceedings of the 25th International Conference on Computer-Aided Design (San Jose, CA, 2006), pp. 724–729
M. Najibi, P.A. Beerel, Slack matching mode-based asynchronous circuits for average-case performance, in Proceedings of the 32nd International Conference on Computer-Aided Design (San Jose, CA, 2013), pp. 219–225
L. Josipović, S. Sheikhha, A. Guerrieri, P. Ienne, J. Cortadella, Buffer placement and sizing for high-performance dataflow circuits, in Proceedings of the 28th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Seaside, CA, 2020), pp. 186–196
C. Rizzi, A. Guerrieri, P. Ienne, L. Josipović, A comprehensive timing model for accurate frequency tuning in dataflow circuits, in Proceedings of the 22nd International Conference on Field-Programmable Logic and Applications (Belfast, UK, 2022), pp. 375–383
L. Josipović, A. Marmet, A. Guerrieri, P. Ienne, Resource sharing in dataflow circuits, in Proceedings of the 30th IEEE Symposium on Field-Programmable Custom Computing Machines (New York, 2022), pp. 1–9
L. Josipović, P. Brisk, P. Ienne, An out-of-order load-store queue for spatial computing. ACM Trans. Embedded Comput. Syst. 16(5s), 125:1–125:19 (2017)
L. Josipović, A. Bhattacharyya, A. Guerrieri, P. Ienne, Shrink it or shed it! minimize the use of LSQs in dataflow designs, in Proceedings of the IEEE International Conference on Field Programmable Technology (Tianjin, 2019), pp. 197–205
J. Liu, C. Rizzi, L. Josipović, Load-store queue sizing for efficient dataflow circuits, in Proceedings of the IEEE International Conference on Field Programmable Technology (Hong Kong, 2022), pp. 1–9
L. Josipović, A. Guerrieri, P. Ienne, Speculative dataflow circuits, in Proceedings of the 27th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Seaside, CA, Feb. 2019, pp. 162–71.
M.S. Lam, Software pipelining: an effective scheduling technique for VLIW machines, in Proceedings of the 1988 ACM Conference on Programming Language Design and Implementation (Atlanta, GA, 1988), pp. 318–328
J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed. (Morgan Kaufmann, 2011)
Vitis High-Level Synthesis User Guide, AMD, 2022. [Online]. Available: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls
Intel HLS Compiler Pro Edition Reference Manual, Intel, 2022. [Online]. Available: https://www.intel.com/content/www/us/en/docs/programmable/683349/22-3/pro-edition-reference-manual.html
A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S.D. Brown, J.H. Anderson, LegUp: an open-source high-level synthesis tool for FPGA-based processor/accelerator systems. ACM Trans. Embedded Comput. Syst. 13(2), 24:1–24:27 (2013)
J. Choi, S. Brown, J. Anderson, From software threads to parallel hardware in high-level synthesis for FPGAs, in Proceedings of the IEEE International Conference on Field Programmable Technology (Kyoto, 2013), pp. 270–277
F. Ferrandi, V.G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, A. Tumeo, Bambu: an open-source research framework for the high-level synthesis of complex applications, in Proceedings of the 58th Design Automation Conference (Virtual, 2021), pp. 1327–1330
J. Cheng, L. Josipović, G.A. Constantinides, P. Ienne, J. Wickerson, Combining dynamic and static scheduling in high-level synthesis, in Proceedings of the 28th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Seaside, CA, 2020), pp. 288–298
The LLVM Compiler Infrastructure, 2018. [Online]. Available: http://www.llvm.org
Multi-Level IR Compiler Framework, 2020. [Online]. Available: https://mlir.llvm.org/
CIRCT IR Compiler and Tools, 2020. [Online]. Available: https://github.com/llvm/circt
R. Nigam, S. Thomas, Z. Li, A. Sampson, A compiler infrastructure for accelerator generators, in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, 2021), pp. 804–817
H. Ye, C. Hao, J. Cheng, H. Jeong, J. Huang, S. Neuendorffer, D. Chen, ScaleHLS: a new scalable high-level synthesis framework on multi-level intermediate representation, in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (Seoul, 2022), pp. 741–755
W.S. Moses, L. Chelini, R. Zhao, O. Zinenko, Polygeist: raising C to polyhedral MLIR, in Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (Virtual, 2021), pp. 45–59
N. George, H. Lee, D. Novo, T. Rompf, K. Brown, A. Sujeeth, M. Odersky, K. Olukotun, P. Ienne, Hardware system synthesis from domain-specific languages, in Proceedings of the 23rd International Conference on Field-Programmable Logic and Applications (Munich, 2014), pp. 1–8
J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM Sigplan Notices 48(6), 519–530 (2013)
J. Li, Y. Chi, J. Cong, HeteroHalide: from image processing DSL to efficient FPGA acceleration, in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Seaside, CA, 2020), pp. 51–57.
J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-Kelley, M. Horowitz, Programming heterogeneous systems from an image processing DSL. ACM Trans. Arch. Code Opt. 14(3), 1–25 (2017)
Y.-H. Lai, Y. Chi, Y. Hu, J. Wang, C. H. Yu, Y. Zhou, J. Cong, Z. Zhang, HeteroCL: a multi-paradigm programming infrastructure for software-defined reconfigurable computing, in Proceedings of the 27th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Seaside, CA, 2019), pp. 242–251
D. Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, L. Nardi, A. Pedram, C. Kozyrakis et al., Spatial: a language and compiler for application accelerators, in Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, 2018), pp. 296–311.
J. Cong, J. Lau, G. Liu, S. Neuendorffer, P. Pan, K. Vissers, Z. Zhang, FPGA HLS today: successes, challenges, and opportunities. ACM Trans. Reconfigurable Tech. Syst. 15(4), 1–42 (2022)
Y.-H. Lai, E. Ustun, S. Xiang, Z. Fang, H. Rong, Z. Zhang, Programming and synthesis for software-defined FPGA acceleration: status and future prospects. ACM Trans. Reconf. Tech. Syst. 14(4), 1–39 (2021)
J. Cong, M. Huang, P. Pan, Y. Wang, P. Zhang, Source-to-source optimization for HLS. in FPGAs for Software Programmers (Springer, 2016), pp. 137–163.
J. Lau, A. Sivaraman, Q. Zhang, M.A. Gulzar, J. Cong, M. Kim, HeteroRefactor: refactoring for heterogeneous computing with FPGA, in 2020 IEEE/ACM 42nd International Conference on Software Engineering, (Seoul, 2020), pp. 493–505
B.C. Schafer, Z. Wang, High-level synthesis design space exploration: past, present, and future. IEEE Trans. Comput.-Aided Des. Int. Circ. Syst. 39(10), 2628–2639 (2020)
A. Sohrabizadeh, C.H. Yu, M. Gao, J. Cong, AutoDSE: enabling software programmers to design efficient FPGA accelerators. ACM Trans. Des. Automat. Electron. Syst. 27(4), 1–27 (2022)
Vivado High-Level Synthesis, Xilinx Inc., 2018. [Online]. Available: http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html
J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, Z. Zhang, High-level synthesis for FPGAs: from prototyping to deployment. IEEE Trans. Comput.-Aided Des. Int. Circ. Syst. 30(4), 473–491 (2011)
Y. Herklotz, Z. Du, N. Ramanathan, J. Wickerson, An empirical study of the reliability of high-level synthesis tools, in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (2021), pp. 219–223
F. Faissole, G.A. Constantinides, D. Thomas, Formalizing loop-carried dependencies in Coq for high-level synthesis, in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines, (2019), pp. 315–315
J. Cheng, S.T. Fleming, Y.T. Chen, J. Anderson, J. Wickerson, G.A. Constantinides, Efficient memory arbitration in high-level synthesis from multi-threaded code. IEEE Trans. Comput. 71(4), 933–946 (2022)
M. Geilen, T. Basten, S. Stuijk, Minimising buffer requirements of synchronous dataflow graphs with model checking, in Proceedings of the 42nd Design Automation Conference (Anaheim, CA, 2005), pp. 819–824
J. Xu, E. Murphy, J. Cortadella, L. Josipović, Eliminating excessive dynamism of dataflow circuits using model checking, in Proceedings of the 31st ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, 2023), pp. 27–37
J. Xu, L. Josipović, Automatic inductive invariant generation for scalable dataflow circuit verification, in Proceedings of the 42nd IEEE/ACM International Conference on Computer-Aided Design (San Francisco, CA, 2023, to appear)
M. Tan, S. Dai, U. Gupta, Z. Zhang, Mapping-aware constrained scheduling for LUT-based FPGAs, in Proceedings of the 23rd ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, 2015), pp. 190–199
C. Rizzi, A. Guerrieri, L. Josipović, An iterative method for mapping-aware frequency regulation in dataflow circuits, in Proceedings of the 60th ACM/IEEE Design Automation Conference (San Francisco, CA, 2023, to appear)
H. Wang, C. Rizzi, L. Josipović, MapBuf: simultaneous technology mapping and buffer insertion for HLS performance optimization, in Proceedings of the 42nd IEEE/ACM International Conference on Computer-Aided Design (San Francisco, CA, 2023, to appear)
L. Guo, Y. Chi, J. Wang, J. Lau, W. Qiao, E. Ustun, Z. Zhang, J. Cong, Autobridge: coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs, in Proceedings of the 29th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Virtual, 2021), pp. 81–92
J. Zhao, T. Liang, S. Sinha, W. Zhang, Machine learning based routing congestion prediction in FPGA high-level synthesis, in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (Florence, 2019), pp. 1130–1135
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Tu, K., Tang, X., Yu, C., Josipović, L., Chu, Z. (2024). High-Level Synthesis. In: FPGA EDA. Springer, Singapore. https://doi.org/10.1007/978-981-99-7755-0_8
Print ISBN: 978-981-99-7754-3
Online ISBN: 978-981-99-7755-0