1 Introduction

Reconfigurable circuits, such as Field-Programmable Gate Arrays (FPGAs), are well suited to the design of custom architectures that exploit the concurrent nature of hardware structures [6]. The configuration of an FPGA is commonly produced by a synthesis toolchain from a description written in a hardware description language (HDL) such as VHDL or Verilog. More expressive HDLs include Chisel [4], embedded in Scala, Clash [3] in Haskell, MyHDL [10] in Python and HardCaml in OCaml. Nevertheless, the Register Transfer Level (RTL) programming model, on which HDLs are based, offers a very low level of abstraction. Hence, several approaches aim to hardware-accelerate software applications using FPGAs.

  • There have been attempts to compile small applicative languages, such as SHard [23], FLOH [27] and Basic SCI [14], directly to RTL [13]. A representative example is SAFL (Statically Allocated Functional Language) [20], a first-order ML-like language limited to tail recursion and static data structures.

  • For more complex languages, custom processors or virtual machines can be implemented in RTL to run high-level languages on FPGA. JAIP [28] is a Java Virtual Machine (JVM) written in VHDL that calls a softcore processor to handle dynamic class loading. JikesRVM [19] is a JVM implemented on a CPU that uses an FPGA to accelerate automatic dynamic memory management.

  • High-Level Synthesis (HLS) promotes the use of imperative languages to design hardware [21]. Most HLS tools, such as Catapult C or Handel-C, support a subset of C annotated with pragmas to optimize the compilation to RTL. LegUp [5] runs C programs on a softcore processor while compiling selected functions, namely those that use neither dynamic allocation nor recursion, to RTL. Pylog [15] proposes a similar approach for running Python on FPGA platforms equipped with a hardcore processor.

  • Other HLS tools use OpenCL to express parallel applications and target heterogeneous architectures involving multicores, GPUs and FPGAs. Aparapi [24] and GVM [12] implement the JVM in OpenCL. TAPA [7] is a framework for task parallelism targeting OpenCL. TornadoVM [22] compiles specially annotated Java code to OpenCL. These tools, however, do not sufficiently expose the fine-grained parallelism available on FPGAs or their customization possibilities.

  • FPGAs allow the implementation of parallel programming models [18] such as task parallelism [7] and parallel skeletons [9]. For instance, Lime [2] is a task-based data-flow programming language compiled to OpenCL or Verilog and interacting with Java bytecode running on a CPU. Kiwi [25] is a subset of C\(\sharp \) compiled to RTL, offering events, monitors and threads. RIPL [26] is an image processing language with a collection of parallel skeletons.

The work described in this paper builds on these efforts by proposing an approach in which:

  • a runtime system, implemented on a softcore processor, is used to allow high-level programming on FPGA (like JAIP);

  • hardware acceleration of user-defined functions (like SAFL) is provided by partitioning the application code between host code and accelerated code (like Pylog);

  • the host language, running on the softcore processor, and the embedded language used to describe accelerated functions are similar (like Lime);

  • some predefined parallel constructs are provided to ease exploitation of the massive parallelism offered by FPGAs (like Kiwi).

To this end, we have:

  1. ported the OCaml virtual machine (VM) and its runtime (including a garbage collector) to a softcore processor, in order to support the entire OCaml language;

  2. combined this VM approach with hardware acceleration of user-defined functions expressed in an ML-like language. This language is extended with parallelism skeletons in order to process dynamic data structures allocated by the OCaml runtime in the memory of the softcore processor.

This approach makes it possible to take full advantage of the fine-grained parallelism of FPGAs while programming them in OCaml, and hence supports quick prototyping, static type-checking, simulation and debugging of applications mixing numeric and symbolic computations.

Our contributions are:

  • O2B (OCaml On Board), a port of OMicroB [29] (an implementation of the OCaml virtual machine) targeting the Nios II softcore processor realized on an FPGA. O2B enables OCaml programs to call custom hardware accelerators.

  • Macle (ML accelerator), a subset of OCaml designed to express hardware-accelerated user-defined functions, called Macle circuits. These functions, distinguished by a dedicated keyword, are compiled to VHDL and synthesized on an FPGA to be used as hardware accelerators from OCaml source programs executed by O2B. Glue code is automatically generated; it includes C and OCaml code, VHDL descriptions and scripts to control the end-to-end synthesis workflow. Macle supports OCaml data structures (such as lists, trees, arrays and matrices) allocated in the OCaml VM heap. Macle allows general recursion by automatically rewriting it into tail recursion with a local stack implemented in on-chip memory. Finally, Macle provides parallelism skeletons over OCaml arrays to expose fine-grained parallelism and optimize memory transfers.

The remainder of this paper is organized as follows. Sect. 2 introduces the O2B infrastructure to run OCaml programs on an FPGA. Sect. 3 proposes an approach to accelerate OCaml programs augmented with Macle circuits. Sect. 4 presents the compilation of Macle, using an intermediate language (HSML, Hierarchical State Machine Language) to abstract the VHDL target. Sect. 5 evaluates our approach on different benchmarks to measure the speedup resulting from hardware acceleration in Macle. Sect. 6 describes a mechanism using parallel skeletons to optimize memory transfers when accessing the OCaml heap. Sect. 7 discusses the acceleration possibilities and the resulting programming style, then identifies future work.

2 The O2B Framework

O2B (OCaml On Board) is a tool to run OCaml programs on FPGAs. It is based on OMicroB [29], an implementation of the OCaml VM dedicated to high-level programming of microcontrollers with scarce resources.

2.1 Compilation Flow for OCaml to FPGAs

Figure 1 describes the configuration process used to run OCaml programs on an Intel FPGA via O2B. The bytecode generated by the OCaml compiler is transformed into a static C array and embedded in the C program implementing the bytecode interpreter and the O2B runtime library, including a garbage collector (GC). The OCaml heap and stack are C static arrays. This program is linked with the functions of the Board Support Package (BSP), which give access to the hardware resources of the target board. The resulting application is compiled to binary code executable by the Nios II softcore processor.

Fig. 1: Compilation flow targeting Intel FPGAs

The complete FPGA configuration includes the exact architecture of the processor used as well as a set of external RTL descriptions \({\textsf {F1}~\cdots ~\textsf {Fn}}\) to be implemented as custom components \({\textsf {C1}~\cdots ~\textsf {Cn}}\). Technically, this configuration step is carried out by the QSys tool of the Intel Quartus chain. It generates a set of VHDL files which constitutes the description of the hardware platform. This description includes the components \({\textsf {C1}~\cdots ~\textsf {Cn}}\) and the Nios II processor to be synthesized through the Quartus chain to reconfigure the FPGA.

The OCaml heap and stack can be stored either in the on-chip memory of the target FPGA (for small programs) or in the external memory (SDRAM) of the board. In both cases, access is provided by means of an interconnection bus. This bus also supports data transfers between the custom components and the binary code executed by the processor. Both the softcore and the custom components can access the physical IOs of the FPGA.

2.2 Calling Accelerators from OCaml Programs

The OCaml language offers a foreign function interface (FFI) to call C functions from OCaml programs. These C functions, running on the softcore, can in turn invoke custom components implemented on the FPGA. It is thus possible to use custom components from OCaml programs compiled to bytecode and executed by O2B. Communication between O2B and a custom component goes through a set of dedicated registers associated with the component and mapped into the memory of the softcore processor.

Figure 2 shows the source code of an OCaml program designed to run with O2B. It defines three implementations of the greatest common divisor (GCD) algorithm: one in OCaml using tail recursion, one in C using a loop, and one in C using a hardware accelerator defined in VHDL in Fig. 7.

Fig. 2: An OCaml program using external C code on O2B

The difference between two calls to Timer.get_us (before and after a computation) in the OCaml function chrono gives the execution time of the argument function call in microseconds. The C function printf and, by extension, the OCaml functions print_int and print_string use the Board Support Package of the FPGA target to write to a console. The gcd_c and gcd_rtl functions are declared as external functions in the OCaml code using the standard FFI mechanism. Calling a custom component from the gcd_rtl function involves sending the arguments to, and retrieving the result from, the dedicated registers of the custom component. The corresponding operations are abstracted by the C macros GCD_ARG, GCD_START, GCD_RDY and GCD_RESULT. The behavioral description of GCD in VHDL requires 50 lines of code, plus 100 lines of VHDL glue code for argument and result passing. Finally, this GCD component must be mapped into the global configuration of the system implemented on the FPGA (called the System on Programmable Chip, SoPC), either manually (using the QSys tool) or by scripting. In practice, this limits the use of the O2B framework to programmers familiar with both VHDL and the target toolchain (Intel Quartus in the current distribution).
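Since Fig. 2 is not reproduced here, the sketch below gives an idea of what the OCaml side of such a program can look like. Only Timer.get_us, chrono, gcd_c, gcd_rtl, print_int and print_string are taken from the text; the C stub names ("ml_gcd_c", "ml_gcd_rtl"), the exact signature of chrono and the test values are assumptions, and the actual code of Fig. 2 may differ.

    external gcd_c   : int -> int -> int = "ml_gcd_c"    (* C loop running on the softcore *)
    external gcd_rtl : int -> int -> int = "ml_gcd_rtl"  (* C stub driving the VHDL accelerator *)

    (* Tail-recursive OCaml version, executed by the VM. *)
    let rec gcd_ml a b = if b = 0 then a else gcd_ml b (a mod b)

    (* Time a two-argument function call in microseconds using the board timer. *)
    let chrono name f x y =
      let t0 = Timer.get_us () in
      let r  = f x y in
      let t1 = Timer.get_us () in
      print_string name; print_string ": ";
      print_int r; print_string " (";
      print_int (t1 - t0); print_string " us)\n";
      r

    let () =
      ignore (chrono "gcd_ml " gcd_ml  147 105);
      ignore (chrono "gcd_c  " gcd_c   147 105);
      ignore (chrono "gcd_rtl" gcd_rtl 147 105)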

In the rest of the paper, we describe how this limitation can be overcome by providing an ML-like language, similar to OCaml, to describe hardware-accelerated functions and to automatically generate both the FPGA configuration and the glue code needed to interact with OCaml programs running on the softcore.

3 High-Level FPGA Programming

The proposed approach relies on a dedicated ML-like language to express hardware-accelerated functions. This language, called Macle (ML Accelerator), can interoperate with the OCaml runtime of O2B and therefore can be used to accelerate OCaml host programs running on a softcore processor realized on an FPGA.

3.1 Compilation Flow

Figure 3 shows our compilation flow from OCaml to FPGA. It automatically generates the configuration of an FPGA from an OCaml program extended with hardware-accelerated functions defined in Macle. The OCaml code is compiled to bytecode executed by O2B on a softcore processor implemented on the FPGA, whereas each Macle circuit is compiled to VHDL and then synthesized as a custom hardware component usable from the OCaml program. The glue code (including OCaml, C and VHDL files) is automatically generated from the inferred types of the Macle circuits. The compilation flow can therefore be used by a programmer without prior knowledge of VHDL or experience with the target FPGA programming toolset.

In Fig. 3, the two double arrows denote the calls of C functions from OCaml and the use of VHDL code from C.

Fig. 3: Accelerating OCaml programs on FPGA using O2B and Macle

3.2 The Macle Language

Macle is an ML-like language which includes:

  • a functional-parallel core language (called Macle Core) compiled to RTL;

  • additional language constructs (implemented in RTL) to interact with the OCaml runtime.

Figure 4a defines the syntax of Macle Core. This language is independent of OCaml and can be used to program synchronous circuits and compose them in parallel. We denote by \(\smash {\overrightarrow{o}}\) (or \(o_1~\cdots ~o_n\)) a non-empty sequence of objects \(o_i\). Macle Core includes variables (taken from a set of names \(\mathscr {X}\)), constants, applications of built-in operators and conditionals. It also offers local mutually tail-recursive functions, function calls and let bindings. A simple let binding first computes e, then \(e'\). By extension, a multiple let-binding (binding several expressions at once with let ... and ...) first computes the expressions \(e_1~\cdots e_n\) in parallel and synchronizes before computing \(e'\). For instance, the hardware implementation of a multiple let-binding of two calls to a factorial function instantiates the implementation of that function twice, enabling their parallel execution, as sketched below. Function calls use an implicit parallel let-binding to compute the arguments passed to each function. Non-recursive functions can take functions as arguments.
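Here is a minimal sketch, in plain OCaml syntax (which Macle Core follows closely), of such a parallel let-binding; the function names are ours, not taken from the paper's figures. In Macle, the two bindings are evaluated by two parallel FSM instances followed by a synchronization barrier, whereas ordinary OCaml evaluates them one after the other, which is what allows the same source to be tested on a PC.

    let parallel_facts x y =
      (* local tail-recursive function, as required by Macle Core *)
      let rec fact acc n = if n <= 1 then acc else fact (acc * n) (n - 1) in
      (* multiple let-binding: in Macle, a and b are computed in parallel *)
      let a = fact 1 x
      and b = fact 1 y in
      a + b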

Fig. 4: Syntax of the Macle language

Figure 4b defines the syntax of the Macle subset interacting with the OCaml runtime. It comprises:

  • raise exn, for raising a built-in exception exn such as Failure (parameterized by a string literal) or Stack_overflow;

  • match constructs, for destructuring (i.e., non-nested pattern matching) values of an algebraic datatype;

  • \(\texttt {!}e\) for accessing the content of the reference e;

  • \(e :=e'\) for setting the content of the reference e to the value of \(e'\);

  • \(e\texttt {.(}e'{} \texttt {)}\) for accessing at the index \(e'\) of the array e;

  • \(e\texttt {.(}e'{} \texttt {)} \leftarrow e''\) for setting the value of \(e''\) at the index \(e'\) of the array e;

  • an array-length operation for accessing the length of the array e;

  • \(e\;;~e'\), which is syntactic sugar for \(\texttt {let}~x~\texttt {=}~e~\texttt { in }~e'\) where x is a fresh name;

  • a loop construct, which is syntactic sugar for a tail-recursive formulation using let rec.

Note that, currently, Macle circuits cannot allocate data structures; they can only manipulate values allocated by the VM in the OCaml heap.

To preserve the semantics and the safety of the Macle code, multiple let-bindings are sequentialized when they contain memory accesses or raise an exception.

General recursion is supported via a program transformation producing code that contains only tail-recursive calls and uses an explicit call stack, implemented in on-chip memory (see Sect. 5). When this stack overflows, a Stack_overflow exception is raised. Tail-recursive functions do not require a stack.
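The following OCaml sketch illustrates the kind of rewriting involved, on a tree-summing function: the direct, non-tail-recursive definition becomes a tail-recursive loop driven by an explicit stack (here an OCaml list standing in for the on-chip stack). This is only an illustration of the idea; Macle's actual transformation and stack representation may differ.

    type tree = Leaf of int | Node of tree * tree

    (* Direct, non-tail-recursive version. *)
    let rec sum_tree = function
      | Leaf n -> n
      | Node (l, r) -> sum_tree l + sum_tree r

    (* Tail-recursive version using an explicit stack of pending subtrees. *)
    let sum_tree_tail t =
      let rec go acc stack =
        match stack with
        | [] -> acc
        | Leaf n :: rest -> go (acc + n) rest
        | Node (l, r) :: rest -> go acc (l :: r :: rest)
      in
      go 0 [ t ]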

Figure 5a shows three Macle circuits. The circuit gcd_rtl expresses the GCD algorithm in Macle Core. The circuit collatz computes the stopping time of a Collatz [16] sequence (also called Syracuse) starting from a given integer. The circuit sum_array sums the elements of a given OCaml array.
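Since Fig. 5a is not reproduced here, the following plain-OCaml approximations give an idea of what these three circuits compute; the actual Macle code differs at least by the circuit-declaration keyword, and its exact formulation is that of the figure.

    let gcd_rtl a b =
      let rec gcd a b = if b = 0 then a else gcd b (a mod b) in
      gcd a b

    (* Stopping time of the Collatz sequence starting from n0. *)
    let collatz n0 =
      let rec steps t n =
        if n <= 1 then t
        else if n mod 2 = 0 then steps (t + 1) (n / 2)
        else steps (t + 1) (3 * n + 1)
      in
      steps 0 n0

    (* Sum of the elements of an OCaml array (note the length re-read at each
       step, which Sect. 6 optimizes away with a skeleton). *)
    let sum_array (a : int array) =
      let rec sum acc i =
        if i >= Array.length a then acc else sum (acc + a.(i)) (i + 1)
      in
      sum 0 0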

Fig. 5: Examples of Macle circuits and OCaml program using a Macle circuit

Figure 5b shows an OCaml program calling a Macle circuit. It allocates an abstract syntax tree in the OCaml heap and evaluates it using the eval_exp Macle circuit. Accesses to the OCaml heap are safe since the exception Failure is (implicitly) raised in case of an out-of-bounds index or a non-exhaustive pattern matching. This exception can then be caught in OCaml by a try ... with construct. In this example, the program evaluates the expression Add(Int(1),Var(0)) recursively and prints the result. Evaluating Var(0) fetches the value at index 0 of the array.
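The following hedged OCaml sketch mirrors what the host-side datatype and the eval_exp circuit might look like. The constructors Int, Var and Add appear in the text; any other constructors of the actual figure, the name of the environment array and the error message are assumptions.

    type exp = Int of int | Var of int | Add of exp * exp

    (* Evaluate an expression tree against an environment array; an out-of-bounds
       index raises Failure, mirroring the implicit check performed in Macle. *)
    let rec eval_exp (env : int array) (e : exp) : int =
      match e with
      | Int n -> n
      | Var i ->
          if i < 0 || i >= Array.length env then failwith "index out of bounds"
          else env.(i)
      | Add (e1, e2) -> eval_exp env e1 + eval_exp env e2

    let () =
      let env = [| 5 |] in
      print_int (eval_exp env (Add (Int 1, Var 0)));   (* prints 6 *)
      print_newline ()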

4 Compiling Macle

The global compilation flow from Macle to VHDL is depicted in Fig. 6. It involves four passes. The first pass (1) consists in normalizing the source code:

  • renaming all bindings in the source code with unique names;

  • rewriting the code in so-called Administrative Normal Form [17] (introducing let-bindings for each step of computation);

  • inlining functions by recursively duplicating their body at each call site (recursive calls excepted);

  • transforming recursive functions which are not tail-recursive into tail-recursive ones using an explicit stack.

The second pass (2) compiles Macle into an intermediate language, called HSML (Hierarchical State Machine Language), which expresses parallel compositions of hierarchical finite state machines. The third pass (3) flattens the hierarchical structure of HSML. The fourth pass (4) translates a flat HSML description into VHDL. At each point of the compilation flow, an OCaml backend is provided for simulation and debugging on a PC, as indicated by the dotted arrows.

The rest of this section focuses on pass 2 and is restricted to Macle Core.

Fig. 6: Compilation flow of Macle to VHDL

4.1 Targeting the Register Transfer Level

Synchronous finite state machines (FSMs) are commonly used to describe computations at the register transfer level (RTL). An FSM is classically defined by a set of states (names) and a set of transitions. Each transition connects a source state to a destination state and can be associated with a set of guards and a set of actions. Guards define when the transition is enabled; they can depend on inputs and local variables.

Actions are performed when the transition is taken; they can write outputs and local variables. Transitions are only taken at the rising edge of a global clock. At each clock edge, if a transition starting from the current state has all its guards satisfied, it is taken: the associated actions are performed (instantaneously) and the destination state becomes the current state.

FSMs are classically encoded in VHDL as synchronous processes with asynchronous reset. Inputs, outputs and local variables are implemented as VHDL signals with a dedicated signal representing the current state. At each rising edge of the input clock, depending on the value of the current state and some conditions involving inputs and local variables, the next state value is selected and the value of outputs and local variables is updated. The FSM is re-initialized, asynchronously, whenever the reset input signal becomes true.

Figure 7 gives a graphical representation of an FSM describing the computation of GCD and its encoding in VHDL. The start input and rdy output are used respectively to start the computation and to signal its termination. In the VHDL code, modifications of the state variable STATE, as well as of the outputs and local variables, use signal assignments (\({\texttt {<signal\_name> <= <expression>}}\)). Assignments occurring at the same clock edge are performed concurrently, i.e., the expressions denoted by the right-hand sides (RHSs) are all evaluated in parallel and then, and only then, the signals designated by the left-hand sides (LHSs) are updated simultaneously.
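To make this FSM semantics concrete without reproducing the VHDL, here is a hedged, cycle-accurate OCaml model of the GCD FSM of Fig. 7 (state and field names are ours). Each call to tick corresponds to one rising clock edge, which is why Sect. 5.2 reports a throughput of exactly one tail-call (one Euclidean division step) per tick.

    type state = Idle | Compute

    type gcd_fsm = {
      mutable state  : state;
      mutable a      : int;
      mutable b      : int;
      mutable result : int;
      mutable rdy    : bool;
    }

    let create () = { state = Idle; a = 0; b = 0; result = 0; rdy = false }

    (* One rising clock edge: guards are read on the current values and the updates
       of the chosen transition take effect together (here (a', b') is computed
       before any field is written, mimicking concurrent signal assignment). *)
    let tick m ~start ~arg_a ~arg_b =
      match m.state with
      | Idle when start ->
          m.a <- arg_a; m.b <- arg_b; m.rdy <- false; m.state <- Compute
      | Idle -> ()
      | Compute ->
          if m.b = 0 then begin
            m.result <- m.a; m.rdy <- true; m.state <- Idle
          end else begin
            let a', b' = m.b, m.a mod m.b in
            m.a <- a'; m.b <- b'
          end

    (* Example: run the machine to completion for gcd 147 105 (prints 21). *)
    let () =
      let m = create () in
      tick m ~start:true ~arg_a:147 ~arg_b:105;
      while not m.rdy do tick m ~start:false ~arg_a:0 ~arg_b:0 done;
      print_int m.result; print_newline ()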

Fig. 7: FSM and VHDL implementation of the GCD algorithm (given in Macle Fig. 5)

Note that in the code given in Fig. 7, the arguments and the result are encoded as 31-bit signed integers, so as to have the same representation of OCaml values as in the O2B runtime. This enhances the interoperability between Macle and OCaml.

By declaring separate processes—each encoding a given FSM—within the same VHDL architecture, it is easy to implement synchronous parallel composition of FSMs. Each FSM is triggered by the same global clock and has access to the signals declared in the architecture. However, these signals can only be shared for reading as a signal written by a process cannot be written by another process.

4.2 An FSM-Based Intermediate Language

We do not compile Macle circuits directly to VHDL. Instead, we use an intermediate language, HSML (Hierarchical State Machine Language), to describe the behavior of FSMs and express their composition; it can be easily translated to VHDL. HSML is inspired by well-known FSM-based formalisms and languages with notions of hierarchy and compositionality, such as Statecharts [11], Communicating Hierarchical State Machines [1] and Lustre-like languages with automata [8].

Figure 8 defines the syntax of HSML. A circuit is a parallel composition of FSMs \((A_1~\Vert ~\cdots ~A_n)\) depending on inputs, modifying outputs and using local variables. An FSM is a set of zero or more mutually recursive transitions in the scope of a body used to initialize it. A transition is a thunk \(f\texttt {()} = A\) associating a name f to an FSM A. HSML offers a notion of hierarchy: given a transition t and an output x, an FSM that assigns x and then branches to t can be written either hierarchically, by nesting the definition of t at its point of use, or flatly, by referring to t by name.

Fig. 8: Syntax of HSML

An FSM A is either a mutually recursive definition of transitions (introduced by let rec), a conditional, a parallel assignment, a branch \(f\texttt {()}\) to a transition \(f\texttt {()} = A\), or a local parallel composition of FSMs. An assignment evaluates the expressions \(e_1~\cdots ~e_n\), then assigns the results to the variables \(x_1~\cdots ~x_n\) and finally computes A. An expression e is a variable, a constant or the application of a built-in operator. The logical operators \(\wedge \) and \(\vee \) are strict.

Figure 9 shows an HSML circuit corresponding to the VHDL code given in Fig. 7. This circuit was automatically generated from the Macle circuit gcd_rtl defined in Fig. 5.

Fig. 9: HSML circuit implementing the GCD algorithm

HSML exposes the semantics of the register transfer level (RTL, informally presented above with the VHDL code of Fig. 7) while allowing hierarchical formulations that make it close to an expression language. In particular, some HSML constructs (such as let rec and the conditional) are shared with Macle. HSML therefore constitutes a convenient intermediate language for compiling Macle to VHDL.

4.3 Compiling Macle Core to HSML

The compilation of a Macle Core circuit is defined as the compilation of the body e of the circuit, from which the inputs, outputs and local variables are inferred.

The compilation \(\mathscr {C}\llbracket e\rrbracket ^{\textit{start},\textit{rdy},\textit{result}}\) of a Macle Core expression e is a hierarchical FSM initialized in a special state idle. It waits for the input start to be set to true before starting the computation. This computation assigns a value to the output result. The output rdy signals when the computation is done. The auxiliary function \(\mathscr {C}_{\textsf {e}}\llbracket e\rrbracket ^{\textit{result},\textit{idle}}_\rho \) is defined next. The compilation environment \(\rho \) maps function names to the list of their formal arguments.

The compilation \(\mathscr {C}_{\textsf {e}}\llbracket e\rrbracket ^{\textit{r},\textit{idle}}_\rho \) of a subexpression is defined inductively on the syntax of expressions. The compilation of a subexpression e which does not contain control structures is defined as an assignment of e to the variable r, continuing with a branch to the destination transition.

The compilation of a Macle conditional is an HSML conditional, subexpressions being inductively compiled.

Compiling a let rec globalizes function parameters. To achieve this, each function name introduced by a let rec is bound to the list of its formal parameters in the compilation environment \(\rho \). The extension of \(\rho \) with a function name f bound to its parameters \(x_1~\cdots ~x_n\) is denoted by \(\rho [f\mapsto (x_1,~\cdots ~x_n)]\), assuming that f is not in the domain of \(\rho \). Conversely, the compilation of a function call \((f~x_1~\cdots ~x_n)\) is an assignment of the values \(x_1~\cdots ~x_n\) to the formal parameters \(y_1~\cdots ~y_n\) given by \(\rho (f)\), continuing with a branch to \(f\texttt {()}\).

The compilation of a let with a single binding is defined as the compilation of the subexpression e into the variable x continuing with the compilation of the body \(e'\).

The compilation of a let with more than one binding is defined as a parallel composition of FSMs followed by a synchronization barrier activating the execution of the compiled body of the let.
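The sketch below suggests, in OCaml, what this pass might look like on a deliberately simplified fragment of Macle Core and HSML: it covers only atomic expressions, conditionals and single let-bindings, and the constructor names and the exact HSML representation are assumptions, not the compiler's actual data types. Compiling a whole circuit body with k set to a branch back to the idle transition produces an FSM that assigns result and returns to idle, as described above.

    type expr =
      | Var of string
      | Int of int
      | Prim of string * expr list            (* application of a built-in operator *)

    type macle =
      | Atom of expr                          (* expression without control structure *)
      | If of expr * macle * macle
      | Let of string * macle * macle         (* single (sequential) let-binding *)
      (* let rec, function calls and parallel bindings are omitted here *)

    type hsml =
      | Assign of (string * expr) list * hsml (* parallel assignment, then continue *)
      | Cond of expr * hsml * hsml
      | Goto of string                        (* branch to a named transition *)

    (* compile e r k: compile e so that its value ends up in the variable r, then
       continue with k (at top level, k is a branch back to the idle transition). *)
    let rec compile (e : macle) (r : string) (k : hsml) : hsml =
      match e with
      | Atom a -> Assign ([ (r, a) ], k)                     (* assignment + tail branch *)
      | If (c, e1, e2) -> Cond (c, compile e1 r k, compile e2 r k)
      | Let (x, e1, e2) -> compile e1 x (compile e2 r k)     (* e1 into x, then the body *)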

Parallel let-bindings provide the main source of acceleration for OCaml programs on FPGA, as shown in the next section.

5 Examples and Benchmarks

We now evaluate the speedup that can be achieved by running OCaml programs on FPGA via O2B and Macle. These programs are assessed by taking as reference an equivalent C code running on the same softcore processor. We first consider programs using circuits written in Macle Core (defined in Fig. 4a) and then Macle circuits interacting with the OCaml runtime (using the constructs defined in Fig. 4b).

5.1 Methodology

Experimental setup We use a Max10 Intel FPGA embedded on a Terasic DE10-Lite board. This FPGA has limited resources: 50K logic elements (LEs) and 1638 Kbit of on-chip memory. The board itself has 64 MB of external memory (SDRAM) and a clock frequency of 50 MHz. From a given OCaml source program, O2B creates a C program containing the bytecode generated by the OCaml compiler, the VM, its runtime library (including a GC) and additional C code. The bytecode as well as the OCaml stack and heap are implemented as C static arrays, all stored in external memory. The stack size is 6400 words of 4 bytes and the heap size is 4 MB. The resulting C program is compiled via the Nios II backend of gcc with optimizations enabled (-Os). The Macle circuits (compiled to VHDL) and the softcore processor are synthesized using Quartus 20.1. All data structures manipulated by OCaml, C and Macle code use the OCaml heap. Bounds of OCaml arrays are dynamically checked at each access.

Measuring elapsed time Macle circuits are called from a C block executed on the softcore. For this, and as described in Sect. 2.2, it is necessary to write arguments in the dedicated registers of the custom component implementing a circuit, start this circuit and wait for the end of the computation to read the result (again in the dedicated registers of the custom component). These reads and writes are done via the Avalon SOPC bus. We measure the execution time of each Macle circuit from the beginning to the end of the corresponding C block. The reported times, therefore, include the time to transfer the arguments and results.

5.2 Macle Core

Pure Computations The throughput of the Macle circuit gcd_rtl given in Fig. 5 is exactly one tail-call per clock tick at 50 MHz (i.e., 50 million tail-calls per second). We measure the execution time of gcd_rtl compared to that of the C function gcd_c (given in Fig. 2a) running on the softcore. The observed Macle vs. C speedup factor is \(28\times \). The hardware implementation of gcd_rtl uses approximately 360 logic elements (LEs), i.e., 0.75% of the total number of LEs available on the FPGA. Figure 10 summarizes these results and gives similar examples of tail-recursive functions expressed in Macle vs. C, both called from an OCaml program executed by O2B. This benchmark comprises the greatest common divisor (GCD), the recursive sum of the first n positive integers (SUM_INT), the Fibonacci sequence (FIBONACCI), a tail-recursive version of the McCarthy 91 function (F91), and the Collatz sequence (COLLATZ) as defined in Fig. 5a in Macle. The key point is that these “pure” Macle circuits have a throughput of one tail-call per clock tick, whereas an iteration in the C code results in a sequence of instructions; hence the speedup of Macle vs. C depends on the nature of the computation. Moreover, the size (in LEs) of the hardware generated from Macle also depends on the nature of the computation.

Fig. 10: Speedup factor of pure computations defined in Macle vs. C along with resource usage

Parallel computations Fig. 11a gives a circuit sum_gcd2 calling a function gcd twice and summing the results. The multiple let-binding is implemented by a synchronization barrier involving a parallel composition of two instances of the FSM given in Fig. 7. The global execution time of the barrier is the maximum of the execution times of the expressions (gcd \(\texttt {a}_i\) y), to which is added the execution time of the rest of the computation (here instantaneous). For instance, calling the circuit \(\texttt {sum\_gcd}_2\) with equal arguments \(a_1\) and \(a_2\) doubles the \(28\times \) speedup reported in Fig. 10.

Fig. 11: Parallelization of a computation and impact on the size of the generated hardware

Generalizing this example to circuits \(\texttt {sum\_gcd}_n\) (computing \(\texttt {gcd\_rtl}\) n times and summing the results) gives a speedup of \(28\times n\) in Macle vs. C (e.g., \(\texttt {sum\_gcd}_{32}\) is about 900 times faster in Macle than in C). This gain is only possible because the gcd local function is inlined n times, the generated hardware using more LEs as shown in Fig. 11b.
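A hedged reconstruction of sum_gcd2, again in plain OCaml syntax, is given below; the actual Macle code of Fig. 11a may differ in details, but the key point is the multiple let-binding, which makes the two gcd instances run in parallel in hardware.

    let sum_gcd2 a1 a2 y =
      let rec gcd a b = if b = 0 then a else gcd b (a mod b) in
      (* the two bindings are compiled to two parallel instances of the gcd FSM *)
      let r1 = gcd a1 y
      and r2 = gcd a2 y in
      r1 + r2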

5.3 Interacting with the OCaml Runtime

Figure 12 depicts the execution time of a Macle circuit sum_array (given in Fig. 5a) computing the sum of the elements of an OCaml array. The size n of the array is its number of elements. The input array is filled with the first n positive integers.

In Fig. 12a, the OCaml heap is limited to 64 KB and is allocated either in on-chip memory or in external memory (SDRAM). Access times are longer (about twice as long in this example) with external memory than with on-chip memory. This remains reasonable, considering that it allows manipulating much larger data structures.

In Fig. 12b, the OCaml heap is allocated in SDRAM (with a heap size of 4 MB). The Macle version is 4.8 times faster than the C one. Compared to the speedup obtained on pure computations, this example highlights a bottleneck when using memory accesses from Macle code.

Fig. 12: Execution time of a Macle circuit summing the elements of an OCaml array

Figure 13a shows the execution time of a Macle circuit matrix_multiply multiplying two \(n\times n\) matrices filled with positive integers smaller than n, vs. a C version. The Macle version (using a classic formulation with three nested loops) is 7.5 times faster than the C one. The generated hardware uses 1602 LEs.

Figure 13b shows the execution time of the Macle circuit eval_exp (given Fig. 5a) vs. a C version, recursively evaluating trees of arithmetic expressions of various sizes (in number of constants and variables). The Macle version is 13 times faster than the C formulation. The realization of this Macle circuit uses 2461 LEs and an explicit stack of 3072 words implemented in on-chip memory (using RAM blocks and without requiring bus accesses).

Fig. 13: Execution time of Macle circuits using imperative features

This preliminary evaluation shows that reformulating side-effect-free C functions as Macle circuits can bring substantial speedups (e.g., up to \(28\times \) for the gcd_rtl of Fig. 5). Replicating the hardware corresponding to these circuits, which intrinsically results in their parallel execution, allows these speedups to be boosted further (e.g., up to \(960\times \) for the \(\texttt {sum\_gcd}_{32}\) example given in Fig. 10).

It also shows that, for large data structures such as arrays, the cost of accessing the corresponding memory can quickly create a bottleneck.

6 Optimised Transfers and Parallel Skeletons

As demonstrated in the previous section, allowing Macle circuits to manipulate values stored in the OCaml heap has a cost. Because this heap is implemented in shared memory, each access requires a bus transaction. When manipulating large data structures, such as arrays, the corresponding overhead can quickly become prohibitive. To overcome this problem, Macle provides parallel skeletons aiming at minimizing this overhead and offering higher-level parallelism. These skeletons are listed in Fig. 14.

Fig. 14: Simple parallel skeletons available in Macle

Each skeleton is parameterized by an integer literal k, which statically specifies the size of a buffer used internally to transfer slices of the source and/or destination arrays between the OCaml heap and the Macle circuits (a plain-OCaml sketch of their semantics is given after this list):

  • a map-like skeleton copies the first k elements of the OCaml array src into a VHDL array buf, computes the function f in parallel on each element of buf and writes back the k resulting values into the OCaml array dst. Processing the whole OCaml array is carried out by iterating this transfer-execution-transfer sequence.

  • a fold-like skeleton sequentially reduces the OCaml array a with the function f and an accumulator initialized to init, processing the array elements k by k. If the body of f is a combinatorial expression (i.e., a constant, a variable or an operator applied to combinatorial expressions), each transfer-execution sequence is pipelined, thus avoiding the use of a buffer.

  • a scan-like skeleton sequentially reduces the OCaml array src with the function f and an accumulator initialized to init, processing the array elements k by k and writing back each group of k intermediate results into the OCaml array dst.
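As a reference point, the following plain OCaml functions give a possible sequential semantics for these three skeletons; the names (array_map_by, array_reduce_by, array_scan_by), argument orders and the handling of the final accumulator are assumptions, and on the FPGA the k parameter changes how the work is scheduled, not what is computed.

    (* Map-like: apply f to every element of src, writing the results into dst. *)
    let array_map_by _k f src dst =
      Array.iteri (fun i x -> dst.(i) <- f x) src

    (* Fold-like: reduce a with f starting from init. *)
    let array_reduce_by _k f init a =
      Array.fold_left f init a

    (* Scan-like: reduce src with f, writing every intermediate accumulator into dst. *)
    let array_scan_by _k f init src dst =
      let acc = ref init in
      Array.iteri (fun i x -> acc := f !acc x; dst.(i) <- !acc) src;
      !acc

With these signatures, the sum_array_optimized circuit of Example 1 below would presumably amount to a k-wise reduction with (+) and an accumulator initialized to 0.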

Example 1

Figure 15a illustrates the use of a skeleton to define an optimized version sum_array_optimized of the Macle circuit sum_array (given in Fig. 5a). Figure 15b gives the execution time of the circuit sum_array_optimized vs. the circuit sum_array. Using the skeleton, the size of the array is computed only once rather than at each array access. Moreover, since the body of the reduction function f is a combinatorial expression, the execution is pipelined and the generated code does not need any buffer. The size of the circuit does not depend on k: the sum_array and sum_array_optimized circuits use around 570 and 590 LEs respectively. The optimized version is 2.8 times faster than the unoptimized one. Figure 12b showed that the Macle circuit sum_array is 4.8 times faster than C. As a result, sum_array_optimized yields a speedup of \(4.8 \times 2.8 \approx 13\) over the C version.

Fig. 15: OCaml program with a Macle circuit using a parallel skeleton

Example 2

We consider an OCaml program using a Macle circuit filter_mul\(_k\) replacing all multiples of an integer y by zero in an OCaml array a of size n containing the integers from 1 to n. This Macle circuit uses a parallelism skeleton processing the array a in parallel by slices of k elements. Figure 16a gives the corresponding program for \(k:=64\).
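A plain-OCaml equivalent of this computation is sketched below; on the FPGA, the map skeleton instead processes the array by slices of k elements transferred to an on-chip buffer and filtered in parallel (the name filter_mul and the per-element function are ours, reconstructed from the description above).

    (* Replace every multiple of y in a by 0 (sequential reference version). *)
    let filter_mul y (a : int array) =
      Array.iteri (fun i x -> if x mod y = 0 then a.(i) <- 0) a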

Figure 16b shows the execution times of the Macle circuit filter_mul\(_k\) for different values of k vs. a sequential C version. Doubling the degree of parallelism k almost doubles both the size of the generated hardware and the speedup (taking into account the transfer time). For instance, when \(n = 96{,}000\), filter_mul\(_{1}\) is 28 times faster than the C version, while filter_mul\(_{64}\) is 60 times faster than filter_mul\(_1\), resulting in a cumulative speedup of \(28 \times 60 = 1680\). This follows a classical space-time trade-off, as shown in Fig. 16c, which gives the sizes (in LEs) of filter_mul\(_{k}\) for different values of k.

Comparison with sequential code running on a PC. We assess the performance of the Macle circuit filter_mul_64 (given in Fig. 16a) vs. a sequential C version running on a PC equipped with an Intel Core i7 at 2.2 GHz and 16 GB of RAM. This C code is compiled with the gcc option -Os and called from an OCaml program compiled to native code with the ocamlopt compiler. The frequency ratio between the PC and the FPGA is \(2.2\,\mathrm{GHz}/50\,\mathrm{MHz} = 44\). Due to licensing limitations, the Nios II architecture used in our experiment is a basic, unoptimized one. The sequential C version running on the softcore is \(44 \times 8.7 \approx 380\) times slower than the same C code running on the PC. Thus, on this small benchmark, the Macle code synthesized on the FPGA is \(1{,}680/380 \approx 4.4\) times faster than an equivalent sequential C code running on a PC.

Fig. 16: OCaml program with a Macle circuit using a parallel skeleton

7 Conclusion

In this paper, we have proposed an approach for programming FPGAs using the OCaml language. This approach consists in:

  • running OCaml programs by embedding their bytecode and the OCaml VM in a C program running on a softcore processor;

  • calling hardware-accelerated functions, defined by the user in the Macle language, from OCaml.

Macle is a functional-imperative subset of OCaml supporting:

  • parallel and sequential compositions of computations;

  • mixing computations with sequential accesses to the OCaml heap (within the dynamic memory of the softcore processor);

  • use of parallelism skeletons on dynamic data structures with optimization of memory transfers.

Hardware acceleration of OCaml functions is simply obtained by replacing the keyword introducing a function definition in the original OCaml code with the dedicated Macle keyword. This facilitates porting OCaml applications, quick prototyping and debugging. Moreover, Macle is a statically typed language that provides much stronger guarantees on the safety of the generated hardware than classical HDLs.

We have presented an implementation of the proposed approach based on the O2B framework augmented with a Macle compiler targeting VHDL. This compilation flow is fully automated on an Intel FPGA. It is simple to use and includes a simulation mode generating OCaml code from different points of the Macle compiler, allowing applications to be tested on a PC before synthesizing them on the FPGA. The use of a local stack implemented in on-chip memory (instead of LEs) to realize non-tail-recursive Macle functions (as mentioned in Sect. 5.3) is a key point allowing large and complex symbolic computations to be implemented on moderately sized FPGAs.

Preliminary results, obtained on small benchmarks, are very encouraging. They show in particular that important speedups (up to three orders of magnitude compared to C code running on the embedded softcore) can be obtained by combining the ability to compile a Macle function to hardware with the possibility of replicating the corresponding hardware in order to exploit data parallelism. Parameterizable parallel skeletons offer both a way to address the bottleneck occurring when exchanging data between the OCaml host program and the accelerated Macle functions, and a very practical way to explore the space-time trade-off, a classical issue when programming FPGAs (reducing computing time by increasing the number of logic elements used).

The work described in this paper offers many interesting paths for future work.

First of all, scaling up to larger applications is an important point to convince the OCaml community to use FPGAs, and the FPGA community to use high-level languages. From a programmer's point of view, it would be useful to allocate values from Macle code, support concurrent memory accesses, share Macle local functions (rather than inline them), and provide more parallel skeletons, possibly domain-specific ones.

Concerning the toolchain itself, we plan to switch to fully open-source design and synthesis tools, with the idea that using such tools would facilitate both the static analysis of Macle code and the prediction of the efficiency of the generated hardware (e.g., resource usage and execution time). This information could be used, for example, to decide which Macle functions should be inlined, and also to provide guarantees on applications interacting with the outside world. This is especially necessary for critical applications, for which it would be appropriate to use synchronous programming models similar to the HSML intermediate language used in the Macle compiler.

One can expect much higher speedups by using faster FPGA boards and/or boards with more resources. Moreover, although the use of softcore processors leads to some inefficiencies, it remains very suitable for multi-core programming, with each core running an instance of the OCaml VM.

In the longer term, we could also explore other ways to accelerate both the runtime (memory and exception management) and the VM interpreter by partially implementing them in hardware, or even exploit different levels of parallelism, such as multiple VMs sharing Macle code. The latter could provide an interesting approach to exploiting heterogeneous platforms including multi-cores, GPUs and FPGAs.