# Formal Verification of Hardware Synthesis Thomas Braibant<sup>1</sup> and Adam Chlipala<sup>2</sup> <sup>1</sup> Inria Paris-Rocquencourt <sup>2</sup> MIT CSAIL **Abstract.** We report on the implementation of a certified compiler for a high-level hardware description language (HDL) called Fe-Si (FEatherweight Synthesis). Fe-Si is a simplified version of Bluespec, an HDL based on a notion of guarded atomic actions. Fe-Si is defined as a dependently typed deep embedding in Coq. The target language of the compiler corresponds to a synthesisable subset of Verilog or VHDL. A key aspect of our approach is that input programs to the compiler can be defined and proved correct inside Coq. Then, we use extraction and a Verilog back-end (written in OCaml) to get a certified version of a hardware design. #### Introduction Verification of hardware designs has been thoroughly investigated, and yet, obtaining provably correct hardware of significant complexity is usually considered challenging and time-consuming. On the one hand, a common practice in hardware verification is to take a given design written in an hardware description language like Verilog or VHDL and argue about this design in a formal way using a model checker or an SMT solver. On the other hand, a completely different approach is to design hardware via a shallow embedding of circuits in a theorem prover [16,14,18,4,11]. Yet, both kinds of approach suffer from the fact that most hardware designs are expressed in low-level register transfer languages (RTL) like Verilog or VHDL, and that the level of abstraction they provide may be too low to do short and meaningful proof of high-level properties. To raise this level of abstraction, industry moved to high-level hardware synthesis using higher-level languages, e.g., System-C [28], Esterel [3] or Bluespec [1], in which a source program is compiled to an RTL description. High-level synthesis has two benefits. First, it reduces the effort necessary to produce a hardware design. Second, writing or reasoning about a high-level program is simpler than reasoning about the (much more complicated) RTL description generated by a compiler. However, the downside of high-level synthesis is that there is no formal guarantee that the generated circuit description behaves exactly as prescribed by the semantics of the source program, making verification on the source program useless in the presence of compiler-introduced bugs. In this paper, we investigate the formal verification of a lightly optimizing compiler from a Bluespec-inspired language called Fe-Si to RTL, applying (to a lesser extent) the ideas behind the CompCert project [19] to hardware synthesis. N. Sharygina and H. Veith (Eds.): CAV 2013, LNCS 8044, pp. 213–228, 2013. Fe-Si can be seen as a stripped-down and idealized version of Bluespec: hardware designs are described in terms of guarded atomic actions on state elements. Guarded atomic actions have a flavour of transactional memory, where updates to state elements are not visible before the end of the transaction (a time-step). Our target language can be sensibly interpreted as clocked sequential machines: we generate an RTL description syntactically described as combinational definitions and next-state assignments. In our development, we define a (dependently typed) deep embedding of the Fe-Si programming language in Coq using parametric higher-order abstract syntax (PHOAS) [7], and give it a semantics using an interpreter: the semantics of a program is a Coq function that takes as inputs the current state of the state elements and produces their next state. Fe-Si hits a sweet spot between deep and shallow embeddings: it makes it possible to use Coq as a meta programming tool to describe circuits, without the pitfalls usually associated with a deep embedding (e.g., the handling of binders). This provides an economical path toward succinct and provably correct description of, e.g., recursive circuits. #### 1 Overview of Fe-Si Fe-Si is a purely functional language built around a *monad* that makes it possible to define circuits. We start with a customary example: a half adder. ``` Definition hadd (a b: Var B) : action [] (B \otimes B) := do carry \leftarrow ret (andb a b); do sum \leftarrow ret (xorb a b); ret (carry, sum). ``` This circuit has two Boolean inputs (Var B) and returns a tuple of Boolean values ( $B \otimes B$ ). Here, we use Coq notations to implement some syntactic sugar: we borrow the do-notation to denote the monadic bind and use ret as a short-hand for return. Up to this point, Fe-Si can be seen as an extension of the Lava [5] language, implemented in Coq rather than Haskell. Yet, using Coq as a metalanguage offers the possibility to use dependent types in our circuit descriptions. For instance, one can define an adder circuit of the following type by induction: In this definition, $\mathbf{n}$ is a formal parameter that denotes the size of the operands and the size of the result. (Note that this definition describes an infinite family of adders that can be proved correct all at once using inductive reasoning. Then, the formal parameter can be instantiated to yield certified finite-size designs.) **Stateful Programs.** Fe-Si also features a small set of primitives for interacting with *memory elements* that hold mutable state. In the following snippet, we build a counter that increments its value when its input is true. ``` Definition \Phi := [\operatorname{Reg} (\operatorname{Int} n)]. Definition count n (tick: Var B) : action \Phi (\operatorname{Int} n) := \operatorname{do} x \leftarrow [\operatorname{member_0}; \operatorname{do} \bot \leftarrow \operatorname{if} \operatorname{tick} \operatorname{then} \{\operatorname{member_0} ::= x + 1\} \operatorname{else} \{\operatorname{ret} ()\}; \operatorname{ret} x ``` Here, $\Phi$ is an environment that defines the set of memory elements (in a broad sense) of the circuit. In the first line, we read the content of the register at position member\_0 in $\Phi$ , and bind this value to x. Then, we test the value of the input tick, and when it is true, we increment the value of the register. In any case, the output is the old value of the counter. The above "if-then-else" construct is defined using two primitives for guarded atomic actions that are reminiscent of transactional memory monads: assert and orElse. The former aborts the current action if its argument is false. The latter takes two arguments a and b, and first executes a; if it aborts, then the effects of a are discarded and b is run. If b aborts too, the whole action a orElse b aborts. Synchronous Semantics. Recall that Fe-Si programs are intended to describe hardware circuits. Hence, we must stress that they are interpreted in a synchronous setting. From a logical point of view the execution of a program (an atomic action) is clocked, and at each tick of its clock, the computation of its effects (i.e., updates to memory elements) is instantaneous: these effects are applied all at once between ticks. In particular this means that it is not possible to observe, e.g., partial updates to the memory elements, nor transient values in memory. (In Bluespec terminology, this is "reads-before-writes".) From Programs to Circuits. At this point, the reader may wonder how it is possible to generate circuits in a palatable format out of Fe-Si programs. Indeed, using Coq as a meta-language to embed Fe-Si yields two kinds of issues. First, Coq lacks any kind of I/O; and second, a Fe-Si program may have been built using arbitrary Coq code, including, e.g., higher-order functions or fixpoints. Note that every Coq function terminates. Therefore, a closed Fe-Si program of type action evaluates to a term that is syntactically built using the inductive constructors of the type action (i.e., all intermediate definitions in Coq have been expanded). Then we use Coq's extraction, which generates OCaml code from Coq programs. Starting from a closed Fe-Si program foo, we put the following definition in a Coq file: Definition bar := fesic foo. The extracted OCaml term that corresponds to bar evaluates (in OCaml) to a closed RTL circuit. Then, we can use an (unverified) back-end that pretty-prints this RTL code as regular Verilog code. (This devious use of Coq's extraction mechanism palliates the fact there is no I/O mechanism in Coq.) #### 2 From Fe-Si to RTL In this section, we shall present our source (Fe-Si) and target (RTL) languages, along with their semantics. For the sake of space, we leave the full description of this compilation process out of the scope of this paper. ### 2.1 The Memory Model Fe-Si programs are meant to describe sequential circuits, whose "memory footprints" must be known statically. We take a declarative approach: each state-holding element that is used in a program must be declared. We currently have three types of memory elements: inputs, registers, and register files. A register holds one value of a given type, while a register file of size n stores $2^n$ values of a given type. An input is a memory element that can only be read by the circuit, and whose value is driven by the external world. We show the inductive definitions of types and memory elements in Fig. 1. We have four constructors for the type ty of types: Unit (the unit type), B (Booleans), Int (integers of a given size), and Tuple (tuples of types). The inductive definition of memory elements (mem) should be self-explaining. We endow these inductive definitions with a denotational semantics: we implement Coq functions that map such reified types to the obvious Coq types they denote. Fig. 1. Types and memory elements #### 2.2 Fe-Si The definition of Fe-Si programs (action in the following) takes the PHOAS approach [7]. That is, we define an inductive type family parameterized by an arbitrary type V of variables, where binders bind variables instead of arbitrary terms (as would be the case using HOAS [21]), and those variables are used explicitly via a dedicated term constructor. The definition of Fe-Si syntax is split in two syntactic classes: expressions and actions. Expressions are side-effect free and are built from variables, constants, and operations. Actions are made of control-flow structures (assertions and alternatives), binders, and memory operations. In this work, we follow an intrinsic approach [2]: we mix the definition of the abstract syntax and the typing rules from the start. That is, the type system of the meta-language (Coq) enforces that all Fe-Si programs are well-typed by construction. Besides the obvious type-oblivious definitions (e.g., it is not possible to add a Boolean and an integer), this means that the definition of operations on state-holding elements requires some care. Here, we use dependently typed de Bruijn indices. ``` Section t. Variable V: ty ightarrow Type. Variable \Phi: list mem. Inductive expr: ty \rightarrow Type := Evar : \forall t (v : V t), expr t Eandb : expr B \rightarrow expr B \rightarrow expr B \mid ... (* operations on Booleans *) Eadd: \forall n, expr (Int n) \rightarrow expr (Int n) \rightarrow expr (Int n) | ... (* operations on words *) | Efst : \forall 1 t, expr (Tuple (t:: 1)) \rightarrow expr t | ... (* operations on tuples *) Inductive action: ty \rightarrow Type:= Return: \forall t, expr t \rightarrow action t Bind: \forall t u, action t \rightarrow (V t \rightarrow action u) \rightarrow action u (* control-flow *) | OrElse: \forall t, action t \rightarrow action t \rightarrow action t Assert: expr B → action Unit (* memory operations on registers *) \mid RegRead : \forall t, member \Phi (Reg t) \rightarrow action t RegWrite: \forall t, member \Phi (Reg t) \rightarrow expr t \rightarrow action Unit (* memory operations on register files, and inputs *) End to Definition Action \Phi t := \forall V, action V \Phi t. ``` Fig. 2. The syntax of expressions and actions Using the above definition, a term of type member $\Phi$ M denotes the fact that the memory element M appears at a given position in the environment of memory elements $\Phi$ . We are now ready to present the (elided) Coq definitions of the inductives for expressions and actions in Fig. 2. (For the sake of brevity, we omit the constructors for accesses to register files, in the syntax and, later, in the semantics. We refer the reader to the supplementary material [13] for more details.) Our final definition Action of actions is a polymorphic function from a choice of variables to an action (we refer the reader to [7] for a more in-depth explanation of this encoding strategy). **Semantics.** We endow Fe-Si programs with a simple synchronous semantics: starting from an initial state, the execution of a Fe-Si program corresponds to a sequence of atomic updates to the memory elements. Each step goes as follows: reading the state, computing an update to the state, committing this update. The reduction rules of Fe-Si programs are defined in Fig. 3. The judgement $\Gamma, \Delta \vdash a \to r$ reads "in the state $\Gamma$ and with the partial update $\Delta$ , evaluating a produces the result r", where r is either None (meaning that the action aborted), or Some (v, $\Delta'$ ) (meaning that the action returned the value v and the partial update $\Delta'$ ). Note that the PHOAS approach makes it possible to manipulate closed terms: we do not have rules for $\beta$ -reduction, because it is implemented by the host language. That is, $\Gamma$ only stores the mutable state, and not the variable values. There are two peculiarities here: first, following the definition of $\oplus$ , if two values are written to a memory element, only the first one (in program order) is committed; second, reading a register yields the value that was held at the beginning of the time step. Fig. 3. Dynamic semantics of Fe-Si programs Finally, we define a wrapper function that computes the next state of the memory elements, using the aforementioned evaluation relation (starting with an empty partial update). ``` \text{Definition Next $\{t\}$ $\{\varPhi\}$ (st: $\llbracket\varPhi\rrbracket$) (A: Action $\varPhi$ t): option $(\llbracket t \rrbracket_{ty} * \llbracket\varPhi\rrbracket$) := \dots $} ``` #### 2.3 RTL Our target language sits at the register-transfer level. At this level, a synchronous circuit can be faithfully described as a set of state-holding elements, and a next-state function, implemented using combinational logic [15]. Therefore, the definition of RTL programs (block in the following) is quite simple: a program is simply a well-formed sequence of bindings of expressions (combinational operations, or reads from state-holding elements), with a list of effects (i.e, writes to state-holding elements) at the end. We show the (elided) definition of expressions and sequences of binders in Fig. 4. The definition of expressions is similar to the one we used for Fe-Si, except that we have constructors for reads from memory elements, and that we moved to "three-adress code". (That is, operands are variables, rather than arbitrary expressions.) A telescope (type scope A) is a well-formed sequence of binders with an element of type A at the end (A is instantiated later with a list of effects). Intuitively, this definition enforces that the first binding of a telescope can only read from memory elements; the second binding may use the first value, or read from memory elements; and so on and so forth. A block is a telescope, with three elements at the end: a guard, a return value, and a (dependently typed) list of effects. The value of the guard (a Boolean) is equal to true when the return value is valid and the state updates must be committed; and false otherwise. The return value denotes the outputs of the ``` Section t. Variable V: ty ightarrow Type. Variable \Phi: list mem. Inductive expr: ty \rightarrow Type := | Evar : \forall t (v : V t), expr t (* read from memory elements *) | Einput : \forall t, member \Phi (Input t) \rightarrow expr t \texttt{Eread\_r}: \ \forall \ \texttt{t}, \ \texttt{member} \ \varPhi \ (\texttt{Reg} \ \texttt{t}) \ \rightarrow \ \texttt{expr} \ \texttt{t} | Eread_rf : \forall n t, member \Phi (Regfile n t) \rightarrow V (Int n) \rightarrow expr t (* Other operations on Booleans, words, tuples, etc. *) | Emux : \forall t, \forall B \rightarrow \forall t \rightarrow \forall t \rightarrow expr t {\tt Eandb}: \ {\tt V} \ {\tt B} \ \to \ {\tt V} \ {\tt B} \ \to \ {\tt V} \ {\tt B} \ | \ \dots {\tt Eadd}: \ \forall \ n, \ {\tt V} \ ({\tt Int} \ n) \ \rightarrow \ {\tt V} \ ({\tt Int} \ n) \ \rightarrow \ {\tt expr} \ ({\tt Int} \ n) \ | \ \dots | Efst : \forall 1 t, \forall (Tuple (t::1)) \rightarrow expr t | ... Inductive scope (A : Type): Type := \mid Send : A \rightarrow scope A Sbind: \forall (t: ty), expr t \rightarrow (V t \rightarrow scope A) \rightarrow scope A. Inductive write: mem \rightarrow Type:= | WR : \forall t, V t \rightarrow V B \rightarrow write (Reg t) | \ \, \mathsf{WRF} \, : \, \, \forall \, \, \mathsf{n} \, \, \mathsf{t}, \, \, \mathsf{V} \, \, \mathsf{t} \, \rightarrow \, \, \mathsf{V} \, \, (\mathsf{Int} \, \, \mathsf{n}) \, \, \rightarrow \, \, \mathsf{V} \, \, \mathsf{B} \, \rightarrow \, \, \, \mathsf{write} \, \, (\mathsf{Regfile} \, \, \mathsf{n} \, \, \mathsf{t}). Definition effects := DList.T (option \circ write) \Phi. Definition block t := scope (V B * V t * effects). Definition Block \Phi t := \forall V. block V \Phi t. ``` Fig. 4. RTL programs with three-adress code expressions circuits. The data type effects encodes, for each memory element of the list $\Phi$ , either an effect (a write of the right type), or None (meaning that this memory element is never written to). (For the sake of brevity, we omit the particular definition of dependently typed heterogeneous lists DList.T that we use here.) **Semantics.** We now turn to define the semantics of our RTL language. First, we endow closed expressions with a denotation function (in the same way as we did at the source level, except that it is not a recursive definition). Note that we instantiate the variable parameter of **expr** with the function [.]<sub>ty</sub>, effectively tagging variables with their denotations. ``` \begin{array}{l} \text{Variable $\varGamma$: $\llbracket \varPhi \rrbracket$.} \\ \text{Definition eval\_expr (t : ty) (e : expr } \llbracket . \rrbracket_{\text{ty}} \text{ t) : } \llbracket . \rrbracket_{\text{ty}} \text{:= } \dots \end{array} ``` The denotation of telescopes is a simple recursive function that evaluates bindings in order and applies an arbitrary function on the final (closed) object. ``` Fixpoint eval_scope {A B} (F : A \rightarrow B) (T : scope [.] _{ty} A) : B := ... ``` The final piece that we need is the denotation that corresponds to the write type. This function takes as argument a single effect, the initial state of this memory location, and either returns a new state for this memory location, or returns None, meaning that location is left in its previous state. ``` \text{Definition eval\_effect } (\texttt{m} : \texttt{mem}) : \texttt{option } (\texttt{write } \llbracket.\rrbracket_{\texttt{ty}} \texttt{m}) \to \llbracket\texttt{m}\rrbracket_{\texttt{mem}} \to \texttt{option } \llbracket\texttt{m}\rrbracket_{\texttt{mem}} := \dots ``` Using all these pieces, it is quite easy to define what is the final next-state function. ``` \texttt{Definition Next}\ \{\mathtt{t}\}\ \{\varPhi\}\ (\varGamma\colon \llbracket\varPhi\rrbracket)\ (\mathtt{B}: \mathtt{Block}\ \varPhi\ \mathtt{t}): \mathtt{option}\ (\llbracket\mathtt{t}\rrbracket_{\mathtt{ty}}*\llbracket\varPhi\rrbracket):=\dots ``` #### 2.4 Compiling Fe-Si to RTL Our syntactic translation from Fe-Si to RTL is driven by the fact that our RTL language does not allow clashing assignments: syntactically, each register and register file is updated by at most one write expression. From Control Flow to Data Flow. To do so, we have to transform the control flow (the Assert and OrElse) of Fe-Si programs into data-flow. We can do that in hardware, because circuits are inherently parallel: for instance, the circuit that computes the result of the conditional expression e? a: b is a circuit that computes the value of a and the value of b in parallel and then uses the value of e to select the proper value for the whole expression. Administrative Normal Form. Our first compilation pass transforms Fe-Si programs into an intermediate language that implements A-normal form. That is, we assign a name to every intermediate computation. In order to do so, we also have to resolve the control flow. To be more specific, given an expression like ``` do x \leftarrow (A \text{ OrElse B}); \dots ``` we want to know statically to what value x needs to be bound and when this value is valid. In this particular case, we remark that if A yields a value $v_A$ which is valid, then x needs to be bound to $v_A$ ; if A yields a value that is invalid, then x needs to be bound to the value returned by B. In any case, the value bound in x is valid whenever the value returned by A or the value returned by B is valid. More generally, our compilation function takes as argument an arbitrary function, and returns a telescope that binds three values: (1) a *guard*, which denotes the validity of the following components of the tuple; (2) a *value*, which is bound by the telescope to denote the value that was returned by the action; (3) a list of *nested effects*, which are a lax version of the effects that exist at the RTL level. The rationale behind these nested effects is to represent trees of conditional blocks, with writes to state-holding elements at the leaves. (Using this data type, several paths in such a tree may lead to a write to a given memory location; in this case, we use a notion of program order to discriminate between clashing assignments.) Linearizing the Effects. Our second compilation pass flattens the nested effects that were introduced in the first pass. The idea of this translation is to associate two values to each register: a data value (the value that ought to be written) and a write-enable value. The data value is committed (i.e., stored) to the register if the write-enable Boolean is true. Similarly, we associate three values to each register-file: a data value, an address, and a write-enable. The data is stored to the field of the register file selected by the address if the write-enable is true. The heart of this translation is a merge function that takes two writes of the same type, and returns a telescope that encapsulates a single write: ``` Definition merge s (a b : write s): scope (option (write s)) := ... ``` For instance, in the register case, given $(v_a, e_a)$ (resp. $(v_b, e_b)$ ) the data value and the write-enable that correspond to a, the write-enable that corresponds to the merge of a and b is $e_a||e_b$ , and the associated data value is $e_a$ ? $v_a$ : $v_b$ . Moving to RTL. The third pass of our compiler translates the previous intermediate language to RTL, which amounts to a simple transformation into three-address code. This transformation simply introduces new variables for all the intermediate expressions that appear in the computations. #### 2.5 Lightweight Optimizations We will now describe two optimizations that we perform on programs expressed in the RTL language. The first one is a syntactic version of common sub-expression elimination, intended to reduce the number of bindings and introduce more sharing. The second is a semantic common sub-expression elimination that aims to reduce the size of the Boolean formulas that were generated in the previous translation passes. Syntactic Common-subexpression Elimination. We implement CSE with a simple recursive traversal of RTL programs. (Here we follow the overall approach used by Chlipala [8].) Contrary to our previous transformations that were just "pushing variables around" for each possible choice of variable representation V, here we need to tag variables with their symbolic values, which approximate the actual values held by variables. Then, CSE goes as follows. We fold through a telescope and maintain a mapping from symbolic values to variables. For each binder of the telescope, we compute the symbolic representation of the expression that is bound. If this symbolic value is already in the map, we avoid the creation of an extraneous binder. Otherwise, we do create a new binder, and extend our association list accordingly. Using BDDs to Reduce Boolean Expressions. Our compilation process introduces a lot of extra Boolean variables. We use BDDs to implement semantic common-subexpression elimination. We implemented a BDD library in Coq; and we use it to annotate each Boolean expression of a program with an approximation of its runtime value, i.e., a pointer to a node in a BDD. Our use of BDDs boils down to hash-consing: it enforces that Boolean expressions that are deemed equivalent are shared. #### 2.6 Putting It All Together In the end, we prove that our whole compiler is semantics preserving. ``` \label{eq:def-variable} \begin{array}{l} \text{Variable}\; (\varPhi\colon \text{list mem})\; (\texttt{t}:\; \texttt{ty}). \\ \text{Definition fesic} \quad (\texttt{A}:\; \text{Fesi.Action}\; \varPhi\; \texttt{t}): \text{RTL.Block}\; \varPhi\; \texttt{t}:= \\ \text{let}\; \texttt{x}:=\; \text{IR.Compile}\; \varPhi\; \texttt{t}\; \texttt{A}\; \text{in let}\; \texttt{x}:= \text{RTL.Compile}\; \varPhi\; \texttt{t}\; \texttt{x}\; \text{in let}\; \texttt{x}:= \text{CSE.Compile}\; \varPhi\; \texttt{t}\; \texttt{x}\; \text{in BDD.Compile}\; \varPhi\; \texttt{t}\; \texttt{x}. \\ \\ \text{Theorem fesic\_correct}:\; \forall\; \texttt{A}\; (\varGamma\; : \llbracket \varPhi \rrbracket),\; \text{Front.Next}\; \varGamma\; \texttt{A}= \text{RTL.Next}\; \varGamma\; (\text{fesic A}). \\ \end{array} ``` A corollary of this theorem is that, given a certified Fe-Si design, our compiler produces correct by construction RTL code. Therefore, in the following, we will focus on the verification of (families of) Fe-Si designs. ## 3 Design and Verification of a Sorter Core We now turn to the description of a first hardware circuit implemented and proved correct in Fe-Si. A *sorting network* [10] is a parallel sorting algorithm that sorts a sequence of values using only compare-and-swap operations, in a data-independent way. This makes it suitable for a hardware implementation. Bitonic sorters for sequences of length $2^n$ can be generated using short and simple algorithmic descriptions. Yet, formally proving their correctness is a challenge that was only partially solved in two different lines of previous work. First, sorter core generators were studied from a hardware design perspective in Lava [9], but formal proof is limited to circuits with a fixed size – bounded by the performances of the automated verification tools. Second, machine-checked formal proofs of bitonic sort were performed e.g., in Agda [6], but without a connection with an actual hardware implementation. Our main contribution here is to implement such generators and to propose a formal proof of their correctness. More precisely, we implemented a version of bitonic sort as a regular Coq program and proved that it sorts its inputs. This proof follows closely the one described by Bove and Coquand [6] – in Agda – and amounts to roughly 1000 lines of Coq, including a proof of the so-called 0-1 principle<sup>1</sup>. Then, we implemented a version of bitonic sort as a Fe-Si program, which mimicked the structure of the previous one. We present side-by-side the Coq implementation of **reverse** in Fig. 5. The version on the left-hand side can be seen as a specification: it takes as argument a sequence of $2^n$ inputs (represented as a complete binary tree of depth n) and reverses the order of this sequence. The code on the right-hand side implements part of the connection-pattern of the sorter. More precisely, it takes as input a sequence of input variables and builds a circuit that outputs this sequence in reverse order. Next, we turn to the function that is at the heart of the bitonic sorting network. A bitonic sequence is a sequence $(x_i)_{0 \le i < n}$ whose monotonicity changes fewer than two times, i.e., $$x_0 \le \cdots \le x_k \ge \cdots x_n$$ , with $0 \le k < n$ or a circular shift of such a sequence. Given a bitonic input sequence of length $2^n$ , the left-hand side $\min_{\max}$ returns two bitonic sequences of length $2^{n-1}$ , such that all elements in the first sequence are smaller or equal to the elements in the second sequence. The right-hand side version of this function builds the corresponding comparator network: it takes as arguments a sequence of input variables and returns a circuit. <sup>&</sup>lt;sup>1</sup> That is, a (parametric) sorting network is valid if it sorts all sequences of 0s and 1s. ``` (* Lists of length 2^n represented as trees *) Variable A : ty. Inductive tree (A: Type): nat \rightarrow Type := Fixpoint domain n := match n with | L : \forall x : A, tree A 0 | N : \forall n (1 r : tree A n), tree A (S n). 0 \Rightarrow A S n \Rightarrow (domain n) \otimes (domain n) end. Definition leaf \{A\ n\} (t: tree A 0) : A := ... Definition left \{A\ n\} (t: tree A (S n)) : tree A n := ... Definition right \{A\ n\} (t: tree A (S n)) : tree A n := ... Notation T n := tree (expr Var A) n. Notation C n := action nil Var (domain n). Fixpoint reverse {A} n (t : tree A n) := Fixpoint reverse n (t : T n) : C n := match t with match t with L x \Rightarrow L x | L x \Rightarrow ret x | N n 1 r \Rightarrow | N n l r \Rightarrow let r := (reverse n r) in do r ← reverse n r; do 1 ← reverse n 1; let 1 := (reverse n 1) in Nnrl ret [tuple r, 1] end. Notation mk_N x y := [tuple x, y]. Variable cmp: A \rightarrow A \rightarrow A \ast A. {\tt Variable\;cmp}:\;{\tt Var\;A}\;\to\;{\tt Var\;A} → action nil Var (A ⊗ A). Fixpoint min_max_swap {A} n : Fixpoint min_max_swap n : ∀ (1 r : tree A n), tree A n * tree A n := ∀ (1 r : T n), C (S n) := match n with match n with | 0 \Rightarrow \text{fun l r} \Rightarrow | 0 \Rightarrow \text{fun l r} \Rightarrow \mathtt{let}\ (\mathtt{x},\mathtt{y}) \, := \, \mathtt{cmp}\ (\mathtt{leaf}\ \mathtt{l})\ (\mathtt{leaf}\ \mathtt{r})\ \mathtt{in}\ (\mathtt{L}\ \mathtt{x},\ \mathtt{L}\ \mathtt{y}) cmp (leaf 1) (leaf r) | S p \Rightarrow fun 1 r \Rightarrow | S p \Rightarrow fun 1 r \Rightarrow let (a,b) := min_max_swap p (left 1) (left r) in do a,b ← min_max_swap p (left 1) (left r); do c,d ← min_max_swap p (right 1) (right r); let (c,d) := min_max_swap p (right 1) (right r) in ret ([tuple mk_N a c, mk_N b d]) (Npac, Npbd) end. end. Fixpoint sort n : tree A n \rightarrow tree A n := ... Fixpoint sort n : T n \rightarrow C n := ... ``` Fig. 5. Comparing the specification and the Fe-Si implementation We go on with the same ideas to finish the Fe-Si implementation of bitonic sort. The rest of the code is unsurprising, except that it requires to implement a dedicated bind operation of type ``` \forall \ (\texttt{U: ty}) \ \texttt{n}, \quad \texttt{Var} \ (\texttt{domain} \ \texttt{n}) \ \rightarrow \ (\texttt{T} \ \texttt{n} \ \rightarrow \ \texttt{action} \ [] \ \ \texttt{Var} \ \texttt{U}) \ \rightarrow \ \texttt{action} \ [] \ \ \texttt{Var} \ \texttt{U}. ``` that makes it possible to recover the tree structure out of the result type of a circuit (domain n). We are now ready to state (and prove) the correctness of our sorter core. We chose to fix a context where the type of data that we sort are integers of a given size, but we could generalize this proof to other data types, e.g., to sort tuples in a lexicographic order. We side-step the presentation of some of our Coq encoding, to present the final theorem in a stylized fashion. **Theorem 1.** Let I be a sequence of length $2^n$ of integers of size m. The circuit always produces an output sequence that is a sorted permutation of I. (Note that this theorem states the correctness of the Fe-Si implementation against a specification of sorted sequences that is independent of the implementation of the sorting algorithm in the left-hand side of Fig. 5; the latter only serves as a convenient intermediate step in the proof.) **Testing the Design.** Finally, we include ourselves and test a design that was formally proven, using a stock Verilog simulator [17]. We set the word size and the number of inputs of the sorter, and we generate the corresponding Verilog code. Unsurprisingly, the sorter core sorts its input sequence in every one of our test runs. ## 4 Verifying a Stack Machine The circuit that was described in the previous section is a simple combinational sorter: we could have gone one step further in this verification effort and pipelined our design by registering the output of each compare-and-swap operator. However, we chose here to describe a more interesting design: a hardware implementation of a simple stack machine for the IMP language, i.e., a tiny subset of the Java virtual machine. Again, we proceed in two steps: first, we define a specification of the behavior of our stack machine; second, we build a Fe-Si implementation and prove that it behaves as prescribed. The instruction set of our machine is given in Fig. 6, where we let x range over identifiers (represented as natural numbers) and $n, \delta$ range over values (natural numbers). The state of the machine is composed of the code (a list of instruction), a program counter (an integer), a variable stack (a list of values), and a store (a mapping from variables to values). The semantics of the machine is given by a one-step transition relation in Fig. 6. Note that this specification uses natural numbers and lists in a pervasive manner: this cannot be faithfully encoded using finite-size machine words and register files. For simplicity reasons, we resolve this tension by adding some dynamic checks (that do not appear explicitly on Fig. 6) to the transition relation to rule out such ill-defined behaviors. (Note that this is an alternative to the use of finite-size machine words in the model; it would catch assembly programs with bugs related to overflows.) The actual Fe-Si implementation is straightforward. The definition of the internal state is depicted below: the stack is implemented using a register file and a stack pointer; the store is a simple register file; the code is implemented as another register file that is addressed by the program counter. The implementation of the machine is unsurprising: we access the code memory at the address given by the program counter; we case-match over the value of the opcode and update the various elements of the machine state accordingly. For the sake of space, we only present the code for the $\mathtt{setvar}\ x$ instruction below. ``` \begin{array}{lll} \text{Definition pop} := & & \text{Definition Isetvar pc i} := \\ \text{do sp} \leftarrow ! \text{ SP}; & \text{do v} \leftarrow \text{pop}; \\ \text{do x} \leftarrow \text{ read STACK [: sp - 1];} & \text{do } \_ \leftarrow \text{ write REGS [: snd i} \leftarrow \text{v];} \\ \text{do } \_ \leftarrow \text{ SP} ::= \text{sp} - 1; & \text{PC} ::= \text{pc} + 1. \end{array} ``` ``` i := const n C \vdash pc, \sigma, s \rightarrow pc + 1, n :: \sigma, s if C(pc) = const n if C(pc) = \text{var } x | var x C \vdash pc, \sigma, s \rightarrow pc + 1, s(x) :: \sigma, s C \vdash pc, v :: \sigma, s \to pc + 1, \sigma, s[x \leftarrow v] if C(pc) = setvar x setvar x C \vdash pc, n_2 :: n_1 :: \sigma, s \rightarrow pc + 1, (n_1 + n_2) :: \sigma, s \text{ if } C(pc) = \mathsf{add} add C \vdash pc, n_2 :: n_1 :: \sigma, s \rightarrow pc + 1, (n_1 - n_2) :: \sigma, s \text{ if } C(pc) = \text{sub} suh if C(pc) = bfwd \delta bfwd \delta C \vdash pc, \sigma, s \rightarrow pc + 1 + \delta, \sigma, s bbwd \delta C \vdash pc, \sigma, s \rightarrow pc + 1 - \delta, \sigma, s if C(pc) = bbwd \delta boond c \delta C \vdash pc, n_2 :: n_1 :: \sigma, s \to pc + 1 + \delta, \sigma, s if C(pc) = bcond c \delta and c n_1 n_2 C \vdash pc, n_2 :: n_1 :: \sigma, s \rightarrow pc + 1, \sigma, s if C(pc) = bcond c \delta and \neg (c n_1 n_2) halt no reduction ``` Fig. 6. Instruction set and transition relation of our stack machine Correctness. We are now ready to prove that our hardware design is a sound implementation of its specification. First, we define a logical relation that relates the two encodings of machine state (in the specification and in the implementation), written $\equiv$ . Then, we prove a simulation property between related states. **Theorem 2.** Let $s_1$ be a machine state as implemented in the specification and $m_1$ the machine state as implemented in the circuit, such that $s_1 \equiv m_1$ . If $s_1$ makes a transition to $s_2$ , then $m_1$ makes a transition to $m_2$ such that $s_2 \equiv m_2$ . Note that we do not prove completeness here: it is indeed the case that our implementation exhibits behaviors that cannot be mapped to behaviors of the specification. In this example, the specification should be regarded as an abstraction of the actual behaviors of the implementation, which could be used to reason about the soundness of programs, either written by hand or produced by a certified compiler. **Testing the Design.** Again, we compiled our Fe-Si design to an actual Verilog implementation. We load binary blobs that correspond to test programs in the code memory, and run it while monitoring the content of given memory locations. This gives rise to an highly stylized way of computing e.g., the Fibonacci sequence. # 5 Comparison with Related Work Fe-Si marries hardware design, functional programming and inductive theorem proving. We refer the reader to Sheeran [25] for a review of the use of functional programming languages in hardware design, and only discuss the most closely related work. Lava [5] embeds parametric circuit generators in Haskell. Omitting the underlying implementation language, Lava can be described as a subset of Fe-Si, with two key differences. First, Lava features a number of layout primitives, which makes it possible to describe more precisely what should be the hardware layout, yielding more efficient FPGA implementation. We argue that these operators are irrelevant from the point of view of verification and could be added to Fe-Si if needed. Second, while Lava features "non-standard" interpretations of circuits that make it possible to prove the correctness of fixed-size tangible representations of circuits, our embedding of Fe-Si in Coq goes further: it makes it possible to prove the correctness of parametric circuit generators. Bluespec SystemVerilog [20] (BSV) is an industrial strength hardware description language based on non-deterministic guarded atomic actions. BSV features a module system, support for polymorphic functions and modules, support for scheduling directives, and support for static elaboration of programs (e.g., loops to express repetitive code). After static elaboration, a program is a set of rewrite rules in a Term Rewriting System that are non-deterministically executed one at a time. To implement a Bluespec program in hardware, the Bluespec compiler needs to generate a deterministic schedule where one or more rules happen each clock-cycle. Non-determinism makes it possible to use Bluespec both as an implementation language and as a specification language. Fe-Si can be described as a deterministic idealized version of Bluespec. We argue that deterministic semantics are easier to reason with, and that we can use the full power of Coq as a specification language to palliate our lack of non-determinism in the Fe-Si language. Moreover, more complex scheduling can be implemented using a few program combinators [12]: we look forward to implementing these in Fe-Si. Finally, note that using Coq as a meta language makes it possible to express (and verify) static elaboration of Fe-Si programs, similarly to what BSV provides for the core of Bluespec. We still have to determine to what extent some parts of BSV module and type systems could be either added in Fe-Si or encoded in Cog-as-a-meta-language. The synchronous programming language Quartz [24] is part of the Averest project that aims at building tools for the development and verification of reactive systems. Quartz is a variant of Esterel that can be compiled to "guarded commands" [23]. The algorithms underlying this compiler have been verified in the HOL theorem prover, which is similar to our approach of verified high-level synthesis. However, in the Averest tool-chain, the verification (automated or interactive using HOL) takes place at the level of the guarded commands; yet, despite the naming similarity, guarded commands are closer to our RTL effects than to our guarded atomic actions. The main difference between our approaches is the place at which verification occurs: in Fe-Si, verification occurs at the level of circuit generators, written in the source language and where high-level constructs are still present; in Averest, it occurs at the intermediate representation level, in which high-level constructs have been transformed. Richards and Lester [22] developed a shallow embedding of a subset of Bluespec in PVS. While they do not address circuit generators or the generation of RTL code, they proved the correctness of a three-input fair arbiter and a two-process implementation of Peterson's algorithm that complements our case studies (we have not attempted to translate these examples into Fe-Si). Slind et al [26] built a compiler that creates correct-by-construction hardware implementations of arithmetic and cryptographic functions that are implemented in a synthesisable subset of HOL. Parametric circuits are not considered. Centaur Technology and the University of Texas have developed a formal verification framework [27] that makes it possible to verify RTL and transistor level code. They implement industrial level tools tied together in the ACL2 theorem prover and focus on hardware validation (starting from existing code). By contrast, we focus on high-level hardware synthesis, with an emphasis on the verification of parametric designs. ### 6 Conclusion Our compiler is available on-line along with our examples as supplementary material [13]. The technical contributions of this paper are: a certified compiler from Fe-Si, a simple hardware description language, to RTL code; and machine checked proofs of correctness for some infinite families of hardware designs expressed in Fe-Si (parameterized by sizes), which are compiled to correct hardware designs using the above compiler. This work is intended to be a proof of concept: much remains to be done to scale our examples to more realistic designs and to make our compiler more powerful (e.g., improving on our current optimizations). Yet, we argue that it provides an economical path to certification of parameterized hardware designs. **Acknowledgements.** We thank MIT's CSG group for invaluable discussions and comments. Part of this research was done while the first author was visiting MIT from University of Grenoble. This material is based on research sponsored by DARPA under agreement number FA8750-12-2-0110. #### References - 1. Augustsson, L., Schwarz, J., Nikhil, R.S.: Bluespec Language definition (2001) - Benton, N., Hur, C.-K., Kennedy, A., McBride, C.: Strongly Typed Term Representations in Coq. J. Autom. Reasoning 49(2), 141–159 (2012) - 3. Berry, G.: The foundations of Esterel. In: Proof, Language, and Interaction, Essays in Honour of Robin Milner. MIT Press (2000) - Beyer, S., Jacobi, C., Kröning, D., Leinenbach, D., Paul, W.J.: Putting it all together Formal verification of the VAMP. STTT 8(4-5), 411–430 (2006) - Bjesse, P., Claessen, K., Sheeran, M., Singh, S.: Lava: Hardware Design in Haskell. In: Proc. ICFP, pp. 174–184. ACM Press (1998) - Bove, A., Coquand, T.: Formalising bitonic sort in type theory. In: Filliâtre, J.-C., Paulin-Mohring, C., Werner, B. (eds.) TYPES 2004. LNCS, vol. 3839, pp. 82–97. Springer, Heidelberg (2006) - Chlipala, A.: Parametric higher-order abstract syntax for mechanized semantics. In: Proc. ICFP, pp. 143–156. ACM (2008) - 8. Chlipala, A.: A verified compiler for an impure functional language. In: Proc. POPL, pp. 93–106. ACM (2010) - Claessen, K., Sheeran, M., Singh, S.: The design and verification of a sorter core. In: Margaria, T., Melham, T.F. (eds.) CHARME 2001. LNCS, vol. 2144, pp. 355–369. Springer, Heidelberg (2001) - Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press and McGraw-Hill Book Company (2001) - 11. Coupet-Grimal, S., Jakubiec, L.: Certifying circuits in type theory. Formal Asp. Comput. 16(4), 352–373 (2004) - Dave, N., Arvind, Pellauer, M.: Scheduling as rule composition. In: Proc. MEM-OCODE, pp. 51–60. IEEE (2007) - 13. http://gallium.inria.fr/~braibant/fe-si/ - 14. Gordon, M.: Why Higher-Order Logic is a Good Formalism for Specifying and Verifying Hardware (1985) - 15. Gordon, M.J.C.: Relating Event and Trace Semantics of Hardware Description Languages. Comput. J. 45(1), 27–36 (2002) - Hanna, F.K., Daeche, N., Longley, M.: Veritas<sup>+</sup>: A Specification Language Based on Type Theory. In: Leeser, M., Brown, G. (eds.) Hardware Specification, Verification and Synthesis: Mathematical Aspects. LNCS, vol. 408, pp. 358–379. Springer, Heidelberg (1990) - 17. http://iverilog.icarus.com/ - Hunt Jr., W.A., Brock, B.: The Verification of a Bit-slice ALU. In: Leeser, M., Brown, G. (eds.) Hardware Specification, Verification and Synthesis: Mathematical Aspects. LNCS, vol. 408, pp. 282–306. Springer, Heidelberg (1990) - Leroy, X.: A formally verified compiler back-end. Journal of Automated Reasoning 43(4), 363–446 (2009) - 20. Nikhil, R.S., Czeck, K.R.: BSV by Example (2010) - Pfenning, F., Elliott, C.: Higher-Order Abstract Syntax. In: Proc. PLDI, pp. 199–208. ACM (1988) - Richards, D., Lester, D.R.: A monadic approach to automated reasoning for Bluespec SystemVerilog. ISSE 7(2), 85–95 (2011) - 23. Schneider, K.: Embedding Imperative Synchronous Languages in Interactive Theorem Provers. In: Proc. ACSD, pp. 143–154. IEEE Computer Society (2001) - Schneider, K.: The Synchronous Programming Language Quartz. Technical report, University of Kaiserslautern (2010) - Sheeran, M.: Hardware Design and Functional Programming: a Perfect Match. J. UCS 11(7), 1135–1158 (2005) - 26. Slind, K., Owens, S., Iyoda, J., Gordon, M.: Proof producing synthesis of arithmetic and cryptographic hardware. Formal Asp. Comput. 19(3), 343–362 (2007) - Slobodová, A., Davis, J., Swords, S., Hunt Jr., W.A.: A flexible formal verification framework for industrial scale validation. In: Proc. MEMOCODE, pp. 89–97. IEEE (2011) - 28. IEEE Standard System C Language Reference Manual (2006)