1 Introduction

A compiler can be quite an opaque piece of software. It may generate large amounts of intermediate code, structured for other compiler phases—not humans—to work with. Ideally, a developer would want to understand each transformation of the compiler, not only by reading the compiler code but also by observing the intermediate representations and seeing how they relate to each other. If a debugger is available, the developer might observe the compiler transformations by stepping through them and watching them happen. Or, they might print intermediate representations and trace the expressions of interest between the representations. Both of these processes can be tedious and error-prone. This project suggests a new way of observing a compiler’s behavior and making it more accessible to study. The approach is specifically tailored to the structure of a functional compiler, written as a pure function from source program to machine words. By regarding each compilation phase as performing a reduction of this function, we can relate each expression in each intermediate language to the expressions it gets reduced into, letting a developer see how their program is translated, step by step, or several steps at once, without needing to trace expressions manually.

For this project, we use the CakeML compiler, which is written as a pure function. CakeML is a verified implementation of a subset of Standard ML [11]. The compiler backend is proven to preserve the semantics of the input program at every step, from source code to machine code [13]. This is achieved by developing each compiler phase alongside a theorem of its correctness [10]. The CakeML project also stands out in the research field of verified compilation as the first such compiler to be bootstrapped, meaning it can create an executable of itself, complete with a proof of the correctness of that executable [13]. The CakeML compiler is particularly well suited for this project as it uses many small, well-defined compilation steps, making the compilation process relatively easy to follow.

We turn to an example of the mentioned opaqueness of compilers. Consider a simple program in CakeML:

figure a

Even when we take such a simple program and run it through the frontend and first phase of the CakeML compiler, the result is this quite complex-looking intermediate representationFootnote 1:

figure b

Four compiler phases later, the compiled program has grown even more complex, and has become hard for humans to read:

figure c

The developers of CakeML have expressed a desire for a tool which could output intermediate representations like the ones shown above, with some extended capabilities. The tool would let the user choose two or more intermediate representations to view and allow clicking any part of the source code or intermediate code, which would cause the corresponding expressions in the other representations to be highlighted. For example, clicking 5 in the source code above would highlight Lit (IntLit 5) in the intermediate representations. The belief is that this sort of click highlighting will make it easier for the developers to find possible optimizations to add to the compiler, and also easier to teach new developers how the compiler works.

The purpose of this paper is to describe how we have made the necessary extensions to the CakeML compiler to allow tracking expressions as they move through phases and to show a proof-of-concept implementation of an interactive tool of the sort the CakeML developers have asked for. The key contributions we make are the following:

  • We introduce a recursive datatype, described in Sect. 2, called trace, which is used to encode the ancestry of expressions as they move through the compiler. We then annotate every expression in each intermediate language of the compiler with a trace. By comparing the traces of two expressions, we can decide whether they are related, i.e., whether one was created from the other during compilation. The datatype also encodes source positions so that, by looking at an expression in an intermediate state, one can determine in which exact position or positions in the source code the expression originated.

  • We introduce three new intermediate languages which are used to compile the program state after each phase into JSON format [6], complete with the encoded ancestry information. In Sect. 3.1 we describe how we turn intermediate compiler representations into a standard format. From there, we can transform the program into a simple structure, described in Sect. 3.2, which can then easily be output in the desired format, in our case JSON.

  • We present a simple web interface in Sect. 3.4 which displays the state of the compiler after different phases and which can highlight expressions in different states which are related.

Our solution is designed to work with a pure functional compiler.Footnote 2 The solution is constrained by the proved correctness of the compiler: since the compiler proof demonstrates that the semantics of the input program are preserved at each compiler phase, any large structural change to the compilation process may cause significant trouble in updating intermediate proofs. We cope with these constraints by introducing only small changes to the intermediate languages, carrying just the information we need to relate expressions to each other, and by adding an orthogonal translation process outside the standard compilation pipeline. This process makes outputting the necessary information a simple matter and is free from the burden of correctness proofs. We have not attempted to prove any properties of this process at this stage, such as maintaining the well-formedness of traces, but have instead focused on interfering as little as possible with the existing proofs of compiler correctness.

2 Tracking Expressions

The first problem we have solved is that of encoding which expressions in different compiler states are related. Our solution is a recursive datatype which we use to annotate expressions in the compiler’s intermediate languages. We call this datatype “trace”, or tra in the compiler code. Trace values serve as fingerprints of origin: all expressions with the same trace share an origin, and traces can be built upon to create new, unique fingerprints that still preserve the information about their origin. At a later point, we use this information to display the ancestry of expressions through a graphical interface, as will be described in Sect. 3.4.

Annotating expressions means simply adding a trace as the first argument to expression constructors.Footnote 3 For example, an expression constructor that previously took only its subexpressions and other data becomes a constructor whose first argument is a tra, with the original arguments following.

With this approach, each expression carries information about its origin that can be read at any point of compilation, while not requiring any more updates to the correctness proofs than ignoring the tra parameter. The trace is not carried along to the machine-code output of the compiler and is thus relieved of the burden of having to be proved to preserve semantics. We will show in Sect. 3 how the trace is read together with the rest of the program after each compiler phase, through a separate translation process. This allows us to limit the total changes to the compiler itself to adding the traces and the logic for building them.

Figure 1 shows the definition of the \(\texttt {tra}\) datatype. It is a recursive type with two non-recursive constructors, SourceLoc and None. SourceLoc is the constructor used to terminate recursion. We use None only for the particular purpose of turning traces off when they are not needed, e.g., when compiling normally. There are also two recursive constructors, Cons and Union. Here, Cons is an infix constructor, which can also be written in prefix form.

Fig. 1.
figure 1

\(\texttt {tra}\) datatype. Used for encoding the path any one expression has taken through the compiler, where Cons encodes expressions being split and Union encodes expressions being merged.
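To make the datatype concrete, here is a small Python sketch of the four constructors of tra, encoded as tagged tuples. The encoding and the helper names are ours, purely for illustration; they do not reflect the compiler's HOL definitions.

```python
# Illustrative encoding of the four `tra` constructors as tagged tuples
# (representation assumed, not CakeML's):
def source_loc(l1, c1, l2, c2):   # non-recursive: a source position leaf
    return ("SourceLoc", l1, c1, l2, c2)

def cons(t, n):                   # recursive: extend a trace with a number
    return ("Cons", t, n)

def union(t1, t2):                # recursive: join two traces
    return ("Union", t1, t2)

NONE_TRACE = ("None",)            # traces switched off, e.g. normal compiles

# Initial trace of the expression `3 +` in `val x = 3 + 5;`
# (line 1, characters 9-11):
t0 = source_loc(1, 9, 1, 11)
```

Building on a trace with `cons` yields distinct values, so two expressions split from the same origin get unique fingerprints that still contain the original position.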

Using this datatype, we start by giving every expression in the source AST a unique trace. We do this by taking the source position and encoding it in a SourceLoc. In the example program val x = 3 + 5; the source position of the partially applied function expression 3 + is from line 1, character 9, up to and including line 1, character 11, and the initial trace for this expression would be:

\(\texttt {SourceLoc}\;1\;9\;1\;11\)  (1)

For all compiler phases following the initial one, traces are only allowed to grow or remain unchanged, never shrink. Traces grow either by extending a trace with Cons or by joining two traces together with Union. This gives us three possible ways the trace of an expression may be altered as the expression is transformed during a compiler phase. Each correlates to a way that the expression itself may change, as this is exactly what the trace is supposed to inform us about.

The first case is when a compiler phase transforms an expression into exactly one new expression. In this case, the trace will get passed on as is to the new expression, which informs us no splitting or merging of expressions has taken place.

The second case is when an expression is turned into more than one new expression by a compiler phase. For example, when transforming an expression such as (fn x => foo x (x + 5)) 4 into foo 4 (4 + 5) by inlining, we want to be able to show that both instances of 4 in the result originate from the same 4 before the transformation. When this happens, the Cons constructor is used to decorate the old trace with new numbers, starting from 1, producing as many new unique traces as necessary. For example, if the expression carrying the trace shown in (1) gets split into two expressions at a compiler phase, we would invent two new traces, the only difference between them being the outermost number:

\(\texttt {Cons}\;(\texttt {SourceLoc}\;1\;9\;1\;11)\;1\qquad \texttt {Cons}\;(\texttt {SourceLoc}\;1\;9\;1\;11)\;2\)  (2)

The third and last case to consider is when expressions get merged. Merging is rarer than splitting but is to be expected when performing certain optimizations, e.g., constant folding. If, for example, the expressions 3 + and 5 are folded into one expression, 8, we need some way of knowing that 8 originates in both 3 + and 5. Using the Union constructor, we can join the traces of 3 + and 5 together, to obtain the following trace:

\(\texttt {Union}\;(\texttt {SourceLoc}\;1\;9\;1\;11)\;(\texttt {SourceLoc}\;1\;13\;1\;13)\)  (3)

To state the rules for growing traces formally: whenever a compiler function sees two or more expressions being defined on the right-hand side of the definition, and only one on the left, we use Cons to split the trace into several new traces, which we attach to the new expressions. Conversely, whenever two or more expressions appear on the left-hand side of a definition, and only one appears on the right, we apply Union to merge the traces.
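The two growth rules can be sketched in Python, with traces again encoded as tagged tuples; `split` and `merge` are our illustrative names, not functions in the compiler.

```python
# Traces as tagged tuples (an assumed encoding, for illustration only):
#   ("SourceLoc", l1, c1, l2, c2) | ("Cons", t, n) | ("Union", t1, t2)

def split(trace, n):
    # One expression on the left, n expressions on the right:
    # decorate the old trace with 1..n, giving n unique new traces.
    return [("Cons", trace, i) for i in range(1, n + 1)]

def merge(t1, t2):
    # Two expressions on the left, one on the right: join both traces.
    return ("Union", t1, t2)

t_plus = ("SourceLoc", 1, 9, 1, 11)    # trace of `3 +`
t_five = ("SourceLoc", 1, 13, 1, 13)   # trace of `5`
```

Splitting `t_plus` in two reproduces the pair of traces in (2); merging `t_plus` and `t_five` reproduces the trace in (3).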

To decode ancestry from traces, i.e., to determine whether one expression is the descendant of another, we look at the traces of both expressions. An expression \(e_1\) with the trace \(t_1\) is the ancestor of another expression \(e_2\) with the trace \(t_2\), if and only if \(t_2\) can be derived from \(t_1\). For a familiar metaphor, we can view traces as binary trees, where a Cons constructor is a node with a numerical value and a left child, Union is a node with no value and two children, and SourceLoc a leaf node with a four-tuple of numerical values. Then we can say with certainty that \(e_2\) was derived from \(e_1\) if and only if \(t_1\) is a subtree of \(t_2\). Note that when \(t_1\) and \(t_2\) are identical, this approach considers \(e_1\) and \(e_2\) to be ancestors of each other, and we must make use of the order of their respective intermediate languages to determine which is the ancestor and which is the descendant.

The following pseudocode function determines whether one trace is exactly the same as another, by checking that the traces have the same construction, including the same original source positions. Traces containing None are considered invalid for encoding ancestry, and using such a trace will always give the result False.

figure d

Using the above function, we can construct the following function that determines if one trace is the ancestor of another trace. It does not consider two equal traces to be ancestors of each other.

figure e

This function deconstructs the second trace one step at a time, at each step checking for equality with the first trace. If the second trace is a SourceLoc, then the second trace cannot have been constructed from the first, and the result is False.
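A possible Python rendering of both pseudocode functions, with traces encoded as tagged tuples (the representation is an assumption for illustration, not the compiler's HOL terms):

```python
# Traces: ("SourceLoc", l1, c1, l2, c2) | ("Cons", t, n)
#       | ("Union", t1, t2) | ("None",)

def same_trace(t1, t2):
    # Traces containing None are invalid for encoding ancestry -> False.
    if t1[0] == "None" or t2[0] == "None":
        return False
    if t1[0] != t2[0]:
        return False
    if t1[0] == "SourceLoc":
        return t1 == t2
    if t1[0] == "Cons":
        return t1[2] == t2[2] and same_trace(t1[1], t2[1])
    return same_trace(t1[1], t2[1]) and same_trace(t1[2], t2[2])  # Union

def is_ancestor(t1, t2):
    # Deconstruct t2 one step at a time, checking for equality with t1.
    # Equal traces are NOT considered ancestors of each other.
    if t2[0] in ("None", "SourceLoc"):
        return False  # a leaf (or invalid trace) was not built on top of t1
    if t2[0] == "Cons":
        return same_trace(t1, t2[1]) or is_ancestor(t1, t2[1])
    return (same_trace(t1, t2[1]) or is_ancestor(t1, t2[1])      # Union
            or same_trace(t1, t2[2]) or is_ancestor(t1, t2[2]))
```

With the traces from the running example, the original `SourceLoc` trace is recognized as an ancestor of any trace built on it with `Cons` or `Union`, but not of itself.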

3 Presentation

Since the existing compiler implementation is a pure function, in which the intermediate representations have no memory of previous reductions, it is necessary that we add some information to intermediate expressions which maintains this memory. This is achieved in a relatively non-intrusive way through the trace datatype, as explained in the previous section. The remainder of the logic for relating expressions and presenting them then occurs without interfering with the regular compilation. To achieve this, we introduce three intermediate languages, and translations from intermediate compiler states to these languages.

Fig. 2.
figure 2

Flow of a program through the modified compiler. Horizontal arrows indicate compilation with machine-code as the target, and vertical arrows represent translation with JSON as the target. Dashed boxes and arrows are possible but unimplemented alternative output formats.

This section covers our second contribution to the CakeML compiler, namely a sideways translation of sorts that allows us to output every intermediate compiler state without further interference with the regular compilation. With this facility, we can translate the program into a semantics-free JSON representation. Each step of the translation serves a specific purpose. These are, respectively: to move to a more flexible intermediate language without formal semantics and thereby avoid the burden of correctness proofs; to structure the representations, which cover all intermediate languages, into a simple yet comprehensive format; and to translate the structured representation into JSON. The final conversion to JSON is done in a thin, outer layer, and it is a simple exercise to implement other output formats with minimal changes. The JSON representation can then be used to present the intermediate states of regular compilation, as well as to relate expressions in different intermediate representations to each other in the presentation, as will be explained in Sect. 3.4.

Fig. 3.
figure 3

PresLang exp datatype. In total, there are 37 constructors, and more are added as our explorer adds support for more intermediate languages. As can be seen, declarations, expressions, and patterns are not separated, since semantics and well-typedness are no longer a concern.

Figure 2 shows the flow through intermediate languages in the compiler. Each box contains the name of an intermediate language. Horizontal arrows represent regular compilation with machine-code as the target. Vertical arrows represent translation through our newly introduced intermediate languages, with JSON as the target.

3.1 Standard Presentation Format

The first step towards translating the intermediate representations into JSON is to move to a unified representation that captures the syntax of all existing intermediate languages. This representation has no formal semantics, since it will never be transformed into a runnable program, and therefore does not need correctness proofs. We want to create a single, unified type to represent the intermediate languages, so that we can then write a single function to translate this representation further, without having to take into account which intermediate language is being represented. Keeping with the spirit of the CakeML compiler, we make this first translation step from the existing intermediate languages small and almost trivial to implement, rather than translating directly into the smaller type which we will introduce in Sect. 3.2.

This first step is handled by the intermediate language PresLang. This language gathers every declaration, expression, and pattern constructor of the existing intermediate languages into a single type. Figure 3 shows an abbreviation of the PresLang exp datatype. Since this type will only ever be presented, not evaluated, the semantics and well-typedness of the program are irrelevant. Therefore, we do not need to separate declarations, expressions, and patterns, but can treat them as being of the same type, simplifying the language and further translation.

Here we also run into an issue we have not yet mentioned, one that limits our approach: most existing intermediate languages are represented by distinct types. Thus, a function from one intermediate language will not be able to accept an identical-looking value from another intermediate language, since the types are incompatible. As the total number of intermediate languages is small and unlikely to change often, as are the types that represent each intermediate language, we cope with this issue by simply writing one function from each type to PresLang, which is quickly done.

3.2 Structuring Output

The language discussed in the previous section, PresLang, groups all declarations, expressions, and patterns from all intermediate languages into a single expression type. This type constitutes a small transformation step from the existing intermediate languages but is unwieldy to work with. Thus, our next order of business is to condense PresLang expressions into a small and handy format to provide simplicity and consistency in the layer right before the final output layer.

We created the intermediate language DisplayLang for this purpose. DisplayLang has a single datatype, sExp, with three constructors: Item, Tuple, and List.

An \(\texttt {sExp}\) expression may either be an Item with an optional trace, a name, and a list of expressions; a Tuple; or a List. The Tuple and List constructors correspond to the tuples and lists of expressions that intermediate language constructors take, while other source-level primitives are expressed by a string representation in an Item constructor. This simple structure can capture the entirety of PresLang, while having an easy-to-traverse tree structure, with strings and traces as the only information in the nodes. The addition of tuples and lists also makes for easy printing, as it is only a matter of printing the string contents of Item, surrounding lists with square brackets and tuples with parentheses, and proceeding recursively. As an example, for the final, printed output to show the expression Lit (IntLit 5), the expression would be encoded in DisplayLang as an Item carrying the expression's trace and the name Lit, whose argument list holds an Item named IntLit, which in turn holds an Item for the literal 5.
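The recursive printing scheme can be sketched in a few lines of Python; the tuple encoding of sExp and the exact spacing are assumptions made for illustration:

```python
# sExp as tagged tuples (assumed encoding):
#   ("Item", trace_or_None, name, [args]) | ("Tuple", [args]) | ("List", [args])

def pretty(e):
    """Flat-print an sExp: Item prints its name and arguments, lists get
    square brackets, tuples get parentheses, recursively throughout."""
    tag = e[0]
    if tag == "Item":
        _, _trace, name, args = e
        return " ".join([name] + [pretty(a) for a in args]) if args else name
    if tag == "Tuple":
        return "(" + ", ".join(pretty(a) for a in e[1]) + ")"
    return "[" + ", ".join(pretty(a) for a in e[1]) + "]"   # List

lit = ("Item", None, "IntLit", [("Item", None, "5", [])])
```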

By going through this small language after PresLang, the final conversion to JSON becomes very thin, making it easy to add new output formats as needed. Perhaps even more importantly, this set-up provides consistency. The sExp datatype enforces a neat and predictable structure that can be directly translated into a JSON tree structure of named objects and lists, which can then be printed without knowledge of the underlying complexity in PresLang.

In the end, we translate DisplayLang into JSON via a simple scheme: an Item becomes an object with a name field containing the name of the original constructor, e.g., “Prog”, its trace converted into a similar structure, and an array of objects in the field args; a Tuple is an object with a boolean isTuple flag and an array of args; and a List is simply turned into an array. The web interface presents this structure by printing it in flat form, with parentheses, and uses the traces to associate parts of the presented strings with parts of other intermediate representations.
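A sketch of this scheme in Python; the tuple encoding of sExp and the treatment of traces as ready-made JSON values are our assumptions:

```python
import json

# sExp as tagged tuples (assumed encoding):
#   ("Item", trace_or_None, name, [args]) | ("Tuple", [args]) | ("List", [args])

def sexp_to_json(e):
    tag = e[0]
    if tag == "Item":
        _, trace, name, args = e
        return {"name": name,
                "trace": trace,  # assumed already JSON-friendly here
                "args": [sexp_to_json(a) for a in args]}
    if tag == "Tuple":
        return {"isTuple": True, "args": [sexp_to_json(a) for a in e[1]]}
    return [sexp_to_json(a) for a in e[1]]   # List -> plain JSON array

lit = ("Item", [1, 13, 1, 13], "IntLit", [("Item", None, "5", [])])
encoded = json.dumps(sexp_to_json(lit))
```

Because sExp has only three shapes, this final layer stays thin; swapping JSON for another output format would mean replacing only this one function.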

3.3 From de Bruijn Indices to Variable Names

Here we pause to note a special situation that occurs in translations into PresLang. In PatLang (the first intermediate language in the compiler chain which does not use pattern matches) and the two following intermediate languages, the compiler uses the notational scheme known as de Bruijn indexing [7]. This notation is convenient for compiler implementation and conducting proofs, as it removes possible naming conflicts in \(\beta \)-reduction. However, it is hard for humans to read, and we therefore replace the de Bruijn indices with standard variable names when presenting the compiler states.

As an example, consider the following expression in CakeML, which declares two mutually recursive local functions:

figure f

Figure 4 shows the result of calling pat_to_pres_exp, which converts a PatLang expression to PresLang, on a simplified version of this expression in PatLang. Look carefully at the occurrences of the numbers 0 and 1 before the conversion. The number 0 refers to either the input variable of the functions or to the first function, depending on context. Also, the first function is referred to as either 1 or 0, depending on context. In the result, each of the function definitions is converted to a tuple with three elements: (1) the function’s name; (2) the name of its input variable; and (3) the function body. The first function is named “b” and the second “a”. In all three occurrences of App, the two functions are referred to using the explicit names “a” and “b”.Footnote 4

Fig. 4.
figure 4

Example of replacing de Bruijn indices with variable names. In the result, names have been assigned to the mutually recursive functions inside Letrec, and the functions are referred to using the explicit names “a” and “b”.

Figure 5 shows two clauses of the function that performs this conversion. The number \( h \) is carried along at each call and counts how many bound variables are currently in scope, which is used to assign a unique variable name using a function, \({\textsf {num\_to\_varn}}\), that turns natural numbers into identifier strings.

Fig. 5.
figure 5

Removing de Bruijn indices in the conversion to PresLang. The function pat_to_pres_exp takes a number, \( h \), in addition to an expression. \( h \) is the number of variables currently bound in the scope. This value is used to compute fixed variable names from de Bruijn indices.
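For intuition, here is a Python sketch of such a naming scheme. The exact letter scheme of num_to_varn and the index arithmetic shown are our assumptions, not necessarily what the compiler implements:

```python
def num_to_varn(n):
    """Turn a natural number into an identifier string; one possible
    scheme: 0 -> 'a', ..., 25 -> 'z', 26 -> 'aa', 27 -> 'ab', ..."""
    s = ""
    while True:
        s = chr(ord("a") + n % 26) + s
        n = n // 26 - 1
        if n < 0:
            return s

def debruijn_to_name(index, h):
    """With h bound variables in scope, a de Bruijn index counts binders
    outward from the use site; one consistent naming maps index i to the
    name of the (h - 1 - i)-th variable introduced."""
    return num_to_varn(h - 1 - index)
```

The key property is only that the mapping is consistent: every reference with the same meaning gets the same printable name, regardless of how the indices shift with scope depth.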

3.4 Web Application

Figure 6 shows our web application, with highlights on the expressions corresponding to the literal 5 in our running example, val x = 3 + 5;. We print the languages horizontally and without pretty-printing for this proof-of-concept version of the web application. Around each object with a trace (corresponding to an Item in sExp), a \(\texttt {<\!\!span\!\!>}\) element is placed, which we can use to highlight it. For each expression, we maintain information about its trace. When an expression in any language is clicked, its trace is sent via an event that is used to find and highlight all related expressions in the other intermediate representations.

Fig. 6.
figure 6

Web application after compilation, and after clicking an expression. Three different intermediate languages are shown in the bottom image, with \({\textsf {IntLit}}\;5\) clicked in the middle one. In reality, these expressions are not lined up, but may appear further apart. This figure shows them cropped together for convenience.

4 Related Work

What we have achieved is a way of presenting a compiler’s inner workings by both showing intermediate representations, and how each expression in each intermediate representation relates to expressions before and after it. To the best of our knowledge, there is no other tool available that combines these features. We will briefly mention a selection of previous compiler explorers and their features. They have in common that they perform one of these feats—displaying intermediate states of compilation or relating sections of source code and target code—but not both.

The HERMIT tool is a plugin to the Glasgow Haskell Compiler, GHC, that lets Haskell programmers interact directly with the compilation of their source program [8]. By pausing compilation and allowing the programmer to view and interactively perform transformations on the GHC Core language AST, the programmer can find possible optimizations to their source program. HERMIT thus has powerful capabilities not present in the Compiler Explorer presented in this paper. HERMIT does not, however, implement something similar to our expression tracking, but relies on the programmer making manual edits and using version control of the edits to observe different outcomes.

There exists at least one tool for relating source code in C++, D, Rust, and Go with the corresponding assembly code, line by line [3]. In it, one can highlight code in the source or the assembly, much the same way as one can in our interface described in Sect. 3.4. However, the tool does not show any intermediate states, and it is therefore not possible to follow the compiler’s inner workings step by step via the tool.

There also exist several tools for outputting intermediate compiler states and information about them. Notably, there is an old, deprecated Compiler Explorer for CakeML [2], developed by Yong Kiam Tan. There is also an unparser for an educational compiler made by Sarkar et al. [12] which can pretty-print the compiler’s intermediate states, and the LLVM Visualization Tool [4] which can display intermediate compiler states in a number of ways, such as with call hierarchies and control-flow graphs. What all these tools lack, however, is the ability to relate sections of code across the intermediate languages. One is presented only with the complete intermediate states, without any facility for relating expressions in them. For anything but very simple programs, it becomes very difficult to follow the actual translation of sections of code, and consequently what transformations the compiler actually performs at intermediate steps. On this basis, we believe our approach is viable and fills an important gap.

5 Summary

We have shown how to extend the pure functions of a verified compiler to track expressions as they pass through compilation, and how this tracking information can be presented. To make expression tracking possible, we introduce the tra datatype and show how its values can be built gradually to encode ancestry. To output this information, we introduce three new intermediate languages whose special purpose is a translation that does not interfere with regular compilation but can still show its intermediate results. We have also developed a basic interactive interface for visualizing the tracking information.

The changes to the compiler are included in the master branch of the official CakeML repository [1] and our web application can be found in a separate repository [5].