Keywords

1 Introduction

Bijective encodings of tree-like structures go back to Ackermann’s bijection between natural numbers and hereditarily finite sets [1]. They are relatively easy to design if one does not care about one side of the bijection exponentially exploding in size, as it is the case, for instance, with Ackermann’s bijection.

With significant effort, such size-proportionate bijections between term algebras and the set of natural numbers represented with the usual binary notation are defined in [21], using ranking of balanced parentheses languages and a generalization of Cantor’s pairing function [5, 16] to tuples. However, the binary search and complex computations involved in the ranking algorithms limit the encoding described in [21] to relatively small terms and numbers not larger than about 2000 bits.

A more revolutionary approach has been sketched out in [23]. Instead of trying to adjust the bijective Gödel numbering scheme to be size-proportionate as a bijection to bitstring-represented numbers, [23] replaces its target: natural numbers are represented as binary trees. This paper generalizes that approach to an arbitrary member of the Catalan family of combinatorial objects [20], on which the usual arithmetic operations are defined. At the same time, it lifts a limitation on the size-proportionate encoding of [23] where that property is lost unless de Bruijn indices fit in a fixed size word. The generalization of binary-tree arithmetic to Catalan objects, on which we rely in this paper, is described extensively in the unpublished arxiv draft [22], a subset of which is covered in [25].

Following [22], a Haskell type class will be used to abstract away the number representation. This has the benefit of having arithmetic operations implemented by any instance of the Catalan family of combinatorial objects. It will also enable trying out our algorithms on the usual “human friendly” natural numbers, which can be seen, via a bijective transformation, as such an instance.

The arithmetic operations performed with the Catalan family based numbering system can work with numbers comparable in size with Knuth’s “arrow-up” notation. These computations have a worst case and average case complexity that is comparable with the traditional binary numbers, while their best case complexity outperforms binary numbers by an arbitrary tower of exponents factor.

More importantly, encodings of lambda terms, that can be seen as a tree-to-tree transformation, are naturally size-proportionate.

Our bijective encoding to tree-based number systems will provide the means to derive an algorithm for the generation of random lambda terms from well-known random generation algorithms for binary trees (Rémy’s algorithm). Random lambda terms (and in particular, the very large ones our encoding enables) can be useful for testing tools where they play the role of an intermediate language, like compilers for functional languages and proof assistants.

By adding a normal order reducer of our lambda terms we provide a uniform representation for computations with lambda terms, and efficient arithmetic operations - two essential building blocks of modern functional languages.

Together, these have applications to implementation of domain specific languages, compiler stages and proof assistants relying on lambda terms as their intermediate language.

The paper is organized as follows. Section 2 introduces the compressed de Bruijn terms and bijective transformations from them to standard lambda terms. Section 3 describes mappings from lambda terms to Catalan families of combinatorial objects. These mappings lead to size-proportionate ranking and unranking algorithms for lambda terms. Section 4 gives an algorithm for normal order reduction of lambda terms that also extends to their Catalan encodings. Section 5 relates combinatorial generation of Catalan objects and that of lambda terms. Section 6 introduces algorithms for generation of random lambda terms. Section 7 discusses related work. Section 8 concludes the paper.

The paper is organized as literate Haskell program. The code in the paper is available at http://www.cse.unt.edu/~tarau/research/2015/XDB.hs, tested with GHC 7.10.2. It defines a module that includes code from [25] (file GCcat.hs, a superset of which is available from the arXiv draft [22]), which defines a type class for arithmetic operations with Catalan objects, generically. It also includes Haskell library packages needed for the generation of random binary trees.

figure a

To achieve a size-proportionate bijective Gödel numbering scheme, all our arithmetic computations will be performed with members of the type class Cat which provides a generic implementation in terms of members of the Catalan family of combinatorial objects [22], in particular binary or multiway trees with empty leaves.

2 A Compressed Representation of de Bruijn Terms

We will summarize here a compressed representation for lambda terms in de Bruijn notation introduced as Prolog program in [24], that will facilitate defining a bijection to tree-represented natural numbers.

2.1 De Bruijn Indices

De Bruijn indices [4, 13] provide a name-free representation of lambda terms. All terms that can be transformed by a renaming of variables (\(\alpha \)-conversion) will share a unique representation. Variables following lambda abstractions are omitted and their occurrences are marked with positive integers counting the number of lambdas until the one binding them is found on the way up to the root of the term.

We represent them using the constructor Ab for application, Lb for lambda abstractions (that we will call shortly binders) and Vb for marking the integers corresponding to the de Bruijn indices. This gives the Haskell data type B a as the definition of the de Bruijn terms parameterized by the type a of the indices used by Vb.

figure b

For instance, when the parameter a is specialized to ordinary integers, the S combinator \( ~\lambda x. \lambda y. \lambda z.(x~z)~(y~z) \) is represented as Lb (Lb (Lb (Ab (Ab (Vb 2) (Vb 0)) (Ab (Vb 1) (Vb 0))))), corresponding to the fact that Vb 2 is bound by the outermost lambda (three steps away, counting from 0) and the occurrences of Vb 0 are bound each by the closest lambda on the way to the root, represented by the third constructor Lb.

2.2 Compressed de Bruijn Terms

Iterated lambdas (represented as a block of constructors Lb in the de Bruijn notation) can be seen as a successor arithmetic representation of a number that counts them. So it makes sense to represent that number in a more efficient numbering system. Note that in de Bruijn notation blocks of lambdas can wrap either applications or variable occurrences represented as indices. This suggests using just two constructors: Vx indicating in a term Vx k n that we have k lambdas wrapped around the de Bruijn index Vb n and Ax, indicating in a term Ax k x y that k lambdas are wrapped around the application Ab x y.

We call the terms built this way with the constructors Vx and Ax compressed de Bruijn terms. They are specified by The Haskell data type X.

figure c

For instance, the S combinator Lb (Lb (Lb (Ab (Ab (Vb 2) (Vb 0)) (Ab (Vb 1) (Vb 0))))) in de Bruijn notation, will be represented as Ax 3 (Ax 0 (Vx 0 2) (Vx 0 0)) (Ax 0 (Vx 0 1) (Vx 0 0)), with the outermost constructor Ax encoding the three Lb binders and k=0 elsewhere indicating the presence of no lambda binder in (front of) applications Ax k or indices Vx k. Note also that lambda binders counted by k in a leaf term Vx k n can bind at most one variable as no application splits the tree below them.

Open and Closed Terms. Lambda terms might contain free variables not associated to any binders. Such terms a called open. Any syntactically well formed term of types B and X is an open term. A closed term is such that each variable occurrence is associated to a binder. Closed terms can be easily identified by ensuring that the lambda binders on a given path from root outnumber the de Bruijn index of a variable occurrence ending the path.

To facilitate size-proportionate encodings of lambda terms, arithmetic operations in this paper will be performed in terms of tree instances of the type class Cat , described in the companion paper [25], a superset of which is available as [22]. The function isClosedX checks that a compressed de Bruijn term is closed by trying to find a lambda binding every index on the way up to the root of the lambda tree. The addition operation add and successor function s, defined for instances of the type class Cat (see [22, 25]), will be used to count binders and the comparison operation cmp (see [22]) will ensure that binders on the way down from the root outnumber index values at the leaves of the lambda tree.

figure d

Example 1

isClosedX on the K combinator \(\lambda x_0 . (\lambda x_1 . ~ x_0)\), written (Vx 2 1) as a compressed de Bruijn term, and a similar small open term. Note the use of both Cat instances N and T parameterizing our (compressed) de Bruijn terms.

figure e

2.3 Converting from de Bruijn to Compressed de Bruijn Terms

The function b2x converts from the usual de Bruijn representation to the compressed one. It proceeds by case analysis on Vb, Ab, Lb and counts the binders Lb as it descends toward the leaves of the tree. Its steps are controlled by the successor function s that increments the counts when crossing a binder.

figure f

2.4 Converting from Compressed de Bruijn to de Bruijn Terms

The function x2b converts from the compressed to the usual de Bruijn representation. It reverses the effect of b2x by expanding the k in V k n and A k x y into k Lb binders (no binders when k=0). The function iterLam performs this operation in both cases, and the predecessor function s’ computes the decrements at each step.

figure g
figure h

Proposition 1

The functions b2x and x2b, having as domains and range open terms, are inverses.

Example 2

The conversion between types B and X of the combinator \( Y = ~\lambda x_0.(\lambda x_1.(x_0~(x_1~x_1)) ~ ~\lambda x_2.(x_0~(x_2~x_2)) ) \).

figure i

This bijection allows borrowing algorithms between the two representations. The function isClosedB tests if a term in de Bruijn notation is closed.

figure j

3 Ranking and Unranking as a Catalan Embedding of Compressed de Bruijn Terms

We will derive an encoding of the compressed de Bruijn terms into objects of type Cat, such that the binary tree instance of type Cat is size-proportionate with the encoded term.

The intuition behind the algorithm is that we want leaf nodes of the lambda term to encode into leaves or small trees close to the leaves and application nodes to encode into internal nodes of the binary tree, as much as possible.

3.1 Ranking Compressed de Bruijn Terms

The function x2t implements such an encoding.

figure k

Note that leaves Vx k x are encoded either as empty leaves of the binary tree or as subtrees with the right branch an empty leaf. To ensure the encoding is bijective, we will need to decrement the result of the constructor c twice in the second rule, with the predecessor function s’ to ensure that this case leaves no gaps in the range of the function x2t. For application nodes Ax k a b we recurse on nodes a and b and then we put the branches together with the constructor c. When c=C or c=M, this results in a tree of a size proportionate to the compressed de Bruijn term.

3.2 Unranking Compressed de Bruijn Terms

The decoding function t2x reverses the steps of the encoder x2t.

Case analysis on the right branch of the binary tree will tell if it is a leaf node or an internal node of the lambda tree, in which case the increment in x2t, needed for bijectivity, is reversed by applying the successor function s twice before applying the deconstructor c’.

figure l

Proposition 2

The functions t2x and x2t, converting between open compressed de Bruijn terms and corresponding instances of Cat, are inverses.

Example 3

The work of t2x and x2t on Cat instance N.

figure m

Note however that when using the instance N of Cat which implies the usual binary number representation, the encoding is, as expected, not size proportionate.

This precludes the use of the usual random number generators returning integers in binary notation to generate very large random lambda terms. We will circumvent this problem by using instead an algorithm that (uniformly) generates random binary trees (see Sect. 6).

We define the unranking function t2b and the ranking function b2t for de Bruijn terms, as follows.

figure n

Proposition 3

The functions t2b and b2t converting between open de Bruijn terms and corresponding instances of Cat, are inverses.

Example 4

The encoding and decoding of the de Bruijn form of the pairing combinator \( ~\lambda x_0. ~\lambda x_1. ~\lambda x_2.((x_2~x_0)~x_1) \) to ordinary binary numbers and binary trees.

figure o

To facilitate comparison, it is useful to define the functions sizeT that returns the number of internal nodes of the binary tree view of a Catalan object and sizeX returning the size of a lambda term in compressed de Bruijn form, in which numeric components k and n are also measured with sizeT.

figure p

Example 5

The sum of the two sizes on an initial segment of \(\mathbb {N}\) illustrates the fact that the bijection t2x is indeed size-proportionate.

figure q

Proposition 4

The average time complexity of t2x and x2t is O(n) for input size n and their worst case time complexity is \(O(n~log^{*}(n))\) when working on instance T (binary trees).

Proof

It follows from the fact that the average complexity of c, c’ s and s’ is constant time, the worst case complexity of s and s’ is \(O(log^{*}(n))\) and that O(n) of these are performed by t2x and x2t.

3.3 Conversion to/from a Canonical Representation of Lambda Terms with Integer Variable Names

We represent standard lambda terms [2] by using the constructors Ls for lambda abstractions, As for applications and Vs for variable occurrences.

figure r

The function b2s converts from the de Bruijn representation to lambda terms whose canonical names are provided by natural numbers. We will call them terms in standard notation.

figure s

Note the use of the helper function at that associates to an index i a variable in position i on the list vs. As we initialize in b2s when calling helper function f the list of index variables to [], we enforce that only closed terms (having no free variables) are accepted.

The inverse transformation is defined by the function s2b.

figure t

Note again the use of at, this time to locate the index i on the list of variables vs. By initializing vs with [] in the call to helper function f, we enforce that only closed terms are accepted.

Proposition 5

The functions s2b and b2s, converting between closed de Bruijn terms and closed standard terms, are inverses.

Example 6

The bijection defined by the functions s2b and b2s on the term \(~\lambda x_0.(\lambda x_1.(x_0~(x_1~x_1)) ~ \lambda x_2.(x_0~(x_2~x_2)) )\).

figure u

4 Normalization with Tree-Based Arithmetic Operations

We will now describe an evaluation mechanism for the (Turing-complete) language of closed lambda terms, called normal order reduction [19]. A mapping between de Bruijn terms and a new data type that mimics standard lambda terms, except for representing binders as functions in the underlying implementation language, will be used both ways to evaluate and then return the result as a de Bruijn term.

4.1 Representing Lambdas as Functions in the Implementation Language

The data type H represents leaves Vh of the lambda tree and applications Ah the same way as the standard lambda terms of type S. However, Lambda binders, meant to be substituted with terms during \(\beta \)-reduction steps are represented as functions from the domain H to itself.

figure v

4.2 A HOAS-style Normal Order Reducer

Normal order evaluation [19] ensures that if a normal form exists, it is found after a finite number of steps. In lambda-calculus based functional languages computing a normal form, normalization can be achieved through a HOAS (Higher-Order Abstract Syntax) mechanism, that borrows the substitution operation from the underlying “meta-language”. To this end, lambdas are implemented as functions which get executed (usually lazily) when substitutions occur. We refer to [18] for the original description of this mechanism, widely used these days for implementing embedded domain specific languages and proof assistants in languages like Haskell or ML.

The function nf implements normalization of a term of type H, derived from a closed de Bruijn term. At each normalization step, when encountering a binder of the form Lh f, the normalizer nf traverses it and it is composed with f. At each application step Ah f a, if the left branch is a lambda, it is applied to the reduced form of the right branch, as implemented by the helper function h. Otherwise, the application node is left unchanged.

figure w

The result of implementing lambdas as functions is that we not only borrow substitutions from the underlying Haskell system but also the underlying (normal) order of evaluation.

4.3 Closed Terms to/from HOAS

To implement conversion from the type H to the type B the function h2b traverses the application nodes. As in the case of our other transformers, the (simple) numerical computations involved in the transformations will be performed using the arithmetic on Catalan objects of type Cat.

figure x
figure y

Example 7

Testing that h2b is a left inverse of h2b.

figure z

While so called “exotic terms” are possible in the data type H to which no terms of type B correspond, the terms brought to the H side by b2h and back by h2b are identical.

4.4 Evaluating Closed Lambda Terms

As our normal order reduction is borrowed via a HOAS mechanism from the underlying Haskell system, evaluation is restricted to closed terms. Instead of getting help form a Maybe type, it is simpler to define its result as the trivial open term Vb e for all open terms.

We obtain a normal order reducer for de Bruijn terms by wrapping up nf with the transformers b2h and h2b.

figure aa

We can then lend the evaluator also to compressed de Bruijn terms.

figure ab

Example 8

Reduction to the identity \(I = ~\lambda x_0.x_0 \) of \(SKK = (( ~\lambda x_0. ~\lambda x_1. ~\lambda x_2. ((x_0~x_2)~ (x_1~x_2)) ~ ~\lambda x_3. ~\lambda x_4.x_3 )~ ~\lambda x_5. ~\lambda x_6.x_5 )\) in compressed de Bruijn notation.

figure ac

4.5 Catalan Objects as Lambda Terms

Given the bijection between instances of the Catalan family, we can go one step further and extend the evaluator to binary trees.

figure ad

As we have also made the usual natural numbers members of the Catalan family, we can define normal order reduction of such “arithmetized” lambda terms as the arithmetic function evalN.

figure ae

Example 9

Evaluation of binary trees and natural numbers seen as lambda terms.

figure af

As evaluation happens in a Turing-complete language, these functions are not total. For instance, evalN 318, corresponding to the lambda term \(\omega = (\lambda x.(x ~x)) (\lambda x.(x ~x))\), is non-terminating.

5 Generation of Catalan Objects and Lambda Terms

Given the size-proportionate bijection between open lambda terms and Catalan objects, we can use generators for the later to generate the former.

5.1 A Generator for Catalan Objects

The function genCat implements a simple generator for Catalan objects with a fixed number of internal nodes. Note that computations are expressed in terms of the arithmetic operations on type Cat. It uses the function nums (see [22]) that generates an initial segment of the set of natural numbers as a Haskell list.

figure ag

Example 10

Generation of Catalan object with 3 internal nodes and their natural number encodings.

figure ah

Given that closed terms have interesting uses in random testing [8], we derive generators for them in compressed de Bruijn and de Bruijn form.

figure ai

Example 11

Generation of closed compressed de Bruijn terms decoded from binary trees with 3 internal nodes.

figure aj

5.2 Generation of Lambda Terms via Unranking

While direct enumeration of terms constrained by number of nodes or depth is straightforward in Haskell an unranking algorithm is also usable for generation of large terms, including generation of very large random terms.

Generating Open Terms in Compressed de Bruijn Form. Open terms are generated simply by iterating over an initial segment of \(\mathbb {N}\) with the function t2x.

figure ak

Reusing unranking-based open term generators for more constrained families of lambda terms works when their asymptotic density is relatively high. Fortunately we know from the extensive quantitative analysis available in the literature [6, 7, 11] when this is the case.

The function genClosedX generates closed terms by filtering the results of genOpenX with the predicate isClosedX.

figure al

Example 12

Generation of closed compressed de Bruijn terms. Note the more than 50 % closed terms among the first 10000 open terms.

figure am

6 Random Generation of Lambda Terms

As the ranking bijection of the compressed de Bruijn lambda terms maps them to Catalan objects, we can use unranking of uniformly generated random binary trees to generate random terms.

6.1 Generating Random Binary Trees

We will rely on the Haskell library Math.Combinat.Trees to generate binary trees uniformly, using a variant of Rémy’s algorithm described in [14], as well as Haskell’s built-in random generator from package System.Random. This will allow generation of random lambda terms corresponding to super-exponentially sized numbers of type N, but size-proportionate when natural numbers are represented by the binary trees of type T.

The function ranCat is parametrized by the function tf that picks a type for a leaf among the instances of Cat, to be propagated as the type of tree, as well as the size of the tree and the random generator g.

figure an

The function ranCat1 allows getting a random tree of a given size and type, by giving a seed that initializes the random generator g.

figure ao

6.2 Generating Random Compressed de Bruijn Terms

We will use the bijection t2x from Catalan objects to open compressed de Bruijn trees, parameterized by the function tf that picks the type of the instance of Cat to be used.

The function ranOpenX generates random terms in a way similar to the function ranCat.

figure ap

The function ranOpen1X generates random terms given a seed for the random generator.

figure aq

The function ranClosedX filters the generated terms until a closed one is found.

figure ar

The function ranClosed1X works in a similar way, except for providing a seed instead of a random generator.

figure as

Example 13

Generation of some random lambda terms (including very large ones) in compressed de Bruijn form.

figure at

7 Related Work

Originally introduced in [4], the de Bruijn notation makes terms equivalent up to \(\alpha \)-conversion and facilitates their normalization [13]. As indices replace variable names by their stack-order relative positioning to their binders, they are already more compact than standard lambda terms. However as iteration of their lambda binders can be seen as a form of unary Peano arithmetic, it made sense to further compress them by counting the binders more efficiently. This mechanism, is first described in [24] where in combination with the generalized Cantor bijection between \(\mathbb {N}^{k} \rightarrow \mathbb {N}\) it is used to provide a bijective Gödel numbering scheme. However, as the \(\mathbb {N}\rightarrow \mathbb {N}^{{k}}\) side of this bijection is only computable using a binary search algorithm, it is limited to relatively small terms, by contrast to the one described in this paper that works in time proportional to the size of both the terms and their tree-based number encodings.

In [23] a binary tree-arithmetic encoding is introduced, that can be seen as an instance of the generic Catalan-object arithmetic used in this paper. However, as it describes computations in terms of ordinary arithmetic, it is size-proportionate only under the assumption that variables fit in word-represented integers.

Combinatorics of lambda terms, including enumeration, random generation and asymptotic behavior has seen an increased interest recently (see for instance [3, 6, 7, 10, 11]), partly motivated by applications to software testing [8, 17] given the widespread use of lambda terms as an intermediate language in compilers for functional languages and proof assistants.

Ranking and unranking of lambda terms can be seen as a building block for bijective serialization of practical data types [15, 26] as well as for Gödel numbering schemes of theoretical relevance. In fact, ranking functions for sequences can be traced back to Gödel numberings [9, 12] associated to formulas.

8 Conclusions and Future Work

We have provided a fresh look at several aspects of the representation and encoding of lambda terms with focus on their de Bruijn form and a compressed variant of it. Our computations have used a type class defining generic arithmetic operations on members of the Catalan family of combinatorial objects, described in detail in [25]. They have served to implement bijections between representations of terms and conversion to/from HOAS-like form used for normalization of lambda terms. Some interesting synergies have been triggered by this combination of apparently heterogeneous techniques:

  • we have provided a simple size-proportionate bijective encoding of compressed De Bruijn terms to our tree-based “natural numbers”

  • the same “natural numbers” (actually operated on through a binary tree perspective on Catalan objects), have served to do routine arithmetic operations, with average complexity comparable to the usual binary numbers

  • the use of tree-based numbers, a target for ranking/unranking of lambda terms and a uniform random generation algorithm for binary trees, have enabled generation of (possibly very large) random open lambda terms

Note also that our size-proportional encodings as arithmetic-endowed Catalan objects can be easily adapted to other tree representations widely used in computer science and computational sciences, like expression trees, recursive data types, directory structures, parse trees for programming and natural languages, phylogenetic trees, etc.

We have not approached yet some of the remaining hard problems related to uniform random generation of more “realistic” lambda terms appearing in compilers and proof assistants e.g. well-typed and closed terms, for which no linear-time algorithms are known. The techniques, involving binary search, based on ordinary binary numbers for either term algebras in [21] or closed lambda terms in [11] will require more work to adapt to possibly linear algorithms based on our size-proportionate encodings in a tree-based numbering system. Also, an empirical study of the shapes and distribution of frequent lambda term patterns appearing in written and generated code is likely to be useful to fine-tune ranking/unranking algorithms better suited for random generation of such terms.