
1 Introduction

The growth and tremendous success of deep learning has catapulted us past many benchmarks of artificial intelligence. Reaching human and superhuman performance in object recognition, language generation and translation, and complex games such as Go and StarCraft has steadily pushed back the boundary between what humans can do and what machines cannot [7, 12, 13, 16, 19, 22]. To continue to make progress, we must identify and work towards reducing the gaps between human and machine intelligence.

The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in 2019, captures an important aspect of human intelligence that our current systems lack: the ability to systematically and flexibly generalize to new domains [6]. Chollet argues that intelligence must be measured not as skill in a particular task, but as skill-acquisition efficiency. General intelligent systems must also have developer-aware generalization, i.e. be able to solve problems the developer of the system has not encountered before or anticipated.

ARC consists of training, evaluation, and private test sets of 400, 400, and 200 tasks. Each task consists of 2–4 training examples and one or more test examples. Each training example is an input/output pair of grids. To solve a task, an agent must determine the relationship between input and output grids in the training examples, and use this to produce the correct output grid for each of the test examples, for which the agent is only given the input grid. Each task is thus a few-shot learning problem, for which the solution is symbolic and rule-based (Fig. 1).
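For concreteness, the minimal sketch below loads one task, assuming the JSON layout used by the public ARC repository (a "train" list and a "test" list of input/output pairs, each grid a 2D list of color indices 0–9); the file path and name are purely illustrative.

```python
import json

# Layout assumed from the public ARC repository; the path below is illustrative.
with open("data/training/0a938d79.json") as f:
    task = json.load(f)

for example in task["train"]:              # the few demonstration pairs
    inp, out = example["input"], example["output"]
    print(len(inp), "x", len(inp[0]), "->", len(out), "x", len(out[0]))

test_input = task["test"][0]["input"]      # the agent must predict its output grid
```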

Fig. 1. An example ARC task with three training examples and one test example. The solution might be described as “find the most common object in the input grid”.

The tasks are unique and constructed by hand so as to prevent the reverse engineering of any synthetic generation process. They are designed to depend on a set of human Core Knowledge inbuilt priors such as objectness, simple arithmetic abilities, symmetry, and goal-directedness. The evaluation and private test sets are designed such that a solution tailored to the training set is unlikely to transfer to the evaluation or test sets. Chollet hosted a Kaggle competition for ARC, and the winning solution, a hard-coded brute-force approach, achieved only \(\sim \)20% performance on the private test set [2].

In this paper we report incremental progress on ARC and lay the foundation for several approaches to abstraction and reasoning that are not based on brute-force search. We approach ARC as a program synthesis benchmark, solving tasks by writing programs that convert input grids to output grids. In Sect. 2 we outline an approach to abstraction by applying DreamCoder [11]. We show this approach enables learning new concepts that aid generalization and allow progressively more challenging tasks to be solved. In Sect. 3 we describe a novel program synthesis approach, motivated by the way humans approach ARC, that captures the reasoning required to search for ARC task solutions. Our algorithm constructs a search graph and reasons over this graph structure to discover task solutions. More specifically, we extend existing execution-guided program synthesis approaches [10, 25] with deductive reasoning based on function inverse semantics [18] to enable a neural-guided bidirectional search algorithm. We evaluate our approach on three domains: ARC tasks, ‘24 Game’ problems, and a simple ‘double-and-add’ challenge. These experiments show the benefits of bidirectional search over baselines and the potential for further progress on ARC. In Sect. 4 we discuss related work, progress on ARC, and future directions.

2 Abstraction Using DreamCoder

We frame the problem as a search over the space of programs expressible in some domain-specific language (DSL). One way a learning agent can achieve developer-aware generalization (in the sense of [6]) is to identify frequently occurring patterns of computation and form abstractions from them. These abstractions enable searching for more complex programs more quickly.
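As a minimal, purely illustrative sketch of this framing (and not the system described below), consider enumerating compositions of unary grid primitives and accepting the first program consistent with every training example; the restriction to unary primitives is an assumption made only to keep the sketch short.

```python
def run(program, grid):
    """Apply a sequence of unary primitives left to right."""
    for f in program:
        grid = f(grid)
    return grid

def enumerate_programs(primitives, max_depth):
    """Yield every composition of primitives up to max_depth, shortest first."""
    layer = [[p] for p in primitives]
    for _ in range(max_depth):
        yield from layer
        layer = [prog + [p] for prog in layer for p in primitives]

def solve(task, primitives, max_depth=4):
    """Return the first program consistent with every training example."""
    for prog in enumerate_programs(primitives, max_depth):
        if all(run(prog, ex["input"]) == ex["output"] for ex in task["train"]):
            return prog
    return None
```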

In this section we use DreamCoder [11], a recent tool for program synthesis, to form abstractions. We first show how DreamCoder’s compression algorithm enables learning generalizations of concepts seen in training. Second, we run DreamCoder on ARC to show how forming new abstractions enables the agent to solve progressively more challenging tasks.

2.1 Warmup: Forming Abstractions

To show how DreamCoder can form more abstract concepts from existing ones, we supply our agent with six synthetic tasks (meant to be similar to ARC tasks): drawing a line in three different directions, and moving an object in three different directions. See Fig. 2 for a visualization of these tasks.

We solve these tasks with four primitives: rotate clockwise and counterclockwise, draw a line down, and move an object down. The programs synthesized are the following:

(Listing a: the synthesized programs.)

After running the compression algorithm, the agent creates the following new abstractions:

(Listing b: the abstractions created by compression.)

Importantly, the abstractions formed are more general than the original primitives given. This can help enable systematic generalization on further tasks.
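The listings above are not reproduced here; as a hedged illustration of the kind of abstraction compression can produce, the sketch below (numpy stand-ins for the primitives, with the rotation convention an assumption on our part) wraps a “down” action so that it acts to the left, covering both the line-drawing and object-moving tasks.

```python
import numpy as np

# Warmup primitives as numpy stand-ins (orientation conventions assumed).
def rotate_cw(g):  return np.rot90(g, k=-1)
def rotate_ccw(g): return np.rot90(g, k=1)
# draw_line_down and move_down are the other two primitives; their
# implementations are omitted here.

# A compression-style abstraction: apply a "down" action inside a rotated
# frame so that it acts to the left instead. This is one plausible shape,
# up to the choice of rotation direction.
def apply_leftward(down_action, grid):
    return rotate_cw(down_action(rotate_ccw(grid)))

# e.g. draw_line_left = lambda g: apply_leftward(draw_line_down, g)
#      move_left      = lambda g: apply_leftward(move_down, g)
```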

Fig. 2. Sample tasks involving applying an action to the left.

2.2 Enabling Generalization on ARC Symmetry Tasks

In a second experiment, we demonstrate how compression-based learning enables developer-aware generalization on ARC. We provide DreamCoder with a set of five grid-manipulation operations: flipping vertically with vertical_flip, rotating clockwise with rotate_cw, overlaying two grids with overlay, stacking two grids vertically with vertical_stack, and getting the left half of a grid with left_half. We then train our agent on a subset of 36 ARC tasks involving symmetry over five iterations of enumeration and compression. During each iteration, our agent attempts to solve all 36 tasks by enumerating possible programs for each task. It then runs compression to create new abstractions. During the next iteration, the agent repeats its search equipped with the new abstractions. In this experiment, our agent initially solves 16 tasks. After one iteration, it solves 17 in the same amount of time. After another, it solves 19 tasks, and after the final iteration, it solves 22 tasks. Table 1 shows some of the new abstractions learned by DreamCoder’s compression algorithm, such as flipping horizontally and stacking grids horizontally. The program solutions for the final tasks solved, shown in Fig. 3, could not feasibly be discovered without the use of abstractions to reduce the search time.

Table 1. Useful actions learned in the process of solving symmetry tasks. Pound signs represent abstractions. Abstractions may rely on others for construction; e.g. to stack grids horizontally, we reflect each input diagonally, stack vertically, and reflect the vertical stack diagonally.
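As an illustration of how such abstractions can be expressed in terms of the given primitives, the sketch below uses numpy stand-ins for vertical_flip, rotate_cw, and vertical_stack (the orientation conventions are assumptions on our part) and follows the construction described in the Table 1 caption.

```python
import numpy as np

# numpy stand-ins for three of the given primitives (conventions assumed)
def vertical_flip(g):     return np.flipud(g)                   # flip top-to-bottom
def rotate_cw(g):         return np.rot90(g, k=-1)              # 90 degrees clockwise
def vertical_stack(a, b): return np.concatenate([a, b], axis=0)

# Abstractions in the spirit of Table 1, built only from the primitives above.
def diagonal_flip(g):     # reflect over the main diagonal
    return rotate_cw(vertical_flip(g))

def horizontal_flip(g):   # flip left-to-right
    return rotate_cw(rotate_cw(vertical_flip(g)))

def horizontal_stack(a, b):  # reflect diagonally, stack vertically, reflect back
    return diagonal_flip(vertical_stack(diagonal_flip(a), diagonal_flip(b)))
```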

2.3 Discussion

It is useful to compare the learning done in our approach to that done by neural networks. Neural networks can also learn new concepts from training examples, but their internal representations lack the structure needed to apply learned concepts compositionally to other tasks. In contrast, functions learned via compression, represented as programs, can naturally be composed and extended to solve harder tasks, while reusing concepts between tasks. This constitutes a learning paradigm which we view as essential to human-like reasoning.

There is a caveat to the approach shown here: the abstraction above relies on a simple enumerative search. DreamCoder uses a form of neural-guided program synthesis, predicting a distribution over functions to search over, but this guidance is too weak to scale to the complexity of ARC tasks. In the next section, we characterize the type of reasoning required for ARC and design an approach that exhibits it.

Fig. 3. One of the four-way mirroring tasks and the program discovered that solves it, written in terms of the original primitives. The program was discovered only after four iterations of enumeration and compression.

3 Bidirectional, Neural-Guided Program Search

In Subsect. 3.1 we first motivate and describe our bidirectional, neural-guided search algorithm. Then in Subsect. 3.2 we present experiments and results using this approach.

3.1 Algorithm Description

In this section we describe our reasoning approach for ARC. We first give a motivating example of human reasoning on ARC, explain how to approximate it with execution-guided synthesis, then incorporate inverse semantics to create a bidirectional, neural-guided search algorithm.

Motivating Example. Solving ARC tasks fundamentally consists of a search for valid solutions. To make this search tractable, our agent needs the ability to reason towards solutions. ARC tasks feature rich visual cues that guide humans towards solutions; without enabling our agent to take full advantage of these cues, the search over possible programs becomes impossibly large. Discovering the solution to an ARC task often involves several discrete steps of reasoning. How can we design a search procedure that looks for ARC solutions in the same manner humans do?

Fig. 4. Task 303.

As a motivating example, let us consider solving task 303 in Fig. 4. The reasoning steps to come to a solution might look something like this:

1. Notice that the output grid consists of copies of the \(3\times 3\) input grid, arranged in a particular pattern within a \(9\times 9\) grid.

2. New question: Where should we place the input grid copies?

3. Notice that the placements match the arrangement of one color’s pixels in each input (a different color in each example). For example, in the first example, the diagonal of grids in the output matches the green pixels in the input.

4. New question: Which color should we arrange our grid copies along?

5. Solution: The color matched is the most common color in the grid.

Notice how discovering a solution involves combining sequential insights and problem reductions. Our systematization of this kind of reasoning for ARC is based on a combination of execution-guided program synthesis and inverse semantics.

Extending Execution-Guided Synthesis. Execution-guided program synthesis [5, 10] is a form of program synthesis in which one executes partial programs to produce intermediate outputs, which are used to guide the construction of the full program. Intermediate evaluations provide the opportunity for step-by-step reasoning: instead of coming to the answer all at once, one can construct it piece by piece. Humans arguably rely on the same strategy: for instance, it is much easier to write out the result of a multiplication digit by digit than to conduct the full calculation in one’s head. The form of execution-guided synthesis we apply to ARC is most similar to the ‘REPL’ approach of [10]. An example applying the technique to ARC is shown in Fig. 5.

Fig. 5. Solving ARC task 138 from the evaluation set with execution-guided synthesis. Conditioned on the input and output grids, the agent chooses to flip the input horizontally in step one. This action is executed to produce intermediate value i1. Next, the agent chooses to horizontally stack the intermediate value with the input grid, producing another value i2. Last, the agent horizontally stacks this value i2 with itself, correctly producing the output grid for each example and solving the task.
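The trace in Fig. 5 can be replayed directly; the sketch below does so on a single training example, using numpy stand-ins (the helper names are ours) for the two DSL operations involved.

```python
import numpy as np

def horizontal_flip(g):      return np.fliplr(g)
def horizontal_stack(a, b):  return np.concatenate([a, b], axis=1)

def replays_fig5_trace(example):
    """Replay the three actions from Fig. 5 on one training example."""
    x = np.array(example["input"])
    i1 = horizontal_flip(x)           # step 1: flip the input horizontally
    i2 = horizontal_stack(i1, x)      # step 2: stack i1 with the input
    out = horizontal_stack(i2, i2)    # step 3: stack i2 with itself
    return np.array_equal(out, np.array(example["output"]))
```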

Existing execution-based synthesis approaches are limited to bottom-up enumeration: the leaves of the program are constructed (and evaluated) first. In contrast, the steps for solving task 303 involve proposing a function that produces the output grid and then deducing, as new intermediate targets, the inputs that function would need, before the complete program is discovered.

This form of deductive reasoning involves evaluating functions in reverse. It is best exemplified by the FlashMeta system [18], which leverages the inverse semantics of operators to deduce the remaining inputs of a function given its output target and, possibly, some of its inputs. We incorporate this type of reasoning into an extension of execution-guided program synthesis.

Deductive Reasoning via Inverse Semantics. For our purposes, we can consider two cases. The simplest case is when the function is invertible. In this case, we can evaluate the inverse to produce two new targets for the search, as shown in Fig. 6. In the second case, the function is conditionally invertible: given the output and one or more inputs to a function, one can deduce the remaining inputs needed to produce the output via this function. Many functions are conditionally invertible; perhaps the most familiar family is arithmetic operators: if we know \(1 + x = 5\), we can deduce that \(x = 4\). An example relevant to ARC is shown in Fig. 6. Using conditional inverses, it is possible to formalize the reasoning described for task 303.

Fig. 6. Left: the function block is directly invertible: given the output, we can deduce the inputs. Right: the function horizontal_stack is conditionally invertible: given the output and one input, we can deduce the other input.
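A sketch of the two cases is shown below, using rotation in place of the block function from the figure and numpy stand-ins for the grid operations (names and conventions are ours).

```python
import numpy as np

def rotate_cw(g):            return np.rot90(g, k=-1)
def rotate_ccw(g):           return np.rot90(g, k=1)
def horizontal_stack(a, b):  return np.concatenate([a, b], axis=1)

# Direct inverse: for an invertible operation such as rotate_cw, the inverse
# turns an ungrounded output into a single new ungrounded target.
def rotate_cw_inverse(output):
    return rotate_ccw(output)

# Conditional inverse: given the output of horizontal_stack and its left
# operand, deduce the right operand (or fail if the left operand does not
# actually appear on the left of the output).
def horizontal_stack_cond_inverse(output, left):
    h, w = left.shape
    if output.shape[0] != h or output.shape[1] <= w:
        return None
    if not np.array_equal(output[:, :w], left):
        return None
    return output[:, w:]
```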

Bidirectional, Neural-Guided Program Search. To extend execution-guided synthesis to a bidirectional algorithm with inverse and conditional inverse functions, we extend the environment of [10], approaching the synthesis task via reinforcement learning.

The setup is a Markov decision process (MDP). The current state is a graph of nodes. Each node represents an input value, the output value, or an intermediate value resulting from the execution of an operation in the forward or backward direction. A node is grounded if there is a valid program that creates it from the operations applied so far. In general, grounded nodes arise from the forward (bottom-up) direction of the search, while ungrounded nodes arise from the backward (top-down) direction.

An operation is a function from the grammar together with a designation of how it is applied: forward, as an inverse, or as a conditional inverse (and, if as a conditional inverse, which input arguments it is conditioned on). There are thus three types of operations: forward operations, inverse operations, and conditional inverse operations. A forward operation applies the function to a set of grounded inputs to produce a new grounded node. An inverse operation takes an ungrounded output and produces a new ungrounded target node such that grounding the target node will cause the output node to be grounded as well. A conditional inverse operation takes an ungrounded output and one or more grounded input nodes, and produces a new ungrounded target node such that grounding the target node will cause the output node to be grounded as well. All inverse and conditional inverse operations have a corresponding forward operation.

Solving a given task thus consists of an episode in the MDP. Actions in the MDP correspond to a choice of operation and the choice of arguments for that operation. Each action applies a function in either the forward or backward direction. Intuitively, this executes a bidirectional search to try to connect the grounded nodes on one side with the ungrounded output node on the other. We give reward R for solving the task and a penalty of \(-1\) for choosing an action corresponding to an invalid operation.
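The following is a schematic sketch, our own rather than the authors' implementation, of the node and operation types just described; values, signatures, and the termination check are simplified.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    value: object        # e.g. the grid(s) this node evaluates to
    grounded: bool       # True if buildable from the inputs by forward ops

def forward_op(fn: Callable, args: List[Node]) -> Node:
    """Forward operation: grounded inputs -> new grounded node."""
    assert all(a.grounded for a in args)
    return Node(fn(*[a.value for a in args]), grounded=True)

def inverse_op(inv: Callable, output: Node) -> List[Node]:
    """Inverse operation: ungrounded output -> new ungrounded target(s);
    grounding all targets grounds the output."""
    assert not output.grounded
    return [Node(v, grounded=False) for v in inv(output.value)]

def cond_inverse_op(cond_inv: Callable, output: Node,
                    known: List[Node]) -> Optional[Node]:
    """Conditional inverse: ungrounded output plus grounded inputs ->
    one new ungrounded target for the remaining input."""
    assert not output.grounded and all(a.grounded for a in known)
    v = cond_inv(output.value, *[a.value for a in known])
    return None if v is None else Node(v, grounded=False)

# An episode ends with reward R once the output node becomes grounded, i.e.
# some grounded value matches an ungrounded target; invalid operations get -1.
```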

Like [5, 10, 25], we train with a combination of supervised training on randomly generated programs and fine-tuning with the reinforcement learning algorithm Reinforce. To generate random bidirectional programs for supervised training, we first create a random program and construct an execution trace for it by probabilistically inverting function applications from the root. The network architecture is the same as in [10], with a task-dependent embedding network for nodes of the bidirectional graph, a DeepSet network [24] to encode the graph into a single embedding and choose a function to apply, and a pointer network [23] for choosing function arguments.

3.2 Experiments

We evaluate our bidirectional algorithm in three settings: solving ARC symmetry tasks, solving arithmetic puzzles from the ‘24 Game’ family, and solving ‘double-and-add’ puzzles. As a baseline, we compare bidirectional synthesis with a forward-only variant which, like existing approaches, only applies operations in the forward direction.

ARC Symmetry Tasks. As a proof of concept, we evaluate the bidirectional algorithm on a set of 18 ARC symmetry tasks, a subset of those used in Sect. 2. We use a DSL of six operations: stacking two grids horizontally or vertically, rotating clockwise or counterclockwise, and flipping a grid horizontally or vertically. The rotation and flip functions are directly invertible, while the stacking operations are conditionally invertible. We use a convolutional neural network to embed grid example sets. We train on a set of randomly generated programs evaluated on random input grids from the ARC training set, and fine-tune with Reinforce before sampling rollouts for thirty minutes on all tasks at once. The agent is able to solve 14 of 18 tasks, including one of the “four-way mirror” tasks. In this experiment, bidirectional search performed on par with the forward-only baseline.

24 Game. Next, we compare the performance of bidirectional search with the forward-only baseline by tasking our agent with solving “24 Game” problems. A 24 Game consists of four input numbers between one and nine. To solve the task, one must use each number exactly once in an expression that evaluates to twenty-four using \(+, -, \times , \div \). For example, given 8, 1, 3, and 2, a solution is \(24 = (2 - 1) \times 3 \times 8\). To solve these tasks bidirectionally, we use the conditional inverse of each arithmetic operator in addition to the forward arithmetic operations.
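As a sketch of these conditional inverses (the function names are ours, and integrality and range checks are omitted), the comments below work backwards from 24 with the example inputs from the text.

```python
# Conditional inverses for the four operators; "known" is the already-grounded
# operand and the return value is the new target for the remaining operand.
def cond_inverse_add(target, known):             return target - known   # known + x = target
def cond_inverse_mul(target, known):             return target / known if known else None  # known * x = target
def cond_inverse_sub_known_left(target, known):  return known - target   # known - x = target
def cond_inverse_sub_known_right(target, known): return target + known   # x - known = target
def cond_inverse_div_known_left(target, known):  return known / target if target else None  # known / x = target
def cond_inverse_div_known_right(target, known): return target * known   # x / known = target

# Working backwards from 24 with the inputs 8, 1, 3, 2 from the text:
# propose 24 = x * 8  ->  x = cond_inverse_mul(24, 8) = 3   (new target: 3)
# propose  3 = y * 3  ->  y = cond_inverse_mul(3, 3)  = 1   (new target: 1)
# propose  1 = 2 - 1                                         (grounded)
# giving 24 = ((2 - 1) * 3) * 8, the solution quoted above.
```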

First we conduct supervised pretraining on all depths at once. These programs may create any number as a target, not just 24, with a maximum allowed integer of 100 and no negative or nonintegral numbers. We then fine-tune on different depths with Reinforce for 10,000 epochs with a batch size of 1,000. We measure performance by the percent of episodes solved in the last 1,000 epochs of training.

Results are shown in Table 2. Bidirectional synthesis outperforms the forward-only baseline across all depths. This supports our thesis, but is also suspicious: we would expect identical accuracy on depth-one tasks, where only a single action is needed. Accuracy remains fairly high as depth increases because depth does not necessarily imply program length: as many as 40% of depth-four tasks remain solvable in fewer than four actions.

Double-and-add. Last, we include results on a ‘double-and-add’ task to better show the advantage of bidirectional search. Given a target number, one must reach it starting from the number two by repeatedly adding one or doubling. For example, \(7 = 1 + 2 \times (1 + 2)\). This task, akin to exponentiation by repeated squaring, is much more easily solved in a top-down fashion: the choice between adding one and doubling boils down to whether the target is even or odd. Here we have two forward operations, each of which is directly invertible. On a training set of five thousand numbers sampled between one and five million, and a held-out set of five hundred numbers, our bidirectional model achieves 100% evaluation accuracy after a single epoch of supervised training. In contrast, the forward-only model fails to solve the held-out tasks, due to the difficulty of “seeing” the solution from the source (see Fig. 7).
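The top-down deduction described above is easy to make concrete; the sketch below (ours, not the trained agent) recovers a forward program for the example from the text purely from the parity of the running target.

```python
def double_and_add_program(target):
    """Recover a forward program (starting from 2) by working backwards:
    an even target must have come from 'double', an odd one from 'add_one'."""
    assert target >= 2
    steps = []
    while target > 2:
        if target % 2 == 0:
            steps.append("double")
            target //= 2
        else:
            steps.append("add_one")
            target -= 1
    return list(reversed(steps))  # actions applied forward from 2

# double_and_add_program(7) == ["add_one", "double", "add_one"]
# i.e. 2 -> 3 -> 6 -> 7, matching 7 = 1 + 2 * (1 + 2) from the text.
```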

Table 2. Percent of tasks solved for 24 Game, measured by percent of episodes solved in the last 1,000 epochs of RL fine-tuning. Forward-only denotes only using forward operations. Bidirectional includes conditional-inverse operations. Average over three runs with standard deviation shown.

Fig. 7. Percent of tasks solved for bidirectional and forward-only agents trained on the double-and-add task. The bidirectional agent achieves 100% accuracy after a single epoch of training. After fifty epochs of training, the forward-only agent converges without solving the held-out tasks.

4 Discussion

Related Work. Our work builds on and is inspired by a long line of progress in neural program synthesis [3, 4, 8, 17, 21], execution-guided synthesis [5, 10, 26], and deep reinforcement learning for search [15, 19].

Bidirectional, neural-guided program search is made possible primarily due to the inverse semantics of FlashMeta [18]. The concept of bidirectional programming, inverse semantics, and program inversion has been present throughout the history of program synthesis [9, 14, 20], but the way in which inverse evaluation is used here is most similar to FlashMeta.

ARC. To date, no learning-based approach to ARC has proven more successful than the Kaggle-winning brute-force approach [2]. Other Kaggle approaches include genetic programming and cellular automata, but all essentially rely on brute-force search over a DSL of operations combined with ARC-specific tricks, without any substantial learning [1]. The few-shot nature and large search space of ARC make it a very challenging benchmark, and progress in scaling program synthesis algorithms is likely needed. We hope the work reported here inspires and enables further progress on ARC.

Future Work. The next step of our work is to combine the two approaches into a unified one. This can be done by using the bidirectional search algorithm to solve tasks, and then creating new operations out of abstractions based on the tasks solved. To fill out the learning approach, we can consider synthesizing inverse and conditional inverse operations for newly learned abstractions, perhaps as its own synthesis problem. Our approach remains to be scaled up to a full DSL capable of solving ARC. Incorporating more sophisticated inverse semantics and type-directed search are important components of the full bidirectional approach.