
1 Introduction and Previous Work

It has been pointed out that there are two important things in science: (A) Finding answers to given questions, and (B) Coming up with good questions, e.g., [2, 30, 31, 42, 60, 63, 65, 68]. (A) is arguably just the standard problem of computer science. But how can the creative part (B) be implemented in artificial systems through reinforcement learning (RL), gradient-based artificial neural networks (NNs), and other machine learning methods?

For at least three decades, work on artificial scientists equipped with artificial curiosity and creativity has been published that addresses this question, e.g., [33, 38, 40, 42, 48, 53, 57, 60, 70, 72, 73]. One early such work is the intrinsic motivation-based adversarial system from 1990 [38, 42]. It is an artificial Q&A system designed to invent and answer questions. For that, it uses two artificial NNs. The first NN is called the controller C. C probabilistically generates outputs that may influence an environment. The second NN is called the world model M. It predicts the environmental reactions to C’s outputs. Using gradient descent, M minimizes its error, thus becoming a better predictor. But in a zero-sum game, the reward-maximizing C tries to find sequences of output actions that maximize the error of M. M’s loss is the gain of C (as in GANs [10, 64], a later application of artificial curiosity, and in the more general settings of sequential data and RL [20, 74, 80]).

C is asking questions through its action sequences: What happens if I do that? M is learning to answer those questions. C is motivated to come up with questions where M does not yet know the answer and loses interest in questions with known answers.

This type of Q&A system helps to understand the world, which is necessary for planning [38, 39, 42] and may boost external reward [2, 31, 40, 50, 52, 58]. Clearly, the adversarial approach makes for a fine exploration strategy in many deterministic environments. In stochastic environments, however, it might fail. C might learn to focus on those parts of the environment where M can always get high prediction errors, due to randomness or due to computational limitations of M. For example, an agent controlled by C might get stuck in front of a TV screen showing highly unpredictable white noise, e.g., [2, 57]. Therefore, in stochastic environments, C’s reward should not be the errors of M, but (an approximation of) the first derivative of M’s errors across subsequent training iterations, that is, M’s learning progress or improvements [40, 54]. As a consequence, despite M’s high errors in front of a noisy TV screen, C won’t get rewarded for getting stuck there, simply because M’s errors won’t improve. Both the totally predictable and the fundamentally unpredictable will get boring.
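As a minimal illustration of the difference between error-based and progress-based intrinsic reward, consider the following Python sketch. The function name and the particular averaging are our own illustrative assumptions, not the exact formulation of [40, 54]; it simply rewards the reduction of M’s prediction errors between two successive training iterations:

```python
import numpy as np

def learning_progress_reward(errors_before, errors_after):
    """Intrinsic reward ~ first derivative of M's errors: how much M's
    prediction errors on the same inputs shrank after one more training
    iteration. In front of a noisy TV both error vectors stay high but
    roughly equal, so the reward is ~0 and C loses interest."""
    improvement = np.asarray(errors_before, float) - np.asarray(errors_after, float)
    return float(np.maximum(improvement, 0.0).mean())
```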

This simple insight led to lots of follow-up work [57]. For example, one particular RL approach for artificial curiosity in stochastic environments was published in 1995 [72]. A simple M learned to predict or estimate the probabilities of the environment’s possible responses, given C’s actions. After each interaction with the environment, C’s intrinsic reward was the KL-Divergence [25] between M’s estimated probability distributions before and after the resulting new experience—the information gain [72]. This was later also called Bayesian Surprise [19]. Compare earlier work on information gain [66] and its maximization without RL & NNs [6].

In the general RL setting where the environment is only partially observable [61, Sec. 6], C and M may greatly profit from a memory of previous events [38, 39, 43]. Towards this end, both C and M can be implemented as LSTMs [7, 12, 16, 61] or Transformers [28, 75].

The better the predictions of M, the fewer bits are required to encode the history H of observations, because short codes can be used for observations that M considers highly probable [17, 83]. That is, the learning progress of M has a lot to do with the concept of compression progress [53, 55, 56, 57]. But it is not quite the same thing. In particular, it does not take into account the bits of information needed to specify M. A more general approach is based on algorithmic information theory, e.g., [22, 26, 51, 69, 78, 79]. Here C’s intrinsic reward is indeed based on algorithmic compression progress [53, 55, 56, 57], measured through some coding scheme for the weights of the model network, e.g., [8, 15, 23, 24, 46, 47, 71], together with a coding scheme for the history of all observations so far, given the model [15, 17, 34, 53, 78, 83]. Note that the history of science is a history of compression progress through incremental discovery of simple laws that govern seemingly complex observation sequences [53, 55, 56, 57].

In early systems, the questions asked by C were restricted in the sense that they always referred to all the details of future inputs, e.g., pixels [38, 42]. That’s why in 1997, a more general adversarial RL machine was built that could ignore many or all of these details and ask arbitrary abstract questions with computable answers [48,49,50]. Example question: if we run this policy (or program) for a while until it executes a special interrupt action, will the internal storage cell number 15 contain the value 5, or not? Again there are two learning, reward-maximizing adversaries playing a zero-sum game, occasionally betting on different yes/no outcomes of such computational experiments. The winner of such a bet gets a reward of 1, the loser –1. So each adversary is motivated to come up with questions whose answers surprise the other. And both are motivated to avoid seemingly trivial questions where both already agree on the outcome, or seemingly hard questions that neither of them can reliably answer for now. This is the approach closest to what we will present in the following sections.

All the systems above (now often called CM systems [62]) actually maximize the sum of the standard external rewards (for achieving user-given goals) and the intrinsic rewards. Does this distort the basic RL problem?

It turns out that it does not, at least not much. Unlike the external reward for eating three times a day, the curiosity reward in the systems above is ephemeral, because once something is known, there is no additional intrinsic reward for discovering it again. That is, the external reward tends to dominate the total reward. In totally learnable environments, the intrinsic reward even vanishes next to the external reward in the long run. Which is nice, because in most RL applications we care only about the external reward.

RL Q&A systems of the 1990s did not explicitly, formally enumerate their questions. But the more recent PowerPlay framework (2011) [60, 70] does. Let us step back for a moment. What is the set of all formalizable questions? How to decide whether a given question has been answered by a learning machine? To define a question, we need a computational procedure that takes a solution candidate (possibly proposed by a policy) and decides whether it is an answer to the question or not. PowerPlay essentially enumerates the set of all such procedures (or some user-defined subset thereof), thus enumerating all possible questions or problems. It searches for the simplest question that the current policy cannot yet answer but can quickly learn to answer without forgetting the answers to previously answered questions. What is the simplest such Q&A to be added to the repertoire? It is the cheapest one, that is, the one that is found first. Then the next trial starts, where new Q&As may build on previous Q&As.

In our empirical investigation of Sect. 3, we will revisit the above-mentioned concepts of complex computational experiments with yes/no outcomes, focusing on two settings: (1) the generation of experiments driven by model prediction error in a deterministic reinforcement-providing environment, and (2) an approach where C (driven by information gain) generates pure thought experiments in the form of weight matrices of RNNs.

2 Self-invented Experiments Encoded as Neural Networks

We present a CM system where C can design essentially arbitrary computational experiments (including thought experiments) with binary yes/no outcomes. Experiments may run for several time steps. However, C will prefer simple experiments whose outcomes still surprise M, until they become boring.

In general, both the controller C and the model M can be implemented as (potentially multi-dimensional) LSTMs [11]. At each time step \(t=1,2, \ldots \), C’s input includes the current sensory input vector in(t), the external reward vector \(R_e(t)\), and the intrinsic curiosity reward \(R_i(t)\). C may or may not interact directly with the environment through action outputs. How does C ask questions and propose experiments? C has an output unit called the START unit. Once it becomes active (>0.5), C uses a set of extra output units for producing the weight matrix or program \(\theta \) of a separate RNN or LSTM called E (for Experiment), in fast weight programmer style [4, 9, 18, 21, 36, 37, 41, 44, 45].

E takes sensory inputs from the environment and produces actions as outputs. It also has two additional output units, the HALT unit [59] and the RESULT unit. Once the weights \(\theta \) are generated at time step \(t'\), E is tested in a trial, interacting with some environment. Once E’s HALT unit exceeds 0.5 at some later time step \(t''\), the current experiment ends. That is, the experiment computes its own runtime [59]. The experimental outcome \(r(t'')\) is 1 if the activation result\((t'')\) of E’s RESULT unit exceeds 0.5, and 0 otherwise. At time \(t'\), that is, before the experiment is executed, M has to compute its output pr\((t') \in [0,1]\) from \(\theta \) (and the history of C’s inputs and actions up to \(t'\), which includes all previous experiments and their outcomes). Here, pr\((t')\) models M’s (un)certainty as to whether the final binary outcome of the experiment will be 1 (YES) or 0 (NO). Then the experiment is run.

In short, C is proposing an experimental question in the form of \(\theta \) that will yield a binary answer (unless some time limit is reached). M is trying to predict this answer before the experiment is executed. Since E is an RNN and thus a general computer whose weight matrix can implement any program executable on a traditional computer [67], any computable experiment with a binary outcome can be implemented in its weight matrix (ignoring storage limitations of finite RNNs or other computers). That is, by generating an appropriate weight matrix \(\theta \), C can ask any scientific question with a computable solution. In other words, C can propose any scientific hypothesis that is experimentally verifiable or falsifiable.
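A possible PyTorch sketch of E is shown below. The layer sizes, the use of a single LSTM cell, and the helper for loading C’s generated weight vector are illustrative assumptions; in the framework above, E is a general RNN/LSTM whose entire weight matrix \(\theta \) is written by C.

```python
import torch
import torch.nn as nn

class Experiment(nn.Module):
    """E: an RNN with action outputs plus a HALT unit and a RESULT unit."""
    def __init__(self, obs_size, act_size, hidden=16):
        super().__init__()
        self.cell = nn.LSTMCell(obs_size, hidden)
        self.head = nn.Linear(hidden, act_size + 2)  # actions, HALT, RESULT

    def forward(self, obs, state):
        h, c = self.cell(obs, state)
        out = self.head(h)
        actions = out[..., :-2]
        halt = torch.sigmoid(out[..., -2])    # experiment ends once halt > 0.5
        result = torch.sigmoid(out[..., -1])  # binary outcome: result > 0.5
        return actions, halt, result, (h, c)

def load_generated_weights(experiment, theta):
    """Overwrite E's parameters with the flat vector theta emitted by C
    (theta must contain exactly as many entries as E has parameters)."""
    torch.nn.utils.vector_to_parameters(theta, experiment.parameters())
```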

At \(t''\), M’s previous prediction pr\((t')\) is compared to the later observed outcome \(r(t'')\) of C’s experiment (which spans \(t''-t'\) time steps), and C’s intrinsic curiosity reward \(R_i(t'')\) is proportional to M’s surprise. To calculate it, we interpret pr\((t')\) as M’s estimated probability of \(r(t'')\), given the history of observations so far. Then we train M by gradient descent (with regularization to avoid overfitting) for a fixed amount of time to improve all of its previous predictions including the most recent one. This yields an updated version of M called \(M^*\).

In general, \(M^*\) will compute a different prediction PR\((t')\) of \(r(t'')\), given the history up to \(t'-1\). At time \(t''\), the contribution \(R_{IG}(t'')\) to C’s curiosity reward is proportional to the apparent resulting information gain, the KL-divergence

$$ R_{IG}(t'') \sim D_{KL} \big (PR(t') || pr(t')\big ). $$

If M has a confident belief in a particular experimental outcome, but this belief gets shattered in the wake of C’s experiment, there will be a major surprise and a big insight for M, as well as lots of intrinsic curiosity reward for C. On the other hand, if M was quite unsure about the experimental outcome, and remains quite unsure afterwards, then C’s experiment can hardly surprise M and C will fail to profit much. C is motivated to propose interesting hypotheses or experiments that violate M’s current deep beliefs and expand its horizon. An alternative intrinsic curiosity reward would be based on compression progress [53, 55, 56, 57].
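Since the outcome is binary, pr\((t')\) and PR\((t')\) parameterize Bernoulli distributions, and the KL-divergence above reduces to the following scalar expression (a minimal sketch; the clipping constant is an assumption for numerical stability):

```python
import math

def bernoulli_kl(p_new, p_old, eps=1e-7):
    """D_KL(Bernoulli(p_new) || Bernoulli(p_old)); with p_new = PR(t') and
    p_old = pr(t'), C's curiosity reward R_IG(t'') is proportional to this."""
    p_new = min(max(p_new, eps), 1.0 - eps)
    p_old = min(max(p_old, eps), 1.0 - eps)
    return (p_new * math.log(p_new / p_old)
            + (1.0 - p_new) * math.log((1.0 - p_new) / (1.0 - p_old)))
```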

Note that the entire experimental protocol is the responsibility of \(\theta \). Through \(\theta \), E must initialize the experiment (e.g., by resetting the environment or moving the agent to some start position if that is important to obtain reliable results), then run the experiment by executing a sequence of computational steps or actions, and translate the incoming data sequence into some final abstract binary outcome YES or NO.

C is motivated to design experimental protocols \(\theta \) that surprise M. C will get bored by experiments whose outcomes are predicted by M with little confidence (recall the noisy TV), as well as by experiments whose outcomes are correctly predicted by M with high confidence. C will get rewarded for surprising experiments whose outcomes are incorrectly predicted by M with high confidence.

A negative reward per time step encourages C to be efficient and lazy, and to come up with simple and fast yet still surprising experiments. If physical actions in the environment cost much more energy (resulting in immediate negative reward) than E’s internal computations per time step, C is motivated to propose a \(\theta \) defining a “thought experiment” requiring only internal computations, without executing physical actions in the (typically non-differentiable) environment. In fact, due to C’s bias towards the computationally cheapest and least costly experiments that are still surprising to M, most of C’s initial experiments may be thought experiments. Hence, since C, E and M are differentiable, not only M but also C may often be trainable by backpropagation [4] rather than the generally slower policy gradient methods [1, 29, 77, 81]. Of course, this is only true if the reward function is also differentiable with respect to C’s parameters.

3 Experimental Evaluation

Here we present initial studies of the automatic generation of interesting experiments encoded as NNs. We evaluate these systems empirically and discuss the associated challenges. This includes two setups: (1) Adversarial intrinsic reward encourages experiments executed in a differentiable environment through sequences of continuous control actions. We demonstrate that these experiments aid the discovery of goal states in a sparse reward setting. (2) Pure thought experiments encoded as RNNs (without any environmental interactions) are guided by an information gain reward.

Together, these two setups cover the important aspects discussed in Sect. 2: the use of abstract experiments with binary outcomes as a method for curious exploration, and the creation of interesting pure thought experiments encoded as RNNs. We leave the integration of both setups into a single system (as described in Sect. 2) for future work.

3.1 Generating Experiments in a Differentiable Environment

Reinforcement learning (RL) usually involves exploration in an environment with non-differentiable dynamics. This requires RL methods such as policy gradients [82]. To simplify our investigation and focus solely on the generation of self-invented experiments, we introduce a fully differentiable environment that allows for computing analytical policy gradients via backpropagation. This does not limit the generality of our approach, as standard RL methods can be used instead.

Our continuous force field environment is depicted in Fig. 1. The agent has to navigate through a 2D environment with a fixed external force field. This force field can have different levels of complexity. The states in this environment are the position and velocity of the agent. The agent’s actions are real-valued force vectors applied to itself. To encourage laziness and a bias towards simple experiments, each time step is associated with a small negative reward (\(-0.1\)). A sparse large reward (100) is given whenever the agent gets very close to the goal state. We operate in the single life setting without episodic resets. Additional information about the force field environment can be found in Appendix A. Since the environment is deterministic, it is sufficient for C to generate experiments whose results the current M cannot predict.

Fig. 1. A differentiable force field environment. The agent (red) has to navigate to the goal state (yellow) while the external force field exerts forces on the agent.

Fig. 2. Generating self-invented experiments in a differentiable environment. A controller \(C_\phi \) is motivated to generate experiments \(E_\theta \) that still surprise the model \(M_\textbf{w}\). After execution in the environment, the experiments and their binary results are stored in memory. The model is trained on the history of previous experiments.

Method. Algorithm 1 and Fig. 2 summarize the process for generating a sequence of interesting abstract experiments with binary outcomes. The goal is to test the following three hypotheses:

  • Generated experiments implement exploratory behavior, facilitating the reaching of goal states.

  • If there are negative rewards in proportion to the runtime of experiments, then the average runtime will increase over time, as the controller will find it harder and harder to come up with new short experiments whose outcomes the model cannot yet predict.

  • As the model learns to predict the yes/no results of more and more experiments, it becomes harder for the controller to create experiments whose outcomes surprise the model.

The generated experiments have the form \(E_\psi (s) = (a, \hat{r})\), where \(E_\psi \) is a linear feedforward network with parameters \(\psi \), s is the environment state, a are the actions and \(\hat{r} \in [0, 1]\) is the experimental result. Both s and a are real-valued vectors.

Instead of a HALT unit, a single scalar \(\tau \in \mathbb {R}^+\) determines the number of steps for which an experiment will run. To further simplify the setup, the experiment network is a feedforward NN without recurrence. To make the experimental result differentiable with respect to the runtime parameter, \(\tau \) predicts the mean of a Gaussian distribution with fixed variance over the number of steps. The actual result \(\tilde{r}\) is the expectation of the result unit \(\hat{r}\) over the distribution defined by \(\tau \) (more details on this can be found in Appendix A.1). The binarized result r has the value 1 if \(\tilde{r} > 0.5\), and 0 otherwise. The parameters \(\theta \) of the experiment are the network parameters \(\psi \) together with the runtime parameter \(\tau \), i.e. \(\theta := (\psi , \tau )\).
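The following sketch shows one way to implement this differentiable outcome. Here `result_per_step` is assumed to be a 1-D tensor of per-step RESULT activations \(\hat{r}\) collected during the rollout; the Gaussian variance and the normalization are assumptions (the exact formulation is in Appendix A.1):

```python
import torch

def soft_result(result_per_step, tau, sigma=2.0):
    """Weight the RESULT-unit activation of every experiment step by a
    Gaussian over runtimes centred at tau, so that the outcome is
    differentiable with respect to the runtime parameter tau."""
    steps = torch.arange(1, len(result_per_step) + 1, dtype=torch.float32)
    weights = torch.exp(-0.5 * ((steps - tau) / sigma) ** 2)
    weights = weights / weights.sum()
    r_tilde = (weights * result_per_step).sum()   # continuous result in [0, 1]
    r_binary = (r_tilde > 0.5).float()            # binarized result r
    return r_tilde, r_binary
```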

For a given starting state s, the controller \(C_\phi \) generates experiments: \(C_\phi (s) = \theta \). \(C_\phi \) is a multi-layer perceptron (MLP) with parameters \(\phi \), and \(\theta \) denotes the parameters of the generated experiment. The model \(M_\textbf{w}\) is an MLP with parameters \(\textbf{w}\). It makes a prediction \(M_\textbf{w}(s, \theta ) = \hat{o}\), with \(\hat{o} \in [0, 1]\), for an experiment defined by the starting state s and the parameters \(\theta \).

During each iteration of the algorithm, \(C_\phi \) generates an experiment based on the current state s of the environment. This experiment is executed until the cumulative halting probability defined by the generated \(\tau \) exceeds a certain threshold (e.g., 99%). The starting state s, experiment parameters \(\theta \) and binary result r are saved in a memory buffer \(\mathcal {D}\) of experiments. Every state encountered during the experiment is saved to the state memory buffer \(\mathcal {B}\).
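The execution phase of one iteration might look roughly as follows. The environment interface (`env.state()`, `env.step()`), the 99% threshold, and the fixed variance are assumptions consistent with the description above; \(E_\psi \) is assumed to map a state to an action and a RESULT activation, as in the definition of the experiment network:

```python
import torch
from torch.distributions import Normal

def execute_until_halt(env, E_psi, tau, sigma=2.0, p_halt=0.99, max_steps=200):
    """Run a generated experiment until the cumulative halting probability
    of the Gaussian runtime distribution N(tau, sigma^2) exceeds p_halt."""
    halt = Normal(float(tau), sigma)
    states, results = [], []
    s = env.state()
    for t in range(1, max_steps + 1):
        a, r_hat = E_psi(s)          # actions and RESULT activation
        s = env.step(a)              # one step in the force field
        states.append(s)
        results.append(r_hat)
        if halt.cdf(torch.tensor(float(t))) > p_halt:
            break
    return states, results
```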

After the experiment execution, the model \(M_\textbf{w}\) is trained for a fixed number of steps of stochastic gradient descent (SGD) to minimize the loss

$$\begin{aligned} \mathcal {L}_M = \mathbb {E}_{(s, \theta , r) \sim \mathcal {D}}[\text {bce}(M_\textbf{w}(s, \theta ), r)], \end{aligned}$$
(1)

where \(\text {bce}\) is the binary cross-entropy loss function.
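A minimal PyTorch sketch of this model update; the buffer layout (a list of \((s, \theta , r)\) tensors), batch size, and number of SGD steps are assumptions:

```python
import random
import torch
import torch.nn.functional as F

def train_model(M, opt_M, experiment_buffer, sgd_steps=100, batch=64):
    """One round of model training on Eq. (1): binary cross-entropy between
    M's prediction for a stored experiment (s, theta) and its binary result r.
    M(s, theta) is assumed to output a probability in [0, 1]."""
    for _ in range(sgd_steps):
        items = random.sample(experiment_buffer, min(batch, len(experiment_buffer)))
        s, theta, r = map(torch.stack, zip(*items))
        loss = F.binary_cross_entropy(M(s, theta), r)   # L_M of Eq. (1)
        opt_M.zero_grad()
        loss.backward()
        opt_M.step()
```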

The third and last part of each iteration is the training of the controller \(C_\phi \). The loss that is being minimized via SGD is

$$\begin{aligned} \mathcal {L}_C = \mathbb {E}_{s \sim \mathcal {B}} [- \text {bce}\big ( M_\textbf{w}(s, C_\phi (s)), \tilde{r}(C_\phi (s), s) \big ) - R_e(C_\phi (s), s)]. \end{aligned}$$
(2)

The function \(\tilde{r}\) maps the experiment parameters and starting state to the continuous result of the experiment. The function \(R_e\) maps the experiment parameters and starting state to the external reward. Note that gradient information will flow back from \(\tilde{r}\) and \(R_e\) to \(\phi \) through the execution of the experiment in the differentiable environment. The first term corresponds to the intrinsic reward for the controller, which encourages it to generate experiments whose outcomes \(M_{\textbf {w}}\) cannot predict. The second term is the external reward from the environment, which punishes long experiments. Since the reward for reaching the goal is sparse and not differentiable with respect to the experiment’s actions, no information about the goal state reaches \(C_\phi \) through the gradient.
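A corresponding sketch of one controller update on Eq. (2). The rollout helper `run_experiment` is a hypothetical stand-in for executing the generated experiments in the differentiable environment and returning \(\tilde{r}\) and \(R_e\); the binary cross-entropy is written out so that it stays differentiable with respect to both arguments:

```python
import random
import torch

def bce(p, t, eps=1e-7):
    """Elementwise binary cross-entropy, differentiable in p and in the
    (soft) target t."""
    p = p.clamp(eps, 1 - eps)
    return -(t * p.log() + (1 - t) * (1 - p).log())

def train_controller(C, M, opt_C, state_buffer, run_experiment, batch=32):
    """One SGD step on Eq. (2). opt_C holds only C's parameters, so M stays
    fixed even though gradients flow through it."""
    s = torch.stack(random.sample(state_buffer, min(batch, len(state_buffer))))
    theta = C(s)                                   # proposed experiments
    r_tilde, R_e = run_experiment(theta, s)        # differentiable rollout
    surprise = bce(M(s, theta), r_tilde).mean()    # intrinsic term of Eq. (2)
    loss = -surprise - R_e.mean()
    opt_C.zero_grad()
    loss.backward()
    opt_C.step()
```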

Fig. 3. Experiments in the differentiable force field environment.

Results and Discussion. To investigate our first hypothesis, Fig. 3a shows the cumulative number of times a goal state was reached during an experiment, adjusted by the number of environment interactions of each experiment. Specifically, it shows \(h(j) = \sum _{k=1}^j \frac{g_k}{n_k}\), where \(j = 1, 2, \ldots \) is the index of the generated experiment, \(g_k\) is 1 if the goal state was reached during the kth experiment and 0 otherwise, and \(n_k\) is the runtime of the kth experiment. Our method, as described above and in Algorithm 1, reaches the most goal states per environment interaction. Purely random experiments also discover goal states, but less frequently. Note that such random exploration in parameter space has been shown to be a powerful exploration strategy [32, 35, 76]. The average runtime of the random experiments is 50 steps, compared to 22.9 for the experiments generated by \(C_\phi \). To rule out a potential unfair bias due to different runtimes, Fig. 6 in the Appendix shows an additional baseline of random experiments with an average runtime of 20 steps, leading to results very similar to those of longer running random experiments. If we remove the intrinsic adversarial reward, the controller is left only with the external reward. This means that there is no \(\text {bce}\) term in Eq. 2. It is not surprising that in this setting, \(C_\phi \) fails to generate experiments that discover goal states, since the gradient of \(\mathcal {L}_C\) contains no information about the sparse goal reward.
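For reference, the quantity plotted in Fig. 3a can be computed as follows (a trivial sketch of the definition of h(j) given above):

```python
import numpy as np

def goals_per_interaction(goal_reached, runtimes):
    """h(j) = sum_{k<=j} g_k / n_k: cumulative goal discoveries, each
    weighted by the runtime of the experiment that found it."""
    g = np.asarray(goal_reached, dtype=float)   # g_k in {0, 1}
    n = np.asarray(runtimes, dtype=float)       # n_k > 0
    return np.cumsum(g / n)
```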

Figure 3b addresses our second and third hypotheses. \(C_\phi \) indeed tends to prolong experiments as \(M_\textbf{w}\) has been trained on more experiments, even though experiments with long runtimes are discouraged through the punitive external reward. Our explanation for this is that it becomes harder with time for \(C_\phi \) to come up with short experiments for which \(M_\textbf{w}\) cannot yet accurately predict the correct results. This is supported by the fact that the prediction accuracy of \(M_\textbf{w}\) for newly generated experiments goes up. Specifically, Fig. 3b shows the difference between prediction accuracy of the current \(M_\textbf{w}\) for the newly generated experiment and the expected prediction accuracy of the current \(M_\textbf{w}\) for experiments randomly sampled from a simple prior. This accounts for the general gain of \(M_\textbf{w}\)’s prediction accuracy over the course of training. It can be seen that in the beginning, \(C_\phi \) is successful at creating adversarial experiments that surprise \(M_\textbf{w}\). With time, however, it fails to continue doing so and is forced to create longer experiments to challenge \(M_\textbf{w}\).

Algorithm 1

3.2 Pure RNN Thought Experiments

The previous experimental setup uses feedforward NNs as experiments and an intrinsic reward function that is differentiable with respect to the controller’s weights. This section investigates a complementary setup: interesting pure thought experiments (with no environment interactions) are generated in the form of RNNs without any inputs, driven by an intrinsic curiosity reward based on information gain which we treat as non-differentiable.

Method. In many ways, this new setup (depicted in Fig. 4 and described in Algorithm 2 in the Appendix) is similar to the one presented in Sect. 3.1. In what follows, we highlight the important differences.

An experiment \(E_\theta \) is an RNN of the form \((h_{t+1}, r_{t+1}, \gamma _{t+1}) = E_\theta (h_t) \), where \(h_t\) is the hidden state vector, \(r_t \in \{0, 1\}\) is the binary result at experiment time step t, and \(\gamma _t \in [0, 1]\) is the activation of the HALT unit. The result r of \(E_\theta \) is the \(r_t\) of the first experiment step t at which \(\gamma _t\) exceeds 0.5. Since there is no external environment and the experiments are independent of each other, the model \(M_\textbf{w}\) is again a simple MLP with parameters \(\textbf{w}\). It takes only the experiment parameters \(\theta \) as input and makes a result prediction \(\hat{o} = M_\textbf{w}(\theta ), \hat{o} \in [0, 1]\).
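A possible PyTorch sketch of such an input-free thought experiment. The hidden size, the constant dummy input, and the maximum step count are implementation assumptions; the generated \(\theta \) would be loaded into this network’s parameters as in Sect. 2:

```python
import torch
import torch.nn as nn

class ThoughtExperiment(nn.Module):
    """E_theta for pure thought experiments: an input-free RNN whose output
    heads are a RESULT unit and a HALT unit gamma."""
    def __init__(self, hidden=8):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)   # constant dummy input, no sensors
        self.head = nn.Linear(hidden, 2)     # (result, gamma)

    def run(self, max_steps=50):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        x = torch.zeros(1, 1)
        for _ in range(max_steps):
            h, c = self.cell(x, (h, c))
            result, gamma = torch.sigmoid(self.head(h)).squeeze(0)
            if gamma > 0.5:                  # the experiment halts itself
                return int(result > 0.5)
        return int(result > 0.5)             # fall back after max_steps
```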

As mentioned above, here we treat the intrinsic reward signal as non-differentiable. This means that—in contrast to the method presented in Sect. 3.1—the controller cannot receive information about \(M_\textbf{w}\) from gradients that are backpropagated through the model. Instead, it has to infer the learning behavior of \(M_\textbf{w}\) from the history \(\omega \) of previous experiments and intrinsic rewards to come up with new surprising experiments. The controller \(C_\phi \) is now an LSTM that is trained by DDPG [27] and generates new experiments solely based on the history of past experiments: \(C_\phi (\omega ) = \theta \). The history \(\omega \) is a sequence of tuples \((\theta _i, r_i, R_i)\), where \(i = 1, 2, \ldots \) is the index of the experiment. It contains experiments up to the last one that has been executed. More details on the training of \(M_\textbf{w}\) and the algorithm can be found in Appendix B.
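The history-conditioned controller could be sketched as follows (layer sizes are assumptions; the DDPG training loop itself is omitted):

```python
import torch
import torch.nn as nn

class HistoryController(nn.Module):
    """C_phi for Sect. 3.2: an LSTM that reads the history of previous
    (theta_i, r_i, R_i) tuples and emits the parameters of the next
    thought experiment."""
    def __init__(self, theta_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(theta_dim + 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, theta_dim)

    def forward(self, history):
        # history: tensor of shape (1, num_experiments, theta_dim + 2)
        _, (h, _) = self.lstm(history)
        return self.out(h[-1])        # parameters of the next experiment
```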

For these pure thought experiments, we use a reward based on information gain. Let \(\textbf{w}\) be M’s weights at a certain point in time. Then a new experiment with parameters \(\theta \) is generated, executed and saved to the buffer. On this buffer \(\mathcal {D}\), which now includes \(\theta \), M is trained for a fixed number of SGD steps to obtain new weights \(\textbf{w}^*\). Then, the information gain reward associated with experiment \(\theta \) is

$$\begin{aligned} R_{IG}(\theta , \textbf{w}, \textbf{w}^*) = \frac{1}{|\mathcal {D}|} \sum _{\tilde{\theta } \in \mathcal {D}} D_{KL}(M_{\textbf{w}^*}(\tilde{\theta }) || M_\textbf{w}(\tilde{\theta })), \end{aligned}$$
(3)

where we interpret the output of the model as a Bernoulli distribution.
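A minimal sketch of Eq. (3), assuming `M_old` is a frozen copy of the model taken before the SGD steps (e.g., via `copy.deepcopy`) and `thetas` iterates over the experiment parameters stored in \(\mathcal {D}\):

```python
import torch
from torch.distributions import Bernoulli, kl_divergence

def information_gain_reward(M_new, M_old, thetas):
    """Mean KL divergence over the buffer between the updated and the
    previous model's Bernoulli predictions of the experiment outcomes."""
    with torch.no_grad():
        kls = [kl_divergence(Bernoulli(probs=M_new(th)),
                             Bernoulli(probs=M_old(th))) for th in thetas]
    return torch.stack(kls).mean().item()
```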

Fig. 4. Generating abstract thought experiments encoded as RNNs. The model is trained to predict the results of previous experiments. The controller generates new interesting thought experiments (without environment interactions) based on the history of previous experiments, their results and rewards.

Fig. 5. Empirical results for pure thought experiments encoded as RNNs. Blue: the average runtime of each experiment generated by \(C_\phi \). Purple: information gain reward (Eq. 3) for \(C_\phi \) associated with each experiment. Mean with bootstrapped 95% confidence intervals across 20 seeds.

Results and Discussion. Figure 5 shows the information gain reward associated with each new experiment that \(C_\phi \) generates. We observe that, after a short initial phase, the intrinsic information gain reward steadily declines. This is similar to what we observe for the prediction accuracy in Sect. 3.1: it becomes harder for the controller to generate experiments that surprise the model. It should be mentioned that this is a natural effect: as the model is trained on more and more experiments, every additional experiment contributes on average less to the model’s change during training, and is thus associated with less information gain reward. An interesting, albeit minor, effect shown in Fig. 5 is that in this setup, too, the average runtime of the generated experiments increases slightly over time, even though there is no negative reward for longer thought experiments. For shorter experiments, it is apparently easier for the model to learn to predict the results. Hence, at least in the beginning, they yield more learning progress and more information gain. Later, however, longer experiments become more interesting.

In comparison to the experiments generated in Sect. 3.1, the present ones have a much shorter runtime. This is a side-effect of the experiments being RNNs with a HALT unit; for randomly initialized experiments, the average runtime is approximately 1.6 steps.

4 Conclusion and Future Work

We extended the neural Controller-Model (CM) framework through the notion of arbitrary self-invented computational experiments with binary outcomes: experimental protocols are essentially programs interacting with the environment, encoded as the weight matrices of RNNs generated by the controller. The model has to predict the outcome of an experiment based solely on the experiment’s parameters. By creating experiments whose outcomes surprise the model, the controller curiously explores its environment and what can be done in it. Such a system is analogous to a scientist who designs experiments to gain insights about the physical world. However, an experiment does not necessarily involve actions taken in the environment: it may be a pure thought experiment akin to those of mathematicians.

We provide an empirical evaluation of two simple instances of such systems, focusing on different and complementary aspects of the idea. In the first setup, we show that self-invented abstract experiments encoded as feedforward networks interacting with a continuous control environment facilitate the discovery of rewarding goal states. Furthermore, we see that over time the controller is forced to create longer experiments (even though this is associated with a larger negative external reward) as short experiments start failing to surprise the model. In the second setup, the controller generates pure abstract thought experiments in the form of RNNs. We observe that over time, newly generated experiments result in less intrinsic information gain reward. Again, later experiments tend to have slightly longer runtime. We hypothesize that this is because simple experiments initially lead to a lot of information gain per time interval, but later do not provide much insight anymore.

These two empirical setups should be seen as initial steps towards more capable systems such as the one proposed in Sect. 2. Scaling these methods to more complex environments and to the generation of more sophisticated experiments, however, is not without challenges. Direct generation and interpretation of NN weights may not be very effective for large and deep networks. Previous work [3] already combined hypernetworks [13] and policy fingerprinting [5, 14] to generate and evaluate policies. Similar innovations will facilitate the generation of abstract self-invented experiments beyond the small-scale setups presented in this paper.