1 Introduction

The core problem of automatic programming is program synthesis. The neural programmer-interpreter (NPI) does not generate code fragments; instead, it learns transformation rules from input and output data and then accomplishes a task by applying these rules.

1.1 Program synthesis

The task of program synthesis is to find a program that satisfies some form of constraint. Unlike a traditional compiler, which converts high-level code to low-level code through semantic translation, program synthesis usually searches the program space for a program that fits the constraints; the most common constraints are input and output pairs.

1.2 Neural programmer-interpreters

The main challenge of automatic programming is to let the machine learn programs itself and then quickly compose them into new programs that solve various tasks. NPI has a core module based on an LSTM (long short-term memory) sequence model. It takes inputs such as the program arguments and the environment state. Its output is a key that indicates which program to call next, together with a signal indicating whether the current program should terminate.

The NPI has three learnable components: a recurrent core, a persistent key-value program memory, and domain-specific encoders. NPI can express higher-level programs by composing lower-level programs it has learned, which reduces sample complexity and gives better generalization than sequence-to-sequence LSTMs. The program memory allows additional tasks to be learned efficiently by building on existing programs. NPI can also use the environment (such as a scratch pad with read and write pointers) to cache intermediate values of a computation, reducing the storage burden on the hidden units. The NPI is trained in a fully supervised way: it learns not from a large number of relatively weak labels, but from a small number of rich examples.

Currently, the NPI model can learn more than 21 kinds of programs, including adding pixels to images, sorting, subtraction, and trajectory planning. Crucially, all of these can be implemented by a single NPI model with the same parameters.

By representing subprograms with neural networks and learning them from data, the model can generalize to tasks that involve rich sensory input and uncertainty. The supervision approach adopted in this article provides fewer labels, but each label carries more information, allowing the model to learn more complex compositions.

2 Related works

2.1 NPI related works

Rumelhart et al. [1] described dynamically programmable networks in which the activations of one network serve as the weights of a second network. Sutskever and Hinton [2] studied higher-order symbolic relations. Donnarumma et al. [3] developed a key component of a cognitive control system. Schmidhuber [4] studied a slowly changing network that generates context-dependent weights for a second, rapidly changing network; this approach has only been demonstrated in very limited settings.

Schneider and Chein [5] and Anderson [6], [7] proposed several theories of how some brain regions control other parts of the brain to accomplish multiple tasks. Graves et al. [8] developed the Neural Turing Machine (NTM), which is capable of learning and executing simple copying, priority sorting, and associative recall.

Vinyals [9] proposed the pointer network, which generalizes the notion of encoder attention to provide an output space whose size varies with the length of the input sequence. This work is also closely related to program induction.

Banzhaf et al. [10] searched for useful programs among candidate programs. Mou et al. [11] processed program symbols to learn max-margin program embeddings with the help of parse trees. Zaremba and Sutskever [12] trained an LSTM model to read simple program text character by character and correctly predict the program output. Joulin and Mikolov [13] augmented a recurrent network with a pushdown stack, which allows generalization to input sequences longer than those seen during training for several algorithmic patterns.

Several papers have also studied program induction with recurrent neural networks (Zaremba and Sutskever [14]; Zaremba et al. [15]; Kaiser and Sutskever [16]; Kurach et al. [17, 18]). Although the motivation is similar, our approach is different: compositional structure is explicitly incorporated into the network through a program memory, allowing the model to learn new programs by combining subprograms.

2.2 The history of LSTM

The original LSTM included cells, an input gate, and an output gate, but no forget gate and no peephole connections. Some experiments even omitted the output gate, the unit biases, or the input activation function. Training used a mixture of real-time recurrent learning and backpropagation through time, so that study did not use the exact gradient for training. Another feature of the original version is full gate recurrence: all gates receive recurrent input from all gates at the previous time step, in addition to the recurrent input from the block outputs. This feature does not appear in any later version.

The first modification of the LSTM architecture added a forget gate, which allows the LSTM to reset its own state and thus to learn continual tasks.

Gers and Schmidhuber [19] argued that in order to learn precise timing, the cell needs to control the gates. Until then, this was only possible through an open output gate. To make precise timing easier to learn, peephole connections from the cell to the gates were added to the architecture. In addition, the output activation function was omitted, because there was no evidence that it was essential for solving the problems on which LSTM had been tested so far.

The final modifications leading to the vanilla LSTM were made by Graves and Schmidhuber [20]. This version trains the LSTM with full backpropagation through time (BPTT) and reports results on the TIMIT benchmark. Using full BPTT has the additional advantage that LSTM gradients can be checked with finite differences, which makes practical implementations more reliable.

The vanilla LSTM is the most commonly used structure, but researchers have proposed other variations. Before full backpropagation training was introduced, Gers et al. [21] proposed a training method based on the extended Kalman filter, which allows the LSTM to be trained in some difficult cases at the cost of high computational complexity. Schmidhuber et al. [22] proposed replacing backpropagation with a hybrid evolution-based training method while retaining the LSTM structure.

Bayer et al. [23] evolved different LSTM block structures to maximize fitness on context-sensitive grammars. Sak et al. [24] proposed a linear projection layer that projects the output of the LSTM layer down before the recurrent and forward connections, reducing the number of parameters in LSTM networks with many blocks. Doetsch et al. [25] improved the performance of LSTM on an offline handwriting recognition data set by introducing a trainable scaling parameter for the slope of the gate activation functions. Otte et al. [26] improved the convergence rate of LSTM by adding recurrent connections between the gates of a single block (rather than between blocks).

Cho et al. [27] proposed a simplified variant of the LSTM structure called the gated recurrent unit (GRU). The GRU uses neither peephole connections nor an output activation function, and it couples the input gate and the forget gate into a single update gate. Finally, the GRU reset gate (which corresponds to the output gate of the LSTM) gates only the recurrent connections to the block input. This paper adopts this LSTM variant, the GRU, in place of the LSTM in NPI, which significantly improves the training speed of NPI.

3 The improvement of NPI

3.1 NPI model

The core of NPI is the long short-term memory (LSTM) network, which was proposed by Hochreiter and Schmidhuber. The LSTM plays a routing role between the current state and the previous hidden state.

As shown in Fig. 1, at time step t the NPI model takes the environment state \( e_{t} \) and the function arguments \( a_{t} \) as input to the encoder \( f_{\text{enc}} \), which produces the state \( s_{t} \). The program embedding \( p_{t} \) and \( s_{t} \) are then fed through an MLP into \( f_{\text{lstm}} \), which outputs the updated hidden state \( h_{t} \). The hidden state \( h_{t} \) is passed to three decoders. The \( f_{\text{prog}} \) decoder generates the embedding key of the next program. The \( f_{\text{end}} \) decoder generates the probability \( r_{t} \) that the program terminates; the termination threshold is set to 0.5 in this article. The \( f_{\arg } \) decoder produces the function arguments for the next time step, and the environment state at the next time step is then determined through the environment transition function. This is the operating principle of NPI.

Fig. 1 NPI model. The data flow is shown in this graph
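
The data flow of Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the dictionary `nets`, the toy stand-in functions, and the dot-product rule for matching the key against the program memory are hypothetical and are not taken from this article; the stand-ins are not trained networks.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def npi_step(e_t, a_t, p_t, h_prev, nets, program_keys, threshold=0.5):
    """One NPI inference step following the data flow of Fig. 1."""
    s_t = nets["f_enc"](e_t, a_t)         # encode environment state and arguments
    h_t = nets["core"](s_t, p_t, h_prev)  # MLP followed by the recurrent core
    k_t = nets["f_prog"](h_t)             # embedding key of the next program
    r_t = nets["f_end"](h_t)              # probability of termination
    a_next = nets["f_arg"](h_t)           # arguments for the next time step
    # Assumed lookup rule: choose the stored key with the largest dot product.
    scores = [sum(m * k for m, k in zip(key, k_t)) for key in program_keys]
    i_next = max(range(len(scores)), key=lambda j: scores[j])
    return h_t, i_next, a_next, r_t, r_t > threshold

# Toy stand-ins for the learned networks (hypothetical, for illustration only).
toy_nets = {
    "f_enc": lambda e, a: e + a,  # list concatenation
    "core": lambda s, p, h: [math.tanh(sum(s) + sum(p) + hk) for hk in h],
    "f_prog": lambda h: h[:2],
    "f_end": lambda h: sigmoid(h[0]),
    "f_arg": lambda h: list(h),
}

h_t, i_next, a_next, r_t, stop = npi_step(
    e_t=[0.0], a_t=[0.0], p_t=[0.0, 0.0], h_prev=[1.0, -1.0],
    nets=toy_nets, program_keys=[[1.0, 0.0], [0.0, 1.0]],
)
```

The environment transition that produces the next environment state is left outside `npi_step`, since it depends on the task.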

3.2 LSTM model

As shown in Fig. 2, the LSTM consists mainly of three gates: the input gate, the output gate, and the forget gate. First, the forget gate determines what information should be discarded from the cell state. Then the input gate determines what new information should be stored in the cell state. Finally, the output gate determines what information is passed on to the next time step.

Fig. 2 Structure of LSTM. The main structure of LSTM consists of three gates: the input gate, the output gate, and the forget gate

Unlike a conventional recurrent unit, which simply computes a nonlinear function of a weighted sum of its inputs, each LSTM unit maintains a memory \( c_{t} \) at time step t.

Hidden unit \( h_{t} \) at time node t:

$$ h_{t} = o_{t} \tanh (c_{t} ). $$

Output gate \( o_{t} \):

$$ o_{t} = \sigma \left( {W_{o} x_{t} + U_{o} h_{t - 1} + V_{o} c_{t} } \right). $$

The activation function \( \sigma \) is the sigmoid function. Cell memory \( c_{t} \):

$$ c_{t} = f_{t} c_{t - 1} + i_{t} \widetilde{c}_{t}. $$

New memory content \( \widetilde{c}_{t} \):

$$ \widetilde{c}_{t} = \tanh \left( {W_{c} x_{t} + U_{c} h_{t - 1} } \right). $$

Forget gate \( f_{t} \):

$$ f_{t} = \sigma \left( {W_{f} x_{t} + U_{f} h_{t - 1} + V_{f} c_{t - 1} } \right). $$

Input gate \( i_{t} \):

$$ i_{t} = \sigma (W_{i} x_{t} + U_{i} h_{t - 1} + V_{i} c_{t - 1} ). $$
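
As a rough illustration, the six equations above can be written as a single step function in plain Python. The sketch assumes that the peephole weights \( V_{i} \), \( V_{f} \), \( V_{o} \) act elementwise (that is, they are diagonal) and, like the equations, it omits bias terms; the helper names `affine` and `zero_matrix` and the dictionary layout of the parameters are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, U, h):
    """Compute W x + U h for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x)) + sum(u * hi for u, hi in zip(u_row, h))
        for w_row, u_row in zip(W, U)
    ]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with peephole connections, following Sect. 3.2."""
    n = len(h_prev)
    pre_i = affine(p["W_i"], x, p["U_i"], h_prev)
    pre_f = affine(p["W_f"], x, p["U_f"], h_prev)
    pre_c = affine(p["W_c"], x, p["U_c"], h_prev)
    pre_o = affine(p["W_o"], x, p["U_o"], h_prev)
    # Input and forget gates look at the previous memory c_{t-1}.
    i = [sigmoid(pre_i[k] + p["V_i"][k] * c_prev[k]) for k in range(n)]
    f = [sigmoid(pre_f[k] + p["V_f"][k] * c_prev[k]) for k in range(n)]
    c_tilde = [math.tanh(v) for v in pre_c]
    c = [f[k] * c_prev[k] + i[k] * c_tilde[k] for k in range(n)]
    # The output gate looks at the updated memory c_t.
    o = [sigmoid(pre_o[k] + p["V_o"][k] * c[k]) for k in range(n)]
    h = [o[k] * math.tanh(c[k]) for k in range(n)]
    return h, c

def zero_matrix(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Usage with all-zero weights: every gate equals 0.5 and the new content is 0.
n_in, n_hid = 3, 2
params = {}
for gate in "ifco":
    params["W_" + gate] = zero_matrix(n_hid, n_in)
    params["U_" + gate] = zero_matrix(n_hid, n_hid)
for gate in "ifo":
    params["V_" + gate] = [0.0] * n_hid
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights the memory is simply halved at each step, which makes the additive update \( c_{t} = f_{t} c_{t - 1} + i_{t} \widetilde{c}_{t} \) easy to check by hand.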

3.3 GRU model

In this article, we use the GRU, a variant of the LSTM, to replace the LSTM in the NPI. The GRU has gating units that regulate the flow of information within the unit, but it has no separate memory cell (Fig. 3).

Fig. 3 Structure of GRU, the LSTM variant used in this article to improve the LSTM structure in NPI

Hidden state \( h_{t} \):

$$ h_{t} = (1 - z_{t} )h_{t - 1} + z_{t} \widetilde{h}_{t} $$

Update gate \( z_{t} \), which determines how much of the hidden state is updated:

$$ z_{t} = \sigma \left( {W_{z} \cdot x_{t} + U_{z} h_{t - 1} } \right) $$

Candidate hidden state \( \widetilde{h}_{t} \):

$$ \widetilde{h}_{t} = \tanh (W \cdot x_{t} + U(r_{t} \otimes h_{t - 1} )) $$

Reset gate \( r_{t} \), which allows the GRU to forget the previously computed state:

$$ r_{t} = \sigma (W_{r} \cdot x_{t} + U_{r} \cdot h_{t - 1} ). $$
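
The four GRU equations can likewise be sketched as one step function. This is a minimal sketch, assuming that \( \otimes \) denotes elementwise multiplication and that bias terms are omitted as in the equations; the helper names and the parameter dictionary are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, U, h):
    """Compute W x + U h for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x)) + sum(u * hi for u, hi in zip(u_row, h))
        for w_row, u_row in zip(W, U)
    ]

def gru_step(x, h_prev, p):
    """One GRU step following Sect. 3.3."""
    z = [sigmoid(v) for v in affine(p["W_z"], x, p["U_z"], h_prev)]  # update gate
    r = [sigmoid(v) for v in affine(p["W_r"], x, p["U_r"], h_prev)]  # reset gate
    gated = [rk * hk for rk, hk in zip(r, h_prev)]                   # r_t (x) h_{t-1}
    h_tilde = [math.tanh(v) for v in affine(p["W"], x, p["U"], gated)]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zk) * hk + zk * ck for zk, hk, ck in zip(z, h_prev, h_tilde)]

def zero_matrix(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Usage with all-zero weights: z_t = 0.5 and the candidate state is 0.
n_in, n_hid = 3, 2
params = {}
for name in ("W_z", "W_r", "W"):
    params[name] = zero_matrix(n_hid, n_in)
for name in ("U_z", "U_r", "U"):
    params[name] = zero_matrix(n_hid, n_hid)
h = gru_step([1.0, 2.0, 3.0], [0.8, -0.4], params)
```

Compared with the LSTM step, there is no separate memory cell and no output gate: the full hidden state is returned directly.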

3.4 The difference between LSTM and GRU

The most significant feature that the LSTM unit and the GRU share, compared with the traditional recurrent unit, is the additive component of their update from time t to t + 1, which the traditional recurrent unit lacks. A traditional recurrent unit always replaces its content with a new value computed from the current input and the previous hidden state, whereas the LSTM unit and the GRU keep the existing content and add the new, filtered content on top of it. This additive nature has two advantages. First, each unit can easily remember the existence of a specific feature in the input stream over many steps. Any important feature, as decided by the forget gate of the LSTM unit or the update gate of the GRU, is not overwritten by new data but is kept as it is. Second, and more importantly, the addition effectively creates shortcut paths that bypass multiple time steps. These shortcuts allow the error to be propagated back without vanishing too quickly (if the gating unit is nearly saturated at 1) as a result of passing through multiple bounded nonlinearities, thus reducing the difficulty caused by vanishing gradients.

However, there are also differences between the LSTM unit and the GRU. One feature of the LSTM unit that is missing from the GRU is the controlled exposure of the memory content, which simplifies the computation to some extent. In the LSTM unit, the amount of memory content seen by other units is controlled by the output gate, whereas the GRU exposes its full content without any gate controlling it. Another difference lies in the location of the input gate of the LSTM unit and the corresponding reset gate of the GRU. The LSTM unit computes the new memory content without separately controlling the amount of information flowing from the previous time step; instead, it controls the amount of new memory content added to the memory cell independently of the forget gate. The GRU, in contrast, controls the information flow from the previous time step when it computes the new candidate activation, but it cannot independently control the amount of candidate activation that is added, because this control is tied to the update gate.
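
To make the simplification concrete, the number of weights implied by the equations of Sects. 3.2 and 3.3 can be counted directly. This is a back-of-the-envelope sketch: it assumes diagonal peephole weights and no bias terms, as in the equations, and the input size of 256 used in the example is an assumption chosen only for illustration.

```python
def lstm_param_count(n_in, n_hid):
    """Weights in the LSTM of Sect. 3.2: four (W, U) pairs plus three peephole vectors."""
    return 4 * n_hid * (n_in + n_hid) + 3 * n_hid

def gru_param_count(n_in, n_hid):
    """Weights in the GRU of Sect. 3.3: three (W, U) pairs."""
    return 3 * n_hid * (n_in + n_hid)

# Example with 256 hidden units; the input size of 256 is an assumed value.
lstm_weights = lstm_param_count(256, 256)
gru_weights = gru_param_count(256, 256)
ratio = gru_weights / lstm_weights
```

Under these assumptions a GRU layer has roughly three quarters of the weights of an LSTM layer of the same size, which is consistent with the GRU being cheaper to evaluate per step.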

3.5 The training of NPI

This section trains the improved NPI on input and output pairs. The input at step t is \( \varsigma_{t}^{\text{input}} \): {\( e_{t} ,i_{t} ,a_{t} \)} and the output is \( \varsigma_{t}^{\text{output}} \): {\( i_{t + 1} ,a_{t + 1} ,r_{t} \)}, where t indexes the time step in a sequence of length T, and \( i_{t} \) and \( i_{t + 1} \) are program IDs. Each ID corresponds to a key in the program memory, and this key is used to look up the subroutine to call at the next step.

$$ \theta^{*} = \arg \max_{\theta } \sum\limits_{{(\varsigma^{\text{input}} ,\varsigma^{\text{output}} )}} {\log P\left( {\varsigma^{\text{output}} \left| {\varsigma^{\text{input}} } \right.;\theta } \right)} $$

where \( \theta \) denotes the model parameters. The log-likelihood of a sequence factorizes over time steps:

$$ \log P\left( {\varsigma^{\text{output}} \left| {\varsigma^{\text{input}} } \right.;\theta } \right) = \sum\limits_{t = 1}^{T} {\log P\left( {\varsigma_{t}^{\text{output}} \left| {\varsigma_{1}^{\text{input}} \ldots \varsigma_{t}^{\text{input}} } \right.;\theta } \right)}. $$

Each \( h_{t} \) is passed to the three decoders. The \( f_{\text{prog}} \) decoder generates the embedding key \( k_{t} \), which is matched against the program memory to retrieve the subroutine to be executed at the next time step, in this case ACT. The \( f_{\text{end}} \) decoder generates the probability \( r_{t} \) that the program terminates; the termination threshold is set to 0.5 in this article. The \( f_{\arg } \) decoder produces the function arguments for the next time step, and the environment state at the next time step is determined through the environment transition function. The log-likelihood of each step therefore decomposes into three terms:

$$ \log P\left( {\varsigma_{t}^{\text{output}} \left| {\varsigma_{1}^{\text{input}} \ldots \varsigma_{t}^{\text{input}} } \right.} \right) = \log P\left( {i_{t + 1} |h_{t} } \right) + \log P(a_{t + 1} |h_{t} ) + \log P(r_{t} |h_{t} ). $$
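
The decomposition above can be sketched as follows. The sketch assumes, purely for illustration, that the next program and the next argument are each predicted by a categorical distribution and that termination is a Bernoulli variable; the function names and the tuple layout of a trace are hypothetical.

```python
import math

def step_log_likelihood(prog_probs, arg_probs, end_prob, i_next, a_next, r_t):
    """log P(i_{t+1}|h_t) + log P(a_{t+1}|h_t) + log P(r_t|h_t) for one step."""
    log_p_prog = math.log(prog_probs[i_next])  # next program ID
    log_p_arg = math.log(arg_probs[a_next])    # next argument (single categorical)
    log_p_end = math.log(end_prob if r_t == 1 else 1.0 - end_prob)  # termination
    return log_p_prog + log_p_arg + log_p_end

def trace_log_likelihood(steps):
    """Sum the per-step terms over t = 1..T for one execution trace."""
    return sum(step_log_likelihood(*step) for step in steps)

# Usage: a two-step trace with hand-picked decoder outputs and targets.
trace = [
    ([0.5, 0.5], [0.25, 0.75], 0.5, 0, 1, 0),
    ([0.1, 0.9], [0.5, 0.5], 0.8, 1, 0, 1),
]
total = trace_log_likelihood(trace)
```

Maximizing this quantity summed over all training traces corresponds to the objective for \( \theta^{*} \); in practice the negative of it would be minimized by the optimizer.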

This paper adopts an adaptive curriculum: the training examples for each batch are sampled with a frequency proportional to the model's current prediction error (Fig. 4).
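
One possible way to realize such error-proportional sampling is sketched here. The grouping of errors by program name, the small `floor` that keeps every group reachable, and the function names are assumptions made for illustration and are not details given in this article.

```python
import random

def sampling_weights(errors, floor=1e-3):
    """Turn current prediction errors into sampling probabilities."""
    clipped = {name: max(err, floor) for name, err in errors.items()}
    total = sum(clipped.values())
    return {name: err / total for name, err in clipped.items()}

def sample_program(errors, rng):
    """Draw one group of training examples with probability proportional to its error."""
    weights = sampling_weights(errors)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Usage: the group with the larger error is drawn three times as often on average.
errors = {"ADD": 0.3, "CARRY": 0.1}
weights = sampling_weights(errors)
picked = sample_program(errors, random.Random(0))
```

The errors would be re-estimated as training progresses, so groups that the model has already mastered are visited less and less often.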

Fig. 4 Addition model based on NPI

4 Experiments

4.1 Addition model

As shown in Fig. 4, the model adds the numbers 934 and 348. The arrows in the grid represent pointers, which can move left and right within the same row. The programs involved are LEFT, RIGHT, ADD, ACT, CARRY, and WRITE. In the leftmost grid of the figure, for example, the first subroutine ADD is executed. The core network, composed of an MLP and a GRU, produces the hidden state \( h_{t} \) of the current time step, which is passed to the three decoders. The \( f_{\text{prog}} \) decoder generates the embedding key used to look up the corresponding value in the program memory, that is, the subroutine to be executed at the next time step, here ACT. The \( f_{\text{end}} \) decoder generates the probability of terminating the program, which here is less than 0.5. The \( f_{\arg } \) decoder produces the function arguments \( a_{t + 1} \) for the next time step. The subsequent operations are similar to those in the first grid.

4.2 Results

In the NPI addition model, this paper adopts an experimental setup consistent with that of Reed and de Freitas, with a core of two GRU layers, each containing 256 hidden units. Adaptive moment estimation (Adam) is used for NPI training. In practice, Adam works well: compared with other adaptive learning rate algorithms, it converges faster and learns more effectively, and it can correct problems found in other optimization techniques, such as a vanishing learning rate, slow convergence, or high-variance parameter updates that cause the loss function to fluctuate greatly. The learning rate for NPI training is 0.0001, and the batch size is 1.
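
For reference, a single Adam update can be sketched as follows. Only the learning rate of 0.0001 comes from this article; the values of `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumed here, and the list-based function is a hypothetical illustration rather than the training code used in the experiments.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a flat list of parameters; t is the 1-based step count."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]      # first moment
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]  # second moment
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]                       # bias correction
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [
        th - lr * mh / (math.sqrt(vh) + eps)
        for th, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v

# Usage: on the first step the update size is close to the learning rate itself.
theta, m, v = adam_step([1.0], [0.5], [0.0], [0.0], t=1)
```

Because each update is normalized by the running estimate of the squared gradient, the step size stays on the order of the learning rate even when gradient magnitudes vary widely.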

The task in the NPI addition model is to read two numbers of up to 10 digits each and produce their sum. The goal is to learn to apply the addition and carry operations from right to left, as in the standard written addition algorithm.

In this environment, the network is given a grid in which to store intermediate results. As shown in the figure, there are four pointers: two for the input numbers, one for the carry, and one for the output. At each step, a pointer can move left or right, or a value can be written to the grid.
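
The grid environment and the right-to-left procedure can be illustrated with a hand-written reference routine. This sketch is not the learned NPI; it only shows what the four rows and pointers hold during the computation. The row names, the trace format, and the choice to move all four pointers together are assumptions made for illustration.

```python
def scratchpad_add(x, y):
    """Add two non-negative integers on a four-row grid, right to left."""
    width = max(len(str(x)), len(str(y))) + 1  # one extra column for the final carry
    rows = {
        "in1": [int(d) for d in str(x).zfill(width)],
        "in2": [int(d) for d in str(y).zfill(width)],
        "carry": [0] * width,
        "out": [0] * width,
    }
    ptr = {name: width - 1 for name in rows}  # all pointers start at the rightmost column
    trace = []
    for _ in range(width):
        total = (rows["in1"][ptr["in1"]] + rows["in2"][ptr["in2"]]
                 + rows["carry"][ptr["carry"]])
        rows["out"][ptr["out"]] = total % 10
        trace.append(("WRITE", "out", total % 10))
        if total >= 10:
            # The leftmost column only holds a carry, so its total never reaches 10.
            rows["carry"][ptr["carry"] - 1] = 1
            trace.append(("CARRY", 1))
        for name in ptr:
            ptr[name] = max(ptr[name] - 1, 0)  # move every pointer one column left
        trace.append(("LEFT",))
    return int("".join(str(d) for d in rows["out"])), rows, trace

# Usage: the example from Sect. 4.1.
result, rows, trace = scratchpad_add(934, 348)
```

In the NPI setting, the sequence of operations recorded in `trace` is the kind of execution trace the model is trained to reproduce, one subroutine call at a time.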

As shown in Fig. 5, at equal accuracy, training the LSTM-based NPI takes 105 min, and the GRU-based NPI improves on the training performance of the original LSTM-based NPI by nearly 33%.

Fig. 5 Training time: LSTM versus GRU. The training time of the GRU-based NPI is considerably shorter than that of the LSTM-based NPI

5 Conclusion

Compared with the traditional RNN, both the LSTM unit and the GRU retain their existing content and add filtered new content on top of it, which gives the model a memory function and allows subsequent tasks to build on what has already been learned. In this work, the LSTM in NPI is replaced by the GRU, which removes the controlled exposure of the memory content found in the LSTM unit and thus simplifies the computation to some extent. According to the experimental results, the performance of LSTM and GRU is roughly equivalent, and in some respects the performance of GRU even exceeds that of LSTM.