
1 Introduction

Several forms of deep Recurrent Neural Network (RNN) architectures, such as the LSTM [7] and the GRU [2], have achieved state-of-the-art results in many sequential classification tasks [3, 5, 6, 14, 15, 17] over the past few years. The number of stacked RNN layers, i.e. the network depth, is of key importance for extending the ability of the architecture to express more complex dynamic systems [1, 12]. However, training deeper networks poses problems that are yet to be solved.

In this paper, we suggest an approach that breaks the optimization process into several learning phases, each of which trains an architecture one layer deeper than the previous one. In this way, we gradually extend the network depth while training it, reducing the deleterious effects of degradation and of backpropagation problems. Additionally, by adjusting the training scheme (mainly the regularization) at every learning phase, we are able to improve the network performance even further.

2 Gradual Learning

2.1 Notation

Let us represent a network with l layers as a mapping from an input sequence \(X \in \mathcal {X}\) to an output sequence \(\hat{Y}_l \in \mathcal {Y}\) by \(\hat{Y}_l = S_l \circ f_{l} \circ f_{l-1} \circ \dots \circ f_1 (X ; \varTheta _l) \), where \(\varTheta _l = \{\theta _1, \dots , \theta _{l}, \theta _{S_l}\}\) denotes the network parameters, such that \(\theta _k\) are the parameters of the \(\text {k}^\text {th}\) layer and \(\theta _{S_l}\) are the parameters of the output mapping \(S_l\). We also write \(\theta ^l = \{\theta _1, \dots , \theta _{l}\}\) for the parameters of the first l layers, and define the \(\text {l}^\text {th}\) layer cost function by \(J(\varTheta _l) = \mathrm {cost} (\hat{Y}_l, Y)\). Next, we define the gradient of \(J(\varTheta )\) with respect to the network parameters by \(\mathbf {g} = \frac{\partial }{\partial \varTheta } J(\varTheta )\), and its gradient with respect to the \(\text {k}^\text {th}\) layer parameters by \(\mathbf {g}_k = \frac{\partial }{\partial \theta _k} J(\varTheta )\).
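
To make the notation concrete, the following is a minimal PyTorch sketch of an l-layer network viewed as the composition \(S_l \circ f_l \circ \dots \circ f_1\). The class name and hyper-parameters are illustrative assumptions for exposition, not the exact architecture used in our experiments.

```python
import torch.nn as nn

class StackedRNN(nn.Module):
    """Implements Y_hat_l = S_l(f_l(... f_1(X) ...)) for l = num_layers."""

    def __init__(self, input_size, hidden_size, vocab_size, num_layers):
        super().__init__()
        # f_1, ..., f_l: one recurrent layer per f_k, with parameters theta_k
        self.layers = nn.ModuleList(
            [nn.LSTM(input_size if k == 0 else hidden_size,
                     hidden_size, batch_first=True)
             for k in range(num_layers)]
        )
        # S_l: output mapping with parameters theta_{S_l}
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        t = x                       # T_0 = X, shape (batch, seq_len, input_size)
        for layer in self.layers:   # T_k = f_k(T_{k-1}; theta_k)
            t, _ = layer(t)
        return self.output(t)       # Y_hat_l = S_l(T_l; theta_{S_l})
```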

2.2 Theoretical Motivation

The structure of a neural network imposes a sequential processing scheme on its input, which constitutes the Markov chain \(Y-X-T_1-T_2-\dots -T_L\), where \(T_l\) denotes the state of the \(\text {l}^\text {th}\) layer. The goal is to estimate \(P_{Y|T_L}\left( y|t\right) \) by \(Q_{Y|T_L}^\varTheta \left( y|t\right) \). Driven by the Markov relation, we state two theorems (proofs are omitted due to space constraints).

Theorem 1

(Maximum Likelihood Estimator (MLE) and minimal negative log-likelihood). Given a training set of N examples \(S = \left\{ (x_i,y_i)\right\} _{i=1}^N\) drawn i.i.d. from an unknown distribution \(P_{X,Y} = P_X P_{Y|X}\), the MLE of \(Q_{Y|T_L}^\varTheta \) is given by \(P_{Y|X}\), and the optimal value of the criterion is \(H(Y|X)\).
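
A sketch of the standard argument: for the cross-entropy cost, as \(N \rightarrow \infty \) the empirical negative log-likelihood converges to

$$\begin{aligned} \mathbb {E}\left[ -\log Q_{Y|T_L}^\varTheta (Y|T_L)\right] = H(Y|T_L) + \mathbb {E}_{T_L}\left[ D_{\mathrm {KL}}\left( P_{Y|T_L} \,\Vert \, Q_{Y|T_L}^\varTheta \right) \right] \ge H(Y|T_L) \ge H(Y|X), \end{aligned}$$

where the last inequality follows from the Markov chain \(Y-X-T_L\). The lower bound \(H(Y|X)\) is attained exactly when \(Q_{Y|T_L}^\varTheta (\cdot \,|\,t_L(x)) = P_{Y|X}(\cdot \,|\,x)\), i.e. when the estimator recovers \(P_{Y|X}\).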

Theorem 2

If \(Q_{Y|T_L}^\varTheta \) satisfies the optimality conditions of Theorem 1, then \(I(X;Y) = I(T_l;Y)\quad \forall l=1,\dots ,L\).

We show that satisfying the optimality conditions of Theorem 1 implies that no relevant information about Y is lost in processing X into \(T_L\). In particular, we show that a necessary condition for achieving the MLE is that the network states, namely \(\left\{ T_l\right\} _{l=1}^L\), satisfy \(I(Y;X)=I(Y;T_l)\) for all \(l\).
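
The argument for Theorem 2 follows the same lines: by the data processing inequality applied along the chain \(Y-X-T_1-\dots -T_L\),

$$\begin{aligned} I(X;Y) \ge I(T_1;Y) \ge I(T_2;Y) \ge \dots \ge I(T_L;Y), \end{aligned}$$

and at the optimum of Theorem 1 the two ends coincide, \(I(T_L;Y) = I(X;Y)\), which forces every intermediate inequality to hold with equality.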

2.3 Implementation

Since shallow networks are easier to train, we propose a greedy training scheme that breaks the optimization process into L phases (one per layer), optimizing \(J(\varTheta _l)\) sequentially as l increases from 1 to L. The training scheme is depicted in Fig. 1.

Fig. 1. Depiction of our training scheme for a 3-layer network. In phase 1 we optimize the parameters of layer 1 according to cost 1. In phase 2 we add layer 2 to the network and optimize the parameters of layers 1 and 2, where layer 1 is copied from phase 1 and layer 2 is initialized randomly. In phase 3 we add layer 3 and optimize all of the network's parameters, where layers 1 and 2 are copied from phase 2 and layer 3 is initialized randomly.
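
As a concrete illustration of the scheme in Fig. 1, the following sketch trains the StackedRNN from the code sketch of Sect. 2.1 phase by phase: the layers trained in phase l-1 are copied into the phase-l model and the new top layer starts from a random initialization. The function name, the optimizer choice, and all hyper-parameters are illustrative assumptions, not the exact training setup of our experiments.

```python
import torch
import torch.nn as nn

def gradual_training(data, input_size, hidden_size, vocab_size,
                     num_phases=3, epochs_per_phase=1, lr=1.0):
    """Greedy phase-wise optimization of J(Theta_l) for l = 1..num_phases."""
    criterion = nn.CrossEntropyLoss()
    prev_layers = None
    model = None
    for l in range(1, num_phases + 1):
        # Build an l-layer network; its new top layer is randomly initialized.
        model = StackedRNN(input_size, hidden_size, vocab_size, num_layers=l)
        if prev_layers is not None:
            # Copy layers 1..l-1 from the model trained in the previous phase.
            for k in range(l - 1):
                model.layers[k].load_state_dict(prev_layers[k].state_dict())
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs_per_phase):
            for x, y in data:                 # x: embedded inputs, y: target tokens
                optimizer.zero_grad()
                logits = model(x)             # (batch, seq_len, vocab_size)
                loss = criterion(logits.transpose(1, 2), y)   # J(Theta_l)
                loss.backward()
                optimizer.step()
        prev_layers = model.layers
    return model
```

Note that each phase optimizes all parameters trained so far, i.e. the earlier layers continue to be updated rather than being frozen.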

3 Layer-Wise Gradient Clipping (LWGC)

Previous studies [4, 8, 13] have shown that covariate shift has a negative effect on the training of deep neural architectures. Internal covariate shift is the change in the distribution of a layer's inputs during training, caused by updates to the preceding layers. We suggest that treating the gradient vector of each layer's weights individually, i.e. clipping the gradients layer-wise, can significantly reduce internal covariate shift. LWGC for a network with L distinct layers is formulated as

$$\begin{aligned} \left[ \hat{\mathbf {g}}_1^T, \dots , \hat{\mathbf {g}}_L^T \right] ^T := \left[ \frac{\mu _1}{\max (\mu _1,\left\| \mathbf {g}_1\right\| )}\mathbf {g}_1^T, \dots , \frac{\mu _L}{\max (\mu _L,\left\| \mathbf {g}_L\right\| )}\mathbf {g}_L^T \right] ^T, \end{aligned}$$
(1)

where \(\mu _k\) is the maximal permitted norm of the \(\text {k}^\text {th}\) layer's gradient \(\mathbf {g}_k\).
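
In practice, Eq. (1) amounts to applying the usual max-norm clipping separately to each layer's gradient instead of to the concatenated gradient of all parameters. A minimal sketch in PyTorch, assuming the parameters have been split into per-layer groups and that the thresholds \(\mu _1, \dots , \mu _L\) are chosen by the user (the helper below is illustrative, not our released code):

```python
import torch

def layer_wise_grad_clip(layer_param_groups, thresholds):
    """Eq. (1): rescale each layer's gradient g_k by mu_k / max(mu_k, ||g_k||)."""
    for params, mu_k in zip(layer_param_groups, thresholds):
        # clip_grad_norm_ rescales the gradients of this group in place so that
        # their joint L2 norm does not exceed mu_k
        torch.nn.utils.clip_grad_norm_(list(params), max_norm=mu_k)
```

The call is placed between loss.backward() and optimizer.step(); for the StackedRNN sketch above, the groups could be, e.g., [list(layer.parameters()) for layer in model.layers] + [list(model.output.parameters())].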

4 Experiments

We present results on the Penn Treebank (PTB) dataset from the field of natural language processing, used as a word-level language modeling benchmark.

We trained two models in our experiments: a reference model and a GL-LWGC LSTM model used to evaluate the performance of our methods. Our GL-LWGC LSTM model was already comparable to state-of-the-art results with only two layers and 19M parameters, and achieved state-of-the-art results at the third layer phase. Results of the reference model and the GL-LWGC LSTM model are shown in Table 1.

Table 1. Single-model validation and test perplexity on the PTB dataset