
1 Introduction

Several forms of deep Recurrent Neural Network (RNN) architectures, such as the LSTM [7] and the GRU [2], have achieved state-of-the-art results in many sequential classification tasks [3, 5, 6, 14, 15, 17] over the past few years. The number of stacked RNN layers, i.e. the network depth, is of key importance for extending the ability of the architecture to express more complex dynamic systems [1, 12]. However, training deeper networks poses problems that are yet to be solved.

In this paper, we suggest an approach that breaks the optimization process into several learning phases, each of which trains an architecture one layer deeper than the previous one. In this way, we gradually extend the network depth while training it, reducing the deleterious effects of degradation and of backpropagation problems. Additionally, by adjusting the training scheme (mainly the regularization) at every learning phase, we are able to improve the network performance even further.

2 Gradual Learning

2.1 Notation

Let us represent a network with l layers as a mapping from an input sequence \(X \in \mathcal {X}\) to an output sequence \(\hat{Y}_l \in \mathcal {Y}\) by \(\hat{Y}_l = S_l \circ f_{l} \circ f_{l-1} \circ \dots \circ f_1 (X ; \varTheta _l) \), where \(\varTheta _l = \{\theta _1, \dots , \theta _{l}, \theta _{S_l}\}\) denotes the network parameters, such that \(\theta _k\) are the parameters of the \(\text {k}^\text {th}\) layer and \(\theta _{S_l}\) are the parameters of the output mapping \(S_l\). We also write \(\theta ^l = \{\theta _1, \dots , \theta _{l}\}\) for the parameters of the first l layers, and define the \(\text {l}^\text {th}\) layer cost function by \(J(\varTheta _l) = \mathrm {cost} (\hat{Y}_l, Y)\). Next, we define the gradient of \(J(\varTheta )\) with respect to the network parameters by \(\mathbf {g} = \frac{\partial }{\partial \varTheta } J(\varTheta )\), and its gradient with respect to the \(\text {k}^\text {th}\) layer parameters by \(\mathbf {g}_k = \frac{\partial }{\partial \theta _k} J(\varTheta )\).
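
To make the notation concrete, the following is a minimal PyTorch sketch of an l-layer network viewed as the composition \(S_l \circ f_l \circ \dots \circ f_1\). The class name and hyper-parameters are illustrative assumptions for exposition, not the exact architecture used in our experiments.

```python
import torch.nn as nn

class StackedRNN(nn.Module):
    """Implements Y_hat_l = S_l(f_l(... f_1(X) ...)) for l = num_layers."""

    def __init__(self, input_size, hidden_size, vocab_size, num_layers):
        super().__init__()
        # f_1, ..., f_l: one recurrent layer per f_k, with parameters theta_k
        self.layers = nn.ModuleList(
            [nn.LSTM(input_size if k == 0 else hidden_size,
                     hidden_size, batch_first=True)
             for k in range(num_layers)]
        )
        # S_l: output mapping with parameters theta_{S_l}
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        t = x                       # T_0 = X, shape (batch, seq_len, input_size)
        for layer in self.layers:   # T_k = f_k(T_{k-1}; theta_k)
            t, _ = layer(t)
        return self.output(t)       # Y_hat_l = S_l(T_l; theta_{S_l})
```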

2.2 Theoretical Motivation

The structure of a neural network imposes a sequential processing scheme on its input, which constitutes the Markov chain \(Y-X-T_1-T_2-\dots -T_L\), where \(T_l\) denotes the state of the \(\text {l}^\text {th}\) layer. The goal is to estimate \(P_{Y|T_L}\left( y|t\right) \) by \(Q_{Y|T_L}^\varTheta \left( y|t\right) \). Driven by the Markov relation, we state two theorems (proofs are omitted due to space constraints).

Theorem 1

(Maximum Likelihood Estimator (MLE) and minimal negative log-likelihood). Given a training set of N examples \(S = \left\{ (x_i,y_i)\right\} _{i=1}^N\) drawn i.i.d. from an unknown distribution \(P_{X,Y} = P_X P_{Y|X}\), the MLE of \(Q_{Y|T_L}^\varTheta \) is given by \(P_{Y|X}\), and the optimal value of the criterion is \(H(Y|X)\).
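
A sketch of the standard argument: for the cross-entropy cost, as \(N \rightarrow \infty \) the empirical negative log-likelihood converges to

$$\begin{aligned} \mathbb {E}\left[ -\log Q_{Y|T_L}^\varTheta (Y|T_L)\right] = H(Y|T_L) + \mathbb {E}_{T_L}\left[ D_{\mathrm {KL}}\left( P_{Y|T_L} \,\Vert \, Q_{Y|T_L}^\varTheta \right) \right] \ge H(Y|T_L) \ge H(Y|X), \end{aligned}$$

where the last inequality follows from the Markov chain \(Y-X-T_L\). The lower bound \(H(Y|X)\) is attained exactly when \(Q_{Y|T_L}^\varTheta (\cdot \,|\,t_L(x)) = P_{Y|X}(\cdot \,|\,x)\), i.e. when the estimator recovers \(P_{Y|X}\).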

Theorem 2

If \(Q_{Y|T_L}^\varTheta \) satisfies the optimality conditions of Theorem 1, then \(I(X;Y) = I(T_l;Y)\quad \forall l=1,\dots ,L\).

We show that satisfying the optimality conditions of Theorem 1 implies that no relevant information about Y is lost in processing X into \(T_L\). In particular, we show that a necessary condition for achieving the MLE is that the network states, namely \(\left\{ T_l\right\} _{l=1}^L\), satisfy \(I(Y;X)=I(Y;T_l)\) for all \(l\).
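
The argument for Theorem 2 follows the same lines: by the data processing inequality applied along the chain \(Y-X-T_1-\dots -T_L\),

$$\begin{aligned} I(X;Y) \ge I(T_1;Y) \ge I(T_2;Y) \ge \dots \ge I(T_L;Y), \end{aligned}$$

and at the optimum of Theorem 1 the two ends coincide, \(I(T_L;Y) = I(X;Y)\), which forces every intermediate inequality to hold with equality.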

2.3 Implementation

Since shallow networks are easier to train, we propose a greedy training scheme that breaks the optimization process into L phases (one per layer), optimizing \(J(\varTheta _l)\) sequentially as l increases from 1 to L. The training scheme is depicted in Fig. 1.

Fig. 1. Depiction of our training scheme for a 3-layer network. In phase 1 we optimize the parameters of layer 1 according to cost 1. In phase 2 we add layer 2 to the network and optimize the parameters of layers 1 and 2, where layer 1 is copied from phase 1 and layer 2 is initialized randomly. In phase 3 we add layer 3 and optimize all of the network's parameters, where layers 1 and 2 are copied from phase 2 and layer 3 is initialized randomly.
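
As a concrete illustration of the scheme in Fig. 1, the following sketch trains the StackedRNN from the code sketch of Sect. 2.1 phase by phase: the layers trained in phase l-1 are copied into the phase-l model and the new top layer starts from a random initialization. The function name, the optimizer choice, and all hyper-parameters are illustrative assumptions, not the exact training setup of our experiments.

```python
import torch
import torch.nn as nn

def gradual_training(data, input_size, hidden_size, vocab_size,
                     num_phases=3, epochs_per_phase=1, lr=1.0):
    """Greedy phase-wise optimization of J(Theta_l) for l = 1..num_phases."""
    criterion = nn.CrossEntropyLoss()
    prev_layers = None
    model = None
    for l in range(1, num_phases + 1):
        # Build an l-layer network; its new top layer is randomly initialized.
        model = StackedRNN(input_size, hidden_size, vocab_size, num_layers=l)
        if prev_layers is not None:
            # Copy layers 1..l-1 from the model trained in the previous phase.
            for k in range(l - 1):
                model.layers[k].load_state_dict(prev_layers[k].state_dict())
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs_per_phase):
            for x, y in data:                 # x: embedded inputs, y: target tokens
                optimizer.zero_grad()
                logits = model(x)             # (batch, seq_len, vocab_size)
                loss = criterion(logits.transpose(1, 2), y)   # J(Theta_l)
                loss.backward()
                optimizer.step()
        prev_layers = model.layers
    return model
```

Note that each phase optimizes all parameters trained so far, i.e. the earlier layers continue to be updated rather than being frozen.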

3 Layer-Wise Gradient Clipping (LWGC)

Previous studies [4, 8, 13] have shown that covariate shift has a negative effect on the training of deep neural architectures. Internal covariate shift is the change in the distribution of a layer's inputs during training, caused by updates to the preceding layers. We suggest that treating the gradient vector of each layer's weights individually, i.e. clipping the gradients layer-wise, can significantly reduce internal covariate shift. LWGC for a network with L distinct layers is formulated as

$$\begin{aligned} \left[ \hat{\mathbf {g}}_1^T, \dots , \hat{\mathbf {g}}_L^T \right] ^T := \left[ \frac{\mu _1}{\max (\mu _1,\left\| \mathbf {g}_1\right\| )}\mathbf {g}_1^T, \dots , \frac{\mu _L}{\max (\mu _L,\left\| \mathbf {g}_L\right\| )}\mathbf {g}_L^T \right] ^T, \end{aligned}$$
(1)

where \(\mu _k\) is the maximal permitted norm of the \(\text {k}^\text {th}\) layer's gradient \(\mathbf {g}_k\).
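
In practice, Eq. (1) amounts to applying the usual max-norm clipping separately to each layer's gradient instead of to the concatenated gradient of all parameters. A minimal sketch in PyTorch, assuming the parameters have been split into per-layer groups and that the thresholds \(\mu _1, \dots , \mu _L\) are chosen by the user (the helper below is illustrative, not our released code):

```python
import torch

def layer_wise_grad_clip(layer_param_groups, thresholds):
    """Eq. (1): rescale each layer's gradient g_k by mu_k / max(mu_k, ||g_k||)."""
    for params, mu_k in zip(layer_param_groups, thresholds):
        # clip_grad_norm_ rescales the gradients of this group in place so that
        # their joint L2 norm does not exceed mu_k
        torch.nn.utils.clip_grad_norm_(list(params), max_norm=mu_k)
```

The call is placed between loss.backward() and optimizer.step(); for the StackedRNN sketch above, the groups could be, e.g., [list(layer.parameters()) for layer in model.layers] + [list(model.output.parameters())].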

4 Experiments

We present results on the Penn Treebank (PTB) dataset from the field of natural language processing, used as a word-level language modeling benchmark.

We trained two models in our experiments: a reference model and a GL-LWGC LSTM model used to evaluate the performance of our methods. Our GL-LWGC LSTM model was already comparable to state-of-the-art results with only two layers and 19M parameters, and achieved state-of-the-art results at the third layer phase. Results of the reference model and the GL-LWGC LSTM model are shown in Table 1.

Table 1. Single-model validation and test perplexity on the PTB dataset