Abstract
We present a simple trick for obtaining an approximate estimate of the weight decay parameter λ. The method combines early stopping and weight decay into the estimate

λ̂ = ||∇E(W_es)|| / (2 ||W_es||),

where W_es is the set of weights at the early stopping point and E(W) is the training data fit error. The estimate is demonstrated and compared to the standard cross-validation procedure for λ selection on one synthetic and four real-life data sets. The result is that λ̂ is as good an estimate of the optimal weight decay parameter value as the standard search estimate, but orders of magnitude quicker to compute. The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.
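The estimate can be sketched in a few lines of code. The snippet below is a minimal illustration, assuming the estimate has the form λ̂ = ||∇E(W_es)|| / (2 ||W_es||), i.e. the ratio implied by the stationarity condition ∇E(W) + 2λW = 0 of the weight-decay cost E(W) + λ WᵀW; the linear model and the synthetic data are placeholders, not the networks used in the chapter.

```python
import numpy as np

def estimate_weight_decay(w_es, grad_e_es):
    """Estimate lambda from the early-stopping weights w_es and the
    gradient of the training fit error E evaluated at w_es."""
    return np.linalg.norm(grad_e_es) / (2.0 * np.linalg.norm(w_es))

# Toy usage with a linear model, E(w) = ||Xw - y||^2, on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

w_es = rng.normal(size=3)          # stand-in for weights at the early stopping point
grad = 2.0 * X.T @ (X @ w_es - y)  # gradient of the squared-error fit at w_es

lam = estimate_weight_decay(w_es, grad)
print(lam)  # a single non-negative scalar, cheap to compute once training stops
```

Because the gradient at the early stopping point is already available from the training run, the estimate costs essentially nothing beyond the two norm computations, which is what makes it orders of magnitude cheaper than a cross-validation search over λ.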
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
Cite this chapter
Rognvaldsson, T.S. (1998). A Simple Trick for Estimating the Weight Decay Parameter. In: Orr, G.B., Müller, KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 1524. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49430-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65311-0
Online ISBN: 978-3-540-49430-0