Abstract
A scheme for the systematic adaptation of the random-parameter distribution widths is introduced. Weights exiting the same input node are combined into a weight group, and the distribution widths of the weight groups are adjusted during training by a method similar to Manhattan updating. A practical algorithm is derived, and an empirical demonstration shows that irrelevant inputs are detected and effectively switched off. The whole scheme was inspired by and is akin to Neal’s and MacKay’s automatic relevance determination. It will therefore be referred to by the same name.
Reference
The term hyperparameter will be borrowed since the σ_g^rand are parameters of a prior probability distribution and are closely related to the hyperparameters α_g in MacKay’s work [39].
Defining σ as the exponential of ρ ensures that σ is always positive. The other reason for introducing ρ is that σ is a scale parameter. Since a non-informative prior for a scale parameter is uniform on a logarithmic scale (as discussed in Section 11.2), ρ is the natural parameter for any adaptation scheme.
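The two properties mentioned in this note can be illustrated with a minimal sketch. It assumes a generic sign-only (Manhattan-style) step on ρ; the chapter's actual update rule is not reproduced on this page, and the function names, the step size and the gradient values below are hypothetical.

```python
import math

def sigma_from_rho(rho):
    """Width sigma = exp(rho); positive for every real rho."""
    return math.exp(rho)

def manhattan_step(rho, grad, step=0.1):
    """Sign-only update: move rho by a fixed step against the gradient sign.

    Assumption: this is generic Manhattan updating, not the chapter's rule.
    """
    if grad > 0:
        return rho - step
    if grad < 0:
        return rho + step
    return rho

rho = 0.0
for grad in [2.5, 0.01, 300.0]:  # hypothetical values of dE/d(rho), all positive
    rho = manhattan_step(rho, grad)

sigma = sigma_from_rho(rho)
```

Because the step is fixed in ρ, every update changes σ by the same multiplicative factor regardless of the gradient magnitude, which fits the treatment of σ as a scale parameter.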
The nature of the inconsistency of scheme ARD2 (vide infra) becomes clearer when the update rule for the ρ_g is analysed. As will be shown shortly in (15.15), the gradient of E with respect to ρ depends on all the weights exiting the input units, that is, both the weights feeding into the S-layer and those feeding into the g-layer. However, as illustrated above and discussed in a more general way in [7], pp. 340–342, these weights scale differently when the training data are subjected to a linear transformation. Consequently, the sign of the gradient in (15.15), and hence the network’s ‘assumption’ about the significance of the different inputs, can change as the result of such a linear transformation. This is a striking inconsistency, since linear transformations of the data should lead to equivalent networks which differ only by the linear transformation of the weights.
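The equivalence invoked in the last sentence can be checked with a small sketch. It assumes a single linear unit with a bias rather than the chapter's S-layer and g-layer, which are not described on this page; all names and numbers are hypothetical.

```python
def unit_output(weights, bias, inputs):
    """Pre-activation of one linear unit: bias + sum_i w_i * x_i."""
    return bias + sum(w * x for w, x in zip(weights, inputs))

def transform_inputs(inputs, scales, shifts):
    """Per-input linear transformation x_i -> a_i * x_i + b_i."""
    return [a * x + b for x, a, b in zip(inputs, scales, shifts)]

def compensate_weights(weights, bias, scales, shifts):
    """Weights and bias that reproduce the original output on transformed inputs."""
    new_weights = [w / a for w, a in zip(weights, scales)]
    new_bias = bias - sum(w * b / a for w, a, b in zip(weights, scales, shifts))
    return new_weights, new_bias

weights, bias = [0.5, -2.0], 0.3      # hypothetical trained parameters
inputs = [1.0, 4.0]                   # hypothetical input pattern
scales, shifts = [10.0, 0.1], [3.0, -1.0]

original = unit_output(weights, bias, inputs)
new_weights, new_bias = compensate_weights(weights, bias, scales, shifts)
transformed = unit_output(new_weights, new_bias,
                          transform_inputs(inputs, scales, shifts))
```

The two outputs agree, although the weights differ by the factors 1/a_i. Any quantity that depends on the raw weight magnitudes, such as a gradient with respect to ρ, is therefore not automatically invariant under the transformation.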
The method of simple weight decay, with α_k = 0.01 for all weight groups, was applied for regularization; see Section 12.1.2 for details.
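As a rough illustration of group-wise weight decay, the sketch below assumes the common quadratic penalty (α_k / 2) Σ w² per weight group. The exact form used in Section 12.1.2 is not shown on this page, so the factor 1/2, the function names and the example weights are assumptions.

```python
def weight_decay_penalty(weight_groups, alphas):
    """Sum over groups k of (alpha_k / 2) * sum of squared weights in group k."""
    return sum(0.5 * alpha * sum(w * w for w in group)
               for group, alpha in zip(weight_groups, alphas))

def weight_decay_gradient(weight_groups, alphas):
    """Gradient of the penalty: alpha_k * w for each weight w in group k."""
    return [[alpha * w for w in group]
            for group, alpha in zip(weight_groups, alphas)]

weight_groups = [[1.0, -2.0], [0.5]]  # hypothetical weights in two groups
alphas = [0.01, 0.01]                 # alpha_k = 0.01 for all groups, as in the note

penalty = weight_decay_penalty(weight_groups, alphas)
gradient = weight_decay_gradient(weight_groups, alphas)
```

With a shared α_k every group is shrunk at the same rate; an ARD-style scheme differs in letting the per-group quantities adapt during training.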
Copyright information
© 1999 Springer-Verlag London Limited
Cite this chapter
Husmeier, D. (1999). Automatic Relevance Determination (ARD). In: Neural Networks for Conditional Probability Estimation. Perspectives in Neural Computing. Springer, London. https://doi.org/10.1007/978-1-4471-0847-4_15
DOI: https://doi.org/10.1007/978-1-4471-0847-4_15
Publisher Name: Springer, London
Print ISBN: 978-1-85233-095-8
Online ISBN: 978-1-4471-0847-4
eBook Packages: Springer Book Archive