An Empirically-Sourced Heuristic for Predetermining the Size of the Hidden Layer of a Multi-layer Perceptron for Large Datasets

Lunt, Amanda; Xu, Shuxiang

doi:10.1007/978-3-319-50127-7_47


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9992)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence


Abstract

We recommend a guiding heuristic for locating a sufficiently sized multilayer perceptron (MLP) for larger datasets. Intended to narrow the search scope, it is based on an experimental comparison of 14 existing approaches against global minimum ranges on 31 larger datasets. The most consistent performer was Baum’s [1] equation, which sets the number of hidden neurons equal to the square root of the number of training instances.


1 Introduction

Trained under supervision, a 3-layer multilayer perceptron (MLP) will find ‘hidden’ relationships within a set of data by approximating continuous functions [2]. The trained network may then be used for prediction tasks on previously unseen data from the same domain, with the final configuration unique to each specific dataset. The size of the hidden middle layer, N^h, has a strong bearing on the prediction accuracy of the final model [3], yet the predominant technique for locating N^h is resizing through trial and error. Exhaustive search through a range of N^h becomes problematic with larger datasets, because it increases the demands on processor capacity and extends the time required for training. The heuristic proposed in this paper is useful because it narrows the scope of the search for a suitably optimal network size. We note that a reasonable network architecture may not be limited to a single ‘correct’ configuration [4], so long as the underlying function can be learnt while the network remains small enough to generalise [5].

A set of proposed mathematical relationships between N^h and the numbers of input neurons, N^i, and output neurons, N^o (both fixed), and the number of instances of the dataset used for training, N^tr, is summarised in Table 1. We used N^tr for our calculations rather than N^TOT, the total number of instances, as it relates directly to the training process.

Table 1. Fourteen ways to determine N^h. Approach numbers were assigned randomly.


1.1 Research Question

Which of the existing approaches can assist the search for a suitable number of neurons in the single hidden layer of an MLP for larger datasets?

2 Experiment

Our simple experiment investigates the performance of each approach when compared with global minimum benchmarks [19]. Thirty-one datasets with many attribute-target pairs or high dimensionality were sourced [20–22] (see Table 2).

Table 2. Characteristics of 31 datasets. Most are from http://archive.ics.uci.edu/ml/datasets except (b) http://mldata.org and (c) http://osmot.cs.cornell.edu/kddcup. Larger sets were excluded as too slow to train with available resources.


A lower and an upper limit for N^h were established for the training of each dataset, based on calculations from the approaches in Table 1. We set the lower bound at the calculation closest to 0, while the upper bound was based on a sense of being able to train to that N^h, with flexibility to extend it where processor capacity allowed. Where an approach takes the form of a lower or upper bound, the calculated N^h at the bound was used.

Weights were initialised randomly to represent prior knowledge [23]. Training, test and validation sets (70-15-15% of N^TOT) were also randomly generated for the best opportunity to locate the global minimum [24]. Each network size was trained 10 times with cross-validation to account for random influences [25]. We performed our experiment using the patternnet function of the MATLAB Neural Network Toolbox (version 6) add-on with the scaled conjugate gradient backpropagation algorithm [26, 27].
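The protocol described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the MATLAB code used in the experiment: `split_indices` and `repeated_errors` are hypothetical helper names, `train_and_score` is a placeholder for whatever routine trains one network and returns its error, and drawing a fresh random split for every run is an assumption.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Split = Tuple[List[int], List[int], List[int]]


def split_indices(n_total: int, rng: random.Random) -> Split:
    """Randomly partition range(n_total) into 70/15/15 train, test, validation."""
    indices = list(range(n_total))
    rng.shuffle(indices)
    n_train = round(0.70 * n_total)
    n_test = round(0.15 * n_total)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder, roughly 15%
    return train, test, validation


def repeated_errors(
    hidden_sizes: Sequence[int],
    n_total: int,
    train_and_score: Callable[[int, List[int], List[int], List[int]], float],
    runs: int = 10,
    seed: int = 0,
) -> Dict[int, List[float]]:
    """Collect `runs` error values for every candidate hidden-layer size."""
    rng = random.Random(seed)
    errors: Dict[int, List[float]] = {}
    for n_hidden in hidden_sizes:
        errors[n_hidden] = []
        for _ in range(runs):
            # Assumption: a fresh random split is drawn for each run.
            train, test, validation = split_indices(n_total, rng)
            errors[n_hidden].append(train_and_score(n_hidden, train, test, validation))
    return errors
```

Passing the training routine in as a callable keeps the sketch independent of any particular neural network library.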

3 Results

The global minimum for each dataset was located at the N^h with the smallest averaged performance error, measured as the mean of squared errors between the actual output and the desired output [25]. Approaches (3) and (1) each calculated the global minimum N^h exactly in one case, for WallFollRobot2 and AdultIncome respectively.
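As an illustration of this selection step, the following minimal Python sketch picks the hidden-layer size with the smallest mean error over repeated runs. The function names and the dictionary layout (hidden-layer size mapped to a list of per-run errors) are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, List, Sequence


def mean_squared_error(actual: Sequence[float], desired: Sequence[float]) -> float:
    """Mean of squared differences between actual and desired outputs."""
    return mean([(a - d) ** 2 for a, d in zip(actual, desired)])


def global_minimum_size(errors_by_size: Dict[int, List[float]]) -> int:
    """Return the hidden-layer size whose averaged error is smallest."""
    return min(errors_by_size, key=lambda n_hidden: mean(errors_by_size[n_hidden]))
```

For example, given `{5: [0.30, 0.32], 10: [0.20, 0.22], 20: [0.25, 0.21]}`, the averaged errors are 0.31, 0.21 and 0.23, so the size 10 is selected.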

Not all approaches gave us a sensible calculation for N^h for every dataset. We obtained a result for all 31 datasets with approaches (4), (5) and (7) only. Table 3 combines this raw count [A] with the count of datasets where performance at the approach’s calculated N^h intersects with the global minimum (95% CI) in a multiple comparison of means [B].

Table 3. An excerpt of the simple ranking of approaches according to relative usefulness, ordered from ‘most’ useful and truncated for brevity. [B] was scaled in the final column to indicate its relationship to the research question, with no impact on the final rank.


Figure 1 gives an overview of two further comparisons with the performance range at the global minimum N^h. Single diamonds are derived from the count of individual performance measures for an approach that fall within the global minimum range over all datasets. Approach (3), \( N^{h} = \sqrt{N^{tr}} \), stands out here: its individual measures fell within the global minimum range 71.7% of the time across all datasets.
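The count behind the single diamonds can be sketched as follows. This is an illustrative Python sketch that assumes the global minimum range is the interval between the smallest and largest performance measures recorded at the global minimum N^h; `fraction_within_range` is a hypothetical name.

```python
from typing import Sequence


def fraction_within_range(measures: Sequence[float], reference: Sequence[float]) -> float:
    """Share of `measures` lying inside the [min, max] interval of `reference`."""
    low, high = min(reference), max(reference)
    hits = sum(1 for m in measures if low <= m <= high)
    return hits / len(measures)
```

Here `measures` holds the individual errors at an approach's calculated N^h, and `reference` holds the errors recorded at the global minimum N^h.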

The second set of comparisons is presented as bar graphs that have been separated into N^i groupings to allow for disparity in attribute dimensionality across the datasets: \( N^{i} \le 10; \) \( 10 < N^{i} \le 50; \) \( 50 < N^{i} \le 100; \) and \( N^{i} > 100. \) The ratio shown is the percentage of times the average of the 10 performance measures at each pre-calculated N^h occurred within the range of performances recorded at the global minimum N^h, grouped by N^i. Approaches (5) and (14) were both highly successful in the \( 50 < N^{i} \le 100 \) group (4 out of 5 cases), while the averages for (3) and (8) occurred within the global minimum range in 5 out of the 7 cases in the \( N^{i} > 100 \) group.

Also of note, the averages for approaches (2), (3), (8) and (9) fell within the global minimum range in 3 of the 5 cases in the \( 50 < N^{i} \le 100 \) group. The results for the two groups where \( N^{i} \le 50 \) (the remaining 19 datasets) were no better than 50%.
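The grouped comparison can be sketched in Python as below. This is a sketch under assumptions: the tuple layout of each dataset record and the names `input_group` and `grouped_success` are invented for illustration, and the global minimum range is again taken as the min-max interval of the measures at the global minimum N^h.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Sequence, Tuple

Record = Tuple[int, Sequence[float], Sequence[float]]


def input_group(n_inputs: int) -> str:
    """Map a number of input neurons N^i to one of the four groupings."""
    if n_inputs <= 10:
        return "N^i <= 10"
    if n_inputs <= 50:
        return "10 < N^i <= 50"
    if n_inputs <= 100:
        return "50 < N^i <= 100"
    return "N^i > 100"


def grouped_success(datasets: Iterable[Record]) -> Dict[str, float]:
    """Per group, the share of datasets whose averaged error at the
    pre-calculated N^h lies within the range at the global minimum N^h.

    Each record is (n_inputs, measures_at_calculated_size, measures_at_global_minimum).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [hits, total]
    for n_inputs, measures, reference in datasets:
        group = input_group(n_inputs)
        counts[group][1] += 1
        if min(reference) <= mean(measures) <= max(reference):
            counts[group][0] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}
```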

4 Discussion and Conclusion

We empirically determined a single, optimal structure between lower and upper bounds for N^h for each dataset, comparing the performance of each approach with the range at this global minimum in several ways.

For every approach other than (3), individual measurements fell within the datasets’ global minimum ranges in 50% or fewer cases. Approach (3)’s consistency (over 71%) is notable given the variation among the 31 datasets.

With averaged performances, the 80% success of approaches (5) and (14) where \( 50 < N^{i} \le 100 \) is tempered by there being only 5 datasets in that group. In the initial ranking according to relative usefulness, approach (5) was ranked first, with (14) lower down. Both of these approaches consider a relationship with N^i. In the \( N^{i} > 100 \) group, approaches (8) and (3) succeeded in 5 out of the 7 datasets. Both consider a relationship with N^tr. In the usefulness ranking, (8) was 11th and (3) third. The success rate in the results grouped for all \( N^{i} \le 50 \) was 50% or less.

On the basis of these findings, we recommend the following heuristic: in cases of more than 50 attributes in a dataset, apply the highly successful approaches (5) and (14) for \( 50 < N^{i} \le 100 \) and (8) and (3) for \( N^{i} > 100. \) For other cases, use approach (3) for an indication of reasonable network performance.
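The recommendation above can be expressed as a small lookup, sketched here in Python. The function names are hypothetical. Only approach (3), \( N^{h} = \sqrt{N^{tr}} \), is computed, because the formulas for approaches (5), (8) and (14) come from Table 1 and are not reproduced in this text. Rounding to the nearest whole neuron is an assumption.

```python
import math
from typing import Tuple


def recommended_approaches(n_inputs: int) -> Tuple[int, ...]:
    """Return the approach numbers suggested for a dataset with N^i inputs."""
    if n_inputs > 100:
        return (8, 3)
    if n_inputs > 50:
        return (5, 14)
    return (3,)


def approach_3_hidden_size(n_training: int) -> int:
    """Approach (3): N^h = sqrt(N^tr).

    Assumption: the result is rounded to the nearest integer, with a floor of 1.
    """
    return max(1, round(math.sqrt(n_training)))
```

For instance, a dataset with 20 attributes and 10,000 training instances would start the search at `approach_3_hidden_size(10000)`, that is, 100 hidden neurons.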

References

[1] Baum, E.B.: On the capabilities of multilayer perceptrons. J. Complex. 4, 193–215 (1988)
[2] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314 (1989)
[3] Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251–257 (1991)
[4] Zeng, X., Yeung, D.S.: Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing 69, 825–837 (2006)
[5] Aran, O., Yildiz, O.T., Alpaydin, E.: An incremental framework based on cross-validation for estimating the architecture of a multilayer perceptron. Int. J. Pattern Recogn. Artif. Intell. 23, 159–190 (2009)
[6] Hecht-Nielsen, R.: Kolmogorov’s mapping neural network existence theorem. In: Proceedings of IEEE First Annual International Conference on Neural Networks, pp. III-11–III-14 (1987)
[7] Sprecher, D.A.: A universal mapping for Kolmogorov’s superposition theorem. Neural Netw. 6, 1089–1094 (1993)
[8] Barron, A.R.: Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14, 115–133 (1994)
[9] Rogers, L.L., Dowla, F.U.: Optimization of groundwater remediation using artificial neural networks with parallel solute transport modeling. Water Resour. Res. 30, 457–481 (1994)
[10] Somaratne, S., Seneviratne, G., Coomaraswamy, U.: Prediction of soil organic carbon across different land-use patterns. Soil Sci. Soc. Am. J. 69, 1580–1589 (2005)
[11] Denker, J.S., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., Hopfield, J.: Large automatic learning, rule extraction and generalization. Complex Syst. 1, 877–922 (1987)
[12] Wanas, N.M., Auda, G.A., Kamel, M.S., Karray, F.O.: On the optimal number of hidden nodes in a neural network. In: IEEE Canadian Conference on Electrical and Computer Engineering 1998, vol. 2, pp. 918–921 (1998)
[13] Gallinari, P., Thiria, S., Soulie, F.F.: Multilayer perceptrons and data analysis. In: IEEE International Conference on Neural Networks 1988, vol. 391, pp. 391–399 (1988)
[14] Shibata, K., Ikeda, Y.: Effect of number of hidden neurons on learning in large-scale layered neural networks. In: ICROS-SICE International Joint Conference 2009, pp. 5008–5013. SICE, Fukuoka International Congress Center, Japan (2009)
[15] Arai, M.: Bounds on the number of hidden units in binary-valued three-layer neural networks. Neural Netw. 6, 855–860 (1993)
[16] Huang, S.-C., Huang, Y.-F.: Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Trans. Neural Netw. 2, 47–55 (1991)
[17] Deepa, S.N., Sheela, K.G.: Estimation of number of hidden neurons in back propagation networks for wind speed prediction in renewable energy systems. Draft (2013)
[18] Xu, S., Chen, L.: A novel approach for determining the optimal number of hidden layer neurons for FNNs and its application in data mining. In: Proceedings of the 5th International Conference on Information Technology and Applications, 23–26 June 2008, Cairns, Qld, pp. 683–686 (2008)
[19] Gorman, R.P., Sejnowski, T.J.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1, 75–89 (1988)
[20] Bache, K., Lichman, M.: UCI Machine Learning Repository. School of Information and Computer Science, University of California, Irvine (2013)
[21] Hoyer, P.O., Ong, C.S., Henschel, S., Braun, M.L., Sonnenburg, S.: IDA Benchmark Repository, vol. 0.1.6. ML Group, Berlin (2013)
[22] ACM Special Interest Group on Knowledge Discovery and Data Mining: KDD Cup 2004: Particle physics; plus protein homology prediction. ACM (2004). http://www.kdd.org
[23] Dayhoff, J.: Neural Network Architectures: An Introduction. International Thomson Computer Press, Boston (1996)
[24] http://ulcar.uml.edu/~iag/CS/Intro-to-ANN.html
[25] Flexer, A.: Statistical evaluation of neural network experiments: minimum requirements and current practice, pp. 1005–1008. The Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 (1994)
[26] Demuth, H., Beale, M., Hagan, M.: Neural Network Toolbox 6 User’s Guide. The MathWorks Inc., Natick (2009)
[27] Møller, M.F.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 6, 525–533 (1993)


Author information

Authors and Affiliations

School of Engineering and ICT, University of Tasmania, Launceston, Australia
Amanda Lunt & Shuxiang Xu


Corresponding author

Correspondence to Amanda Lunt.

Editor information

Editors and Affiliations

University of Tasmania, Hobart, Australia
Byeong Ho Kang
Auckland University of Technology, Auckland, New Zealand
Quan Bai


About this paper

Cite this paper

Lunt, A., Xu, S. (2016). An Empirically-Sourced Heuristic for Predetermining the Size of the Hidden Layer of a Multi-layer Perceptron for Large Datasets. In: Kang, B.H., Bai, Q. (eds) AI 2016: Advances in Artificial Intelligence. AI 2016. Lecture Notes in Computer Science, vol. 9992. Springer, Cham. https://doi.org/10.1007/978-3-319-50127-7_47


DOI: https://doi.org/10.1007/978-3-319-50127-7_47
Published: 29 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50126-0
Online ISBN: 978-3-319-50127-7
eBook Packages: Computer Science (R0)
