1 Introduction

Trained under supervision, a 3-layer multilayer perceptron (MLP) will find ‘hidden’ relationships within a set of data by approximating continuous functions [2]. The trained network may then be used for prediction tasks on previously unseen data from the same domain, with the final configuration unique to each specific dataset. The size of the hidden middle layer, Nh, has a strong bearing on the prediction accuracy of the final model [3], yet the predominant technique for locating Nh is resizing through trial and error. Exhaustive search through a range of Nh becomes problematic with larger datasets, increasing demands on processor capacity and extending the time required for training. The heuristic proposed in this paper is useful because it narrows the scope of the search for a suitable network size. We note that a reasonable network architecture may not be limited to a single ‘correct’ configuration [4], so long as the underlying function can be learnt while the network remains small enough to generalise [5].

Table 1 summarises a set of proposed mathematical relationships between Nh and three quantities: the number of input neurons, Ni, and the number of output neurons, No, both of which are fixed, and the number of instances of the dataset used for training, Ntr. We used Ntr for our calculations rather than NTOT, the total number of instances, as it relates directly to the training process.

Table 1. Fourteen ways to determine Nh. Approach number was attributed randomly.

1.1 Research Question

Which of the existing approaches can assist the search for a suitable number of neurons in the single hidden layer of an MLP for larger datasets?

2 Experiment

Our simple experiment investigates the performance of each approach when compared with global minimum benchmarks [19]. Thirty-one datasets with many attribute-target pairs or high dimensionality were sourced [20–22] (see Table 2).

Table 2. Characteristics of 31 datasets. Most are from http://archive.ics.uci.edu/ml/datasets except (b) http://mldata.org and (c) http://osmot.cs.cornell.edu/kddcup. Larger sets were excluded as too slow to train with available resources.

A lower and an upper limit for Nh were established for the training of each dataset, based on calculations from the approaches in Table 1. We set the lower bound at the calculation closest to 0, while the upper bound was set at an Nh that we judged trainable, with flexibility to extend it where processor capacity allowed. Where an approach takes the form of a lower or upper bound, the calculated Nh at the bound was used.

Weights were initialised randomly to represent prior knowledge [23]. Training, test and validation sets (70-15-15% of NTOT) were also randomly generated for the best opportunity to locate the global minimum [24]. A network of each size was trained 10 times with cross-validation to account for random influences [25]. We performed our experiment using the patternnet function of the MATLAB Neural Network Toolbox (version 6) add-on with the scaled conjugate gradient backpropagation algorithm [26, 27].
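As an illustration of the random 70-15-15% partition described above, the following Python sketch shuffles instance indices and cuts them into three sets. It is not the MATLAB code used in the experiment; the function name `split_indices`, the `seed` parameter and the rounding of set sizes are assumptions made for this example.

```python
import random


def split_indices(n_total, seed=None, fractions=(0.70, 0.15, 0.15)):
    """Randomly partition range(n_total) into three index lists.

    The three lists follow the order of `fractions` (70-15-15% by default).
    Hypothetical helper: rounding of set sizes is an assumption.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)

    n_first = round(fractions[0] * n_total)
    n_second = round(fractions[1] * n_total)

    first = indices[:n_first]
    second = indices[n_first:n_first + n_second]
    third = indices[n_first + n_second:]  # remainder goes to the last set
    return first, second, third


# Example: a new random split for each of the 10 repeated training runs.
splits = [split_indices(1000, seed=run) for run in range(10)]
```

Drawing a fresh split for every repeat, as in the last line, is one way to vary the random influences between runs; the source does not state whether the split was redrawn per run.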

3 Results

The global minimum for each dataset was located at the Nh with the smallest performance error averaged over the repeated runs, where performance error is the mean of squared errors between the actual output and the desired output [25]. Approaches (3) and (1) each calculated an Nh that coincided with the global minimum in one case, WallFollRobot2 and AdultIncome respectively.
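The selection of the global minimum can be sketched in Python as follows, assuming the mean squared errors of the repeated runs have been collected per hidden layer size. The function name `global_minimum_nh` and the example error values are hypothetical and are not taken from the experiment.

```python
from statistics import mean


def global_minimum_nh(errors_by_nh):
    """Return (Nh, averaged error) for the Nh with the smallest mean error.

    errors_by_nh maps each hidden layer size Nh to the list of mean squared
    errors recorded over the repeated training runs at that size.
    """
    averaged = {nh: mean(errors) for nh, errors in errors_by_nh.items()}
    best_nh = min(averaged, key=averaged.get)
    return best_nh, averaged[best_nh]


# Illustrative (made-up) errors for three candidate sizes, two runs each.
example = {5: [0.20, 0.22], 10: [0.10, 0.12], 20: [0.15, 0.11]}
best_nh, best_error = global_minimum_nh(example)
```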

Not all approaches gave us a sensible calculation for Nh for every dataset. Only approaches (4), (5) and (7) produced a result for all 31 datasets. Table 3 combines this raw count [A] with the count of datasets where performance at the approach’s calculated Nh intersects with the global minimum (95% CI) in a multiple comparison of means [B].

Table 3. An excerpt of the simple ranking of approaches according to relative usefulness, ordered from ‘most’ useful and truncated for brevity. [B] was scaled in the final column to indicate its relationship to the research question, with no impact on the final rank.

Figure 1 gives an overview of two further comparisons with the performance range at the global minimum Nh. Single diamonds are derived from the count of individual performance measures for an approach that fall within the global minimum range, over all datasets. Approach (3), \( N^{h} = \sqrt {N^{tr} } \), is clearly the most successful on this measure, with its individual measures falling within the range 71.7% of the time across all datasets.

Fig. 1. Performance at calculated Nh compared with global minimum range over 31 datasets.
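The single-diamond measure could be computed along the following lines. This Python sketch assumes that the ‘global minimum range’ is the interval between the smallest and largest performance errors recorded at the global minimum Nh, which the text does not state explicitly; the function name `percent_within_range` and the example values are hypothetical.

```python
def percent_within_range(approach_errors, minimum_errors):
    """Percentage of an approach's individual errors inside the minimum range.

    approach_errors: errors recorded at the approach's calculated Nh.
    minimum_errors: errors recorded at the global minimum Nh.
    Assumption: the range is [min, max] of minimum_errors, inclusive.
    """
    low, high = min(minimum_errors), max(minimum_errors)
    hits = sum(1 for error in approach_errors if low <= error <= high)
    return 100.0 * hits / len(approach_errors)


# Illustrative (made-up) values: three of four measures fall in [0.09, 0.13].
share = percent_within_range([0.10, 0.12, 0.20, 0.11], [0.09, 0.13])
```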

The second set of comparisons is presented as bar graphs that have been separated into Ni groupings to allow for disparity in attribute dimensionality across the datasets: \( N^{i} \le 10; \) \( 10 < N^{i} \le 50; \) \( 50 < N^{i} \le 100; \) and \( N^{i} > 100. \) Each bar shows the percentage of times that the average of the 10 performance measures at a pre-calculated Nh occurred within the range of performances recorded at the global minimum Nh, grouped by Ni. Approaches (5) and (14) were both highly successful in the \( 50 < N^{i} \le 100 \) group (4 out of 5 cases), while the averages for (3) and (8) occurred within the global minimum range in 5 out of the 7 cases in the \( N^{i} > 100 \) group.

Also of note, (2), (3), (8) and (9)’s averages placed in the global minimum range for the \( 50 < N^{i} \le 100 \) group in 3 of the 5 cases. The results for the two groups where \( N^{i} \le 50 \) (the remaining 19 datasets) were no better than 50%.

4 Discussion and Conclusion

We empirically determined a single, optimal structure between lower and upper bounds for Nh for each dataset, comparing the performance of each approach with the range at this global minimum in several ways.

All approaches other than (3) recorded an individual measurement in all datasets’ global minimum ranges in 50% or fewer cases. Approach (3)’s consistency (over 71%) is notable due to the variations between the 31 datasets.

With averaged performances, approaches (5) and (14)’s 80% success where \( 50 < N^{i} \le 100 \) is tempered by there being only 5 datasets in that group. In the initial ranking according to relative usefulness, approach (5) was ranked first, with (14) lower down. Both of these approaches consider a relationship with Ni. In the \( N^{i} > 100 \) group, approaches (8) and (3) succeeded in 5 out of the 7 datasets. Both consider a relationship with Ntr. In the usefulness ranking, (8) was 11th and (3) third. The success rate in the results grouped for all \( N^{i} \le 50 \) was 50% or less.

On the basis of these findings, we recommend the following heuristic: for datasets with more than 50 attributes, apply the highly successful approaches (5) and (14) where \( 50 < N^{i} \le 100 \), and (8) and (3) where \( N^{i} > 100. \) For other cases, use approach (3) for an indication of reasonable network performance.
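A minimal Python sketch of this recommendation is given below. It returns the recommended approach numbers for a given Ni and evaluates only approach (3), \( N^{h} = \sqrt {N^{tr} } \), since that is the only formula stated in the text; the formulas for approaches (5), (8) and (14) would need to be taken from Table 1. The function names and the rounding to the nearest integer are assumptions.

```python
import math


def recommend_approaches(n_inputs):
    """Return the approach numbers recommended for a dataset with Ni inputs."""
    if n_inputs > 100:
        return (8, 3)
    if n_inputs > 50:
        return (5, 14)
    return (3,)


def approach_3_nh(n_train):
    """Approach (3): Nh = sqrt(Ntr), rounded to the nearest integer.

    Rounding is an assumption; the source does not specify it.
    """
    return round(math.sqrt(n_train))


# Example: a dataset with 60 attributes and 10,000 training instances.
approaches = recommend_approaches(60)
nh_from_approach_3 = approach_3_nh(10000)
```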