1 Introduction

NeuroEvolution (NE) is the application of Evolutionary Algorithms (EA) to the training of Artificial Neural Networks (ANN) [11, 48]. NE’s history began by evolving the connection weights of fixed topology ANNs [30, 46]. This method brought many advantages over the still popular gradient based methods; such as simple back propagation [31]. These advantages include: being able to natively escape local optima, being less sensitive to the initial connection weights, being suited to deep ANNs and not requiring that each neuron’s Transfer Function (TF) be differentiable [49]. NE is also suited to reinforcement learning as well as supervised learning; whereas back propagation is only suited to supervised learning. Other ANN training methods such as restricted Bolzmann machines are also suited to unsupervised learning [34].

A significant advantage of NE is its ability to evolve the topology of ANNs; as well as the connection weights. Topology evolving NE methods include: GNARL [1], NEAT [35], SAGA [7] and CGPANN [14, 39]. This ability to automatically create suitable topologies is significant as topology has been shown to strongly influence the effectiveness of back propagation [16] and weight only evolving NE [40]. Evolving the topology of ANNs has even been shown to be more important to training than evolving connection weights [40]. Although some non-evolutionary ANN training methods do adapt topology, they typically achieve this by iteratively adding or removing neurons during training. This approach is akin to a local search of topologies, and is consequently likely to become trapped in locally sub-optimal topologies [1]. It has been shown that using simple back propagation with hand crafted topologies produce results as good as NE [4]. This result demonstrates the benefit of topology optimising NE; the topology is also optimised and does not have to be hand crafted by trial and error. Finally gradient descent methods struggle to train deep ANNs [12, 16] whereas the depth of the network has no impact on NE algorithms. This coupled with the fact that deep neural networks are thought to be more efficient in terms of the number of neurons required to solve a task [3] is another benefit of topology optimising NE.

Interestingly, NE can also be used to optimise the TF of each neuron within heterogeneous ANNsFootnote 1. However, this capability of NE has been widely overlooked in recent research. Indeed, at the turn of the twenty-first century many ANN publications stated that more research was required concerning the optimisation of TFs: “Relatively little has been done on the evolution of node transfer functions, let alone the simultaneous evolution of both topological structure and node transfer functions” [49, “The current emphasis in neural network research is on learning algorithms and architectures, neglecting the importance of transfer functions” [8] and “Selection and/or optimisation of transfer functions performed by artificial neurons have been so far little explored ways to improve performance of neural networks in complex problems” [9]. However, a search of the literature reveals that there has been little active research in this area. This paper intends to help fill this gap by showing how NE can easily optimise neuron TFs during evolution and that doing so produces strongly beneficial results.

The remainder of this paper is structured as follows. Section 2 discusses research literature relating to the application NE to the evolution of the TFs of ANNs. Section 3 describes the investigations which were undertaken using NE to evolve heterogeneous ANNs, leading to the results given in Sect. 4. Finally Sect. 5 discusses the overall findings with closing conclusions given in Sect. 6.

2 Background

There are many ANN TFs found in the literature [9]. However, the majority of NE implementations only evolve homogeneous ANNs of logistic or Gaussian functions, which have both been shown capable of universal approximation; [13] and [26] respectively. Of those which do evolve heterogeneous ANNs, there are two main methods.

The first method selects the TF of each neuron from a predetermined list of TFs. Training methods which use this method include General Neural Networks (GNN) [17]; which randomly adds or removes logistic or Gaussian TFs using an evolutionary programming method. GNN is also a hybrid approach which makes use of back propagation during training. Other NE methods which select specific TFs for each neuron include Parallel Distributed Genetic Programming (PDGP) [28], a modified Hierarchical Co-evolutionary Genetic Algorithm (\(\hbox {HCGA}_2\)) [45] and Cartesian Genetic Programming of Artificial Neural Networks (CGPANN) [14, 39]. These methods use genes to encode which TF is used by each neuron. These genes are then subject to mutation and/or crossover during evolution.

The second method by which NE can optimise neuron TFs is to use TFs which are described by a number of parameters [9]. The training methods then optimise these parameters for each individual neuron. A simple version of this technique has been used by CGPANN [19]; where the widths of Gaussian functions were optimised for each neuron. Again the parameter(s) associated with each neuron’s TF were encoded in the chromosome by the inclusion of additional gene(s). A more complex version of this method was used in [2] where each neuron’s TF was itself an evolved Genetic Program. This method allowed for an almost limitless variation of TFs. Another example where each neuron is described by a number of genes, is state-enhanced neural networks [25], where the dynamics of each neuron are evolved. These state-enhanced neural network exhibit memory which can be utilised on partially observable Markov decision tasks.

Until now however, there has been little research which empirically and rigorously investigates if the ability for NE to evolve heterogeneous ANNs actually provides any benefit. This is important research as if it is shown to be beneficial it could easily be adopted by other NE methods; as the described methods just require additional genes for each neuron. As discussed there are two ways in which NE can evolve TFs: (1) by choosing the TF of each neuron from a predetermined list or (2) by optimising parameters associated with each individual neuron. Additionally these two methods can be combined by allowing evolution to both select the TF for each neuron and optimises the parameters associated with the TF. Here both of these methods are investigated along with a combination of the two. The investigation uses two NE strategies and compares the results to evolving regular homogeneous ANNs.

3 Investigation

The investigation presented on evolving heterogeneous ANN using NE takes four parts. The first investigation is to identify if the choice of TF impacts on the effectiveness of NE for evolving homogeneous ANNs. The second investigates if evolving heterogeneous ANNs, by allowing evolution to select each neuron’s TF from a predetermined list, outperforms evolving homogeneous ANNs. The third investigates if using NE to optimise parameters associated with each neuron’s TF outperforms evolving homogeneous ANNs. The fourth investigates using NE to both select each neuron’s TF from a predetermined list and optimise parameters associated with that TF.

The remainder of this section introduces the NE methods employed by the investigation, the TFs made available and the benchmarks used.

3.1 NeuroEvolutionary Strategies

In order to undertake the described experiments, two NE methods were used; this is to ensure that any conclusions are not specific to a particular type of NE. The chosen NE methods are Conventional NeuroEvolution (CNE) and Cartesian Genetic Programming of Artificial Neural Networks (CGPANN). CNE is the simplest (and oldest) form of NE and evolves the connection weights of fixed topology ANNs. CGPANN is a more complex NE method which evolves both the connection weights and topology of ANNs. These two NE methods represent the two main types of NE; those which evolve only connection weights and those which evolve connection weights and topologyFootnote 2.

3.1.1 Conventional NeuroEvolution

Conventional NeuroEvolution [30] operates by storing the connection weights of a fixed topology ANNs as an array of floating point numbers. Each of these arrays represents a chromosome. Mutation is implemented by selecting a new random weight value for each gene (weight) with a given probability. CNE is extended here to be capable of evolving each neuron’s TF by the inclusion of an additional gene per neuron. These additional TF genes can either be used as an index in a look-up-table of TFs, or as a parameter value to be used by each neuron’s TF. As CNE uses fixed topologies, this topology must be selected in advanced by the user.

3.1.2 Cartesian Genetic Programming of Artificial Neural Networks

CGPANN [14, 39] is the application of Cartesian Genetic Programming (CGP) to the evolution of ANNs. CGP [22, 24] is a form of Genetic Programming (GP) which represents computational structures as directed graphs of nodes indexed by their Cartesian coordinates. CGP does not suffer from bloat [21, 41]. Bloat refers to the condition where during evolutionary time the size of evolved computational structures grow without limit. Many other forms of GP do suffer from bloat and there has been intensive research to address the issue [33]; interestingly other graph based encoding schemes have also been shown to suffer less from bloat that standard GP [32]. CGP chromosomes also contain non-functioning genes enabling neutral genetic drift during evolution [44, 50]. CGP typically evolves acyclic networks but can also be easily adapted to evolve cyclic or recurrent networks [42]. CGP typically uses point or probabilistic mutation and no crossover.Footnote 3

Other forms of graph based GP have also been proposed including Parallel Algorithm Discovery and Orchestration (PADO) [36] and Parallel Distributed Genetic Programming (PDGP) [27]. PDGP has also been applied to NE [27]. PDGP is similar in structure to CGP, graph based, but uses different mutation and crossover operations. CGP also placed fewer restraints on the structure of the generated graphs [22]. Research relating to PDGP appears to have been largely abandoned.

Each CGP chromosome is comprised of a number of gene types: function genes (\(F_i\)), connection genes (\(C_{i,j}\)) and output genes (\(O_i\)). The function genes represent indexes in a function look-up-table and describe the functionality of each node. The connection genes define from where each node gathers its inputs. For regular acyclic CGP, connection genes may connect a given node to any previous node in the program, or any of the program inputs. The output genes address any program input or internal node and define which nodes are used as program outputs.

Originally CGP programs were organized with nodes arranged in rows (nodes per layer) and columns (layers); with each node indexed by its row and a column. However, this is an unnecessary constraint, as any configuration possible using a given number of rows and columns is also possible using one row with many columns; provided the total number of nodes remains constant. This is due to CGP being capable of evolving where each node connects its inputs. Consequently, here the chromosomes are defined with one row and \(n\) columns; with each node only indexed by its column. A generic (one row) CGP chromosome is given in Eq. 1; where \(\alpha\) is the arity of each node, \(n\) is the number of nodes and \(m\) is the number of program outputs.

An example CGP program is given in Fig. 1 along with its corresponding chromosome. As can be seen, all nodes are connected to previous nodes or program inputs. Not all program inputs have to be used, enabling evolution to decide which inputs are significant. An advantage of CGP over tree-based GP, again seen in Fig. 1, is that node outputs can be reused multiple times, rather than requiring the same value to be recalculated if it is needed again. Finally, not all nodes contribute to the final program output, these represent the inactive nodes which enable neutral genetic drift and make variable length phenotypes possible.

$$\begin{aligned} F_0 C_{0,0} \ldots C_{0,\alpha } F_1 C_{1,0} \ldots C_{1,\alpha } \ldots \ldots F_{n}C_{n,0} \ldots C_{n,\alpha } O_0 \ldots O_m \end{aligned}$$
(1)
Fig. 1
figure 1

Example CGP program corresponding to the chromosome: 012 233 124 4

Cartesian Genetic Programming is easily applied to ANNs [14, 39] by the inclusion of connection weight genes (\(W_{i,j}\)) for each node input and by using TFs suited to ANNs. CGPANN exhibits all of the benefits of CGP and is a NE training method which can evolve the weights, topology [40] and TFs of ANNs. Although CGP evolves topology, it is required that the user specifies a maximum network size. This could be considered a drawback, but overestimating the required number of nodes has been shown to be highly beneficial for CGP [23]. Similarly, a maximum neuron arity must be specified, however, the arity of each neuron can be lower than this maximum [39]. This occurs when the chromosome describes two neurons being connected by two or more connections. In this case, multiple connections between two neurons are equivalent to one connection; with the connection weight value being the sum of the individual weights.

It is important to note that the types of ANN created using CGPANN are unconventional and often cannot be described in terms of layers and nodes per layer. Figure 2 gives an example of the type of ANN which can be created using CGPANN. It can be seen that each neuron’s input is highly unconstrained; they can connect to any previous neuron in the network including input neurons. It can also be seen that the arity of each neuron can vary. Additionally any neuron can be used as an output; including the input neurons. Figure 2 demonstrates that by allowing NE to optimise topology, evolution is capable of discovering topologies which would be unlikely to be considered by a human designer.

Fig. 2
figure 2

Depiction of the types of ANN created using CGPANN

3.2 Transfer functions

The TFs used in this investigation are the Heaviside step function, Eq. 2, the Gaussian function, Eq. 3, and the logistic sigmoid functionFootnote 4, Eq. 4. Each of these TFs is shown graphically in Fig. 3. These particular TFs were selected as they are the most commonly used by ANNs.

Fig. 3
figure 3

Form left to right: Heaviside step function, Gaussian function and the logistic function. With \(\sigma = 1\) for the Gaussian and logistic functions

As can be seen in Eqs. 3 and 4, the Gaussian and logistic functions have been given in a form which contains a \(\sigma\) variable. Where \(\sigma = 1\) gives the typical form of these TFs. When using NE to evolve parameters associated with each neuron’s TF, the \(\sigma\) value can be evolved or optimised. Figures 4 and 5 show the Gaussian and logistic function respectively for a range of \(\sigma\) values.

$$\begin{aligned}&f(x)= {\left\{ \begin{array}{ll} 1,&{} \text {if }\,x\ge 0\\ 0,&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)
$$\begin{aligned}&f(x)= \exp \left( -\frac{x^2}{2 \sigma ^2} \right) \end{aligned}$$
(3)
$$\begin{aligned}&f(x)= \frac{1}{1 + \exp (-\sigma x)} \end{aligned}$$
(4)
Fig. 4
figure 4

Variable Gaussian function. From left to right \(\sigma\) = 1, 2 and 3

Fig. 5
figure 5

Variable logistic function. From left to right \(\sigma\) = 1, 2 and 3

3.3 Benchmarks

In order to draw strong conclusions regarding whether it is beneficial to evolve TFs, it is necessary to examine its effectiveness on a wide range of benchmarks. In this paper five benchmarks are employed. The chosen benchmarks mainly include supervised learning classification tasks, a common application of ANNs, but also include a reinforcement learning control task.

Despite many of the described benchmarks being classification tasks, they each use their own type of fitness function. Although this adds complexity, the fitness functions used here are those typically used with these benchmarks. This is done to make standard the use of these benchmarks; which is important when comparing machine learning methods.

3.3.1 Ball Throwing

The ball throwing benchmark [15] is a reinforcement learning control task. The task is to design a controller for a driven arm so as to throw a ball a distance of \(\ge\)9.5 m. A depiction of the task is given in Fig. 6, with the equations describing the dynamics of the arm given in Eqs. 5 and 6; symbol definitions given in Table 1. The model is simulated using Euler integration with a time step of 0.01 s for 3,000 time steps. The control system has two inputs \(\theta\) and \(\omega\) and outputs two values \(T\) and whether or not to release the ball. The inputs to the controller are linearly scaled from \(\pm \pi /2\) and \(\pm 5\) rad/s to a [0,1] range for \(\theta\) and \(\omega\) respectively. The first output of the controller sets the torque applied to the arm and is linearly mapped to a \([-5,5]\) N range. The ball is released if the second output exceeds a threshold of 0.5. Once the ball is released, Newtonian mechanics are used to calculate the distance the ball is thrown (\(d\)) which is then used as the fitness value.

Fig. 6
figure 6

Depiction of the Ball Throwing benchmark

$$\begin{aligned}&\left( \dot{\theta },\dot{\omega }\right) = \left( \omega , -c\cdot w + \frac{g\cdot \sin (\theta )}{l} + \frac{T}{m\cdot l^2} \right) \end{aligned}$$
(5)
$$\begin{aligned}&\omega = 0 \quad \text{ if } \vert \theta \vert \ge \pi /2 \end{aligned}$$
(6)
Table 1 Ball throwing symbol definitions with commonly used values

3.3.2 Full Adder

The full adder benchmark is the task of implementing a full adder circuit using an ANN. The ANN has three inputs (two input bits and a carry bit) and two outputs (one for the sum bit and the other for the carry out). Each output is decoded as a ‘1’ if \(\ge\)0.5, otherwise it is decoded as a ‘0’. The fitness value assigned to each chromosome is the number of correct output bits generated after every possible input pattern has been applied; see Table 2. This results in a maximum fitness of sixteen.

Table 2 Full Adder truth table

3.3.3 Monks Problem 1

The Monks Problems [37] are a set of three classification benchmarks intended for comparing learning algorithms. The classification tasks are based on the appearance of robots which are described by six attributes, each with a range of values; see Table 3. Only the first classification task is used here, where a robot belongs to a class if head_shape = body_shape OR jacket_color = red. This task is one of the recommended non-trivial benchmarks given in [10]. The task uses 124 of the possible 432 combinations for the training set and the remainder for the testing set. The implementation commonly used by ANN is to assign each value of each attribute its own input to the network; totaling seventeen inputs. Each of these inputs is set as ‘1’ if the particular attributes value is present and as ‘0’ otherwise. The ANN classifies each sample as belonging to the class if the single ANN output is \(\ge\)0.5. The target fitness is zero percent classification error.

Table 3 Monks Problem robot descriptions

3.3.4 Two Spirals

The two spirals classification benchmarks was created in the 1980s and was originally posted on a connectionist mailing list by Alexis Wieland [5]. The task is considered highly challenging for ANN [6]. The benchmark consists of 194 data points describing samples taken from two spirals in Cartesian space; see Fig. 7. The task is to classify to which spiral each sample belongs using only the \((x,y)\) Cartesian coordinates. The target fitness is zero, meaning that there are no incorrect classifications. The ANN comprises of two inputs for the \((x,y)\) Cartesian coordinates of each sample, and one output. The two inputs are linearly scaled into a \([0,1]\) range by assuming the maximum values are \(\pm 6.5\) and \(\pm 6\) for the \(x\) and \(y\) axis respectively. When the output value is \(<\)0.5 it is interpreted as one spiral and \(\ge\)0.5 as the other.

Fig. 7
figure 7

Depiction of the two spiral classification benchmark

3.3.5 Proben1: Cancer1

The Cancer1 datasetFootnote 5 is a classification task described in the Proben1 document [29]. The dataset was originally constructed at the University of Wisconsin Hospital [18]. Each sample in the dataset describes nine values, recorded by a surgeon using fine needle aspiration, of a tumour located in the breast of patients. Each sample is labelled with two mutually exclusive flags, benign and malignant, indicating the tumour type. All of the values are scaled into a \([0,1]\) range. The dataset contains 699 samples, 65.5 % of which represent benign tumours. The first 525 samples are used as the training set with the remainder used for the testing set. The fitness assigned to each chromosome is the squared error percentage, Eq. 7. Where \(o _{min}\) and \(o _{max}\) are the minimum and maximum output values form the ANN, \(N\) is the number of outputs from the ANN, \(P\) is the number of training examples, \(o _{pi}\) are the actual output values from the ANN and \(t _{pi}\) are the target outputs. Therefore the optimum corresponds to a squared percentage error equal to zero.

$$\begin{aligned} E = 100 \cdot \frac{o _{max} - o _{min}}{N \cdot P} \sum _{p=1} ^{P} \sum _{i=1} ^{N} (o _{pi} - t _{pi}) ^{2} \end{aligned}$$
(7)

4 Results

Four experiments are presented here which investigate the influence of TFs when using NE to train ANNs. All of the results presented are the average fitness of the best solutions found from fifty repeated runs. Each run was terminated after 100,000 generations, all used a \((1+4)\)-ESFootnote 6, 3 % probabilistic mutationFootnote 7 and connection weights in the range \(\pm\)5. When using CNE, three hidden layers were used, each containing ten neurons; plus one input layer and one output layer. The arity of each neuron was such that the ANN was fully connected between layers. When using CGPANN the maximum number of nodes was set as thirty each with a maximum arity of ten.

Where appropriate, the results are compared using the non-parametric two sided Mann–Whitney U test and effect size [43] statistics. A U test value of \(<\)0.05 indicates that the difference between two datasets is statistically significant. The effect size value shows the important of this difference considering the spread of the data; with values \(>\)0.56 showing small importance, \(>\)0.64 medium importance and \(>\)0.71 large importance. Therefore if a comparison between results is shown to be statically significant with a medium or large effect size, then we can be reasonably sure that any difference is not due to under sampling and that the difference is significantly large.

4.1 Experiment 1: Homogeneous Networks

The first experiment identifies whether, and to what extent, the choice of TF impacts on the effectiveness of training homogeneous ANNs using NE. As previously discussed, the three TFs used for this investigation are the Heaviside step, Gaussian and logistic functions; see Sect. 3.2.

The average fitness achieved when using each TF is given for the five benchmarks in Tables 4 and 5; when using CNE and CGPANN respectively. The average fitness value is given in bold if it represents the best fitness for that benchmark; indicating the most suitable TF for that benchmark. When appropriate, the fitness is given for the training and testing sets. Where the testing fitness is the average fitness achieved by each of the fifty runs on the testing set after training on the training set is complete. The statistical significance between the fitnesses achieved using each TF are given in Tables 6 and 7; when using CNE and CGPANN respectively. When the difference is statistically significant, \(p\le 0.05\), the value is given in bold. The effect size of the differences between the fitnesses are also given in Tables 8 and 9; when using CNE and CGPANN respectively. When the effect size is of medium or greater importance the value is given in bold.

Table 4 Average fitness achieved using homogeneous ANNs of different TFs trained using CNE
Table 5 Average fitness achieved using homogeneous ANNs of different TFs trained using CGPANN
Table 6 Statistical significance between the homogeneous CNE fitness results given in Table 4
Table 7 Statistical significance between the homogeneous CGPANN fitness results given in Table 5
Table 8 Effect size between the homogeneous CNE fitness results given in Table 4
Table 9 Effect size between the homogeneous CGPANN fitness results given in Table 5

In all cases a perfect solutions was found for the Full Adder benchmark and similarly in many cases for the Ball Throwing. When perfect solutions are found no comparisons can be made in terms of the fitnesses achieved. For this reason the number of required generations to find perfect solutions are also presented in for these cases. Tables 10 and 11 give the average number of generations required by CNE and CGPANN for the cases where perfect solutions were found. These generation results are analysed using the same methods for testing statistical significance as before, given in Tables 12 and 13 for CNE and Tables 14 and 15 for CGPANN.

Table 10 Average number of generations required to find optimal solutions using homogeneous ANNs of different TFs trained using CNE
Table 11 Average number of generations required to find optimal solutions using homogeneous ANNs of different TFs trained using CGPANN
Table 12 Statistical significance between the homogeneous CNE generation results given in Table 10
Table 13 Effect size between the homogeneous CNE generation results given in Table 10
Table 14 Statistical significance between the homogeneous CGPANN generation results given in Table 11
Table 15 Effect size between the homogeneous CGPANN generation results given in Table 11

In the case of CGPANN, all TFs found a solution to the Ball Throwing benchmark; to throw the ball a distance of \(\ge\)9.5 m. Interestingly some TFs managed, on average, to throw the ball much further than the 9.5 m target. This could be framed as greater generalisation. However this aspect of the ball throwing results are not analysed further in this paper.

From the results given in Tables 4 and 5 it can be seen, for both CNE and CGPANN, that the choice of TF has a large impact on the effectiveness of NE. Additionally, in the majority of cases these differences are shown to be statistically significant and with a medium or large effect size. This confirms that the choice of TF has a large impact on the effectiveness of evolving homogeneous ANN using NE.

A further interesting result, also seen in Tables 4 and 5, is that despite being the least commonly used of the three TFs, the Heaviside step function produced the best results in many cases. It can also be seen that the best TF was often also dependent on the NE method used.

4.2 Experiment 2: Heterogeneous Networks

The second experiment identifies if allowing NE to evolve heterogeneous ANNs, by selecting each neuron’s TF from a predetermined list, produces better results than evolving homogeneous ANNs. Evolving the TF used by each neuron is considered beneficial if the result is better than the average of using each TF individually. This measure is chosen because when approaching a new task it not generally known which TF would be most suited, therefore a TF would have to be selected arbitrarily. When evolving heterogeneous ANNs the need to make this choice is removed, and hence it should be considered beneficial if it beats the average random choice of TF. The average fitness for evolving homogeneous ANNs using each TF individually for the five benchmarks are given in Tables 4 and 5 for CNE and CGPANN respectively. In the cases where a perfect solution is always found the required number of generations is also given and used for the comparison.

The results achieved when evolving heterogeneous ANN are given in Tables 16 and 17 for CNE and CGPANN respectively. The results are given in bold if the fitness is better, or equal, to the average of using each TF individually; or the average number of generations required to find a perfect solution is equal or lower. The percentage of neurons which use each TF is also given in Tables 16 and 17; this is only for the active nodes in the CGPANN case. No statistical analysis can be undertaken for this experiment as the comparison is against the average result of using each TF individually.

Table 16 Average fitness achieved using heterogeneous ANNs trained using CNE
Table 17 Average fitness achieved using heterogeneous ANNs trained using CGPANN

As can be seen in Tables 16 and 17, in the majority of cases evolving heterogeneous ANNs outperformed the average result of evolving homogeneous ANNs. This indicates that evolving heterogeneous ANNs is typically a better strategy than evolving homogeneous ANNs. This holds unless the user knows in advance which TF is most suited to a given task; in which case that TF should be used. Interestingly in the case of CNE applied to the ball throwing benchmark, evolving heterogeneous ANNs produce an equal fitness to that of using the best TF alone. Additionally for CNE applied to the Proben1:Cancer1 benchmark, heterogeneous ANNs produce a better fitness on the test set than using the best TF alone. This indicates that it may be the ability to mix TFs, as well as choose which are used, which is offering an evolutionary advantage.

4.3 Experiment 3: Evolving Transfer Function Parameters

The third experiment is to identify if optimising parameters associated with each neuron is beneficial for NE. As previously discussed, the parameters to be optimised vary the shape of the Gaussian and logistic functions; see Sect. 3.2. In each case the ANNs are comprised of the same TF, Gaussian or logistic, but a parameter controlling the shape of each neuron’s TF is evolved. Here the parameter values for each TF are limited to the set {1, 2, 3}, see Eqs. 3 and 4; but this is not a requirement of the method.

Evolving parameters associated with each neuron’s TF will be considered beneficial if it produces stronger results than the use of the non-parametrised counterpart e.g. if variable Gaussian produces stronger results than the standard Gaussian TF. This comparison is used as it isolates the aspect we are interested in and tests whether the ability to vary the shape of each neuron’s TF provides any benefit.

The average fitnesses achieved using variable Gaussian and variable logistic functions on the five benchmarks are given in Tables 18 and 20 respectively when using CNE. In cases where the target fitness is always reached the average number of generations used are given in Tables 19 and 21. Similarly, Tables 22 and 24 give the results when using CGPANN. Again in cases where the target fitness is always reached the average number of generations used are given in Tables 23 and 25. In all cases the results are compared against those obtained for the non-variable form of the TFs. In the given tables, a bold fitness/number of generations indicates that the variable TF performed better than the non-variable form. Additionally, a bold value for the the U test indicates statistical significance and a bold value for the effect size indicates a medium or large effect size. For instance, if the fitness, U test and effect size values are all given in bold then the variable TF is shown to strongly outperform the non-variable counterpart with statistical significance. If however the value is not bold, but the U test and effect size values are bold, then this shows that the non-variable TF strongly outperformed the variable counterpart with statistical significance. If the U test is bold but the effect size is not bold, then then this indicates a weak difference with statistical significance. If the U test value is not bold then the difference between non-variable and variable TFs is not statistically significant and is considered insignificant.

It can be seen in Tables 18, 19, 20, 21, 22, 23, 24, 25, in the majority of cases, the variable version of the TF outperformed the non-variable form. Additionally, many instances where the variable form is superior are also shown to be statistically significance with a medium to large effect size. Only four of the twenty cases show the non-variable form outperforming the variable form with statistical significance and medium to large effect size. Eleven of the twenty cases showed the variable form outperforms the non-variable form with statistical significance with a medium to large effect size. Of the remaining five cases, one showed variable TFs offered a weak disadvantage and the remainder showed no significant differenceFootnote 8. Therefore using variable TFs is shown to be often beneficial and rarely worse.

Table 18 Average fitness of ANNs of variable Gaussian TFs trained using CNE
Table 19 Average number of generations required to find optimal solutions using ANNs of variable Gaussian TFs trained using CNE
Table 20 Average fitness of ANNs of variable logistic TFs trained using CNE
Table 21 Average number of generations required to find optimal solutions using ANNs of variable logistic TFs trained using CNE
Table 22 Average fitness of ANNs of variable Gaussian TFs trained using CGPANN
Table 23 Average number of generations required to find optimal solutions using ANNs of variable Gaussian TFs trained using CGPANN
Table 24 Average fitness of ANNs of variable logistic TFs trained using CGPANN
Table 25 Average number of generations required to find optimal solutions using ANNs of variable logistic TFs trained using CGPANN

The differences can also be given in terms of the NE method used. In seven of the ten cases which used CNE, the variable TFs offered an advantage compared with three cases where it was a disadvantage. For CGPANN, five out of the ten cases showed variable TFs offered an advantage compared to one of the ten cases showing a disadvantage. The results can also be given in terms of the TF employed. In six of the ten cases the variable logistic TF offered and advantage and in two cases a disadvantage. In five of the ten cases the variable Gaussian TF offered an advantage and in three cases a disadvantage.

4.4 Experiment 4: Evolving Heterogeneous Networks and Transfer Function Parameters

The fourth experiment investigates allowing NE to evolve heterogeneous ANNs whilst also optimising parameters associated with each neuron’s TF. The function set contains the step, Gaussian and logistic functions where the Gaussian and logistic functions can have their \(\sigma\) value in the set {1,2,3}; the step function is not parametrised.

Evolving heterogeneous ANNs which also optimise TF parameters will be considered beneficial if it produces stronger results than the use of the non-parametrised heterogeneous counterpart; given in Sect. 4.2. This comparison is made to see if including variable TFs can improve again upon the use of heterogeneous ANN.

The results of evolving heterogeneous ANN with variable TF parameters are given in Tables 26 and 28 when using CNE and CGPANN respectively. The tables give the average fitness for applying heterogeneous ANN with variable TFs to each of the five benchmarks. The average fitness is given in bold if it represents a better average fitness than that found when evolving non-parametrised heterogeneous ANNs. The U-test and effect size are also in the same format as in Sect. 4.3. As previously, if perfect solutions are always found the number of generations required to find perfect solutions is also give so as to allow comparison. These results are given in Tables 27 and 29 for CNE and CGPANN respectively.

Table 26 Average fitness of heterogeneous ANNs of variable TFs trained using CNE
Table 27 Average number of generations required to find optimal solutions using heterogeneous ANNs of variable TFs trained using CNE

As can be seen in Tables 26, 27, 28, 29, in the majority of cases the addition of varying TF parameters to evolving heterogeneous ANNs does not improve the performance. When using CNE it produced statistically significant superior results for the Ball Throwing benchmark and produced statistically significant worse results for the Two Spirals benchmark. When using CGPANN it produced no statistically significant superior results and produced statistically significantly worse results for the Two Spirals benchmark. In all other cases there was no statistical significance between the two techniques. It can therefore be concluded that the addition of variable TFs to evolving heterogeneous ANNs does not improve the performance over the use of non-variable TFs.

Table 28 Average fitness of heterogeneous ANNs of variable TFs trained using CGPANN
Table 29 Average number of generations required to find optimal solutions using heterogeneous ANNs of variable TFs trained using CGPANN

4.5 Box and whisker plots

For high-level inspection, all of the previously given results are presented as box and whisker plots. The boxes represent the lower and upper quartile ranges and the whiskers are set as 1.5 the interquartile range. Points greater than 1.5 of the interquartile range are labelled with a red ‘+’ and points greater than 3 times the interquartile range labelled with a red ‘\(\circ\)’. The median of each plot is given as a solid red line and the arithmetic mean is given as a dashed black line.

The average homogeneous fitness, given in Sect. 4.1, is also given as a dashed green line spanning the three homogeneous functions; step, Gaussian and logistic. The average homogeneous fitness is calculated as the arithmetic mean of the arithmetic means of the fitnesses achieved for the step, Gaussian and logistic function.

As a perfect fitness was often achieved for the Ball Throwing and the Full Adder benchmarks, the average number of generations to find the perfect solution are also given as box plots.

The CNE results are given in Figs. 8, 9, 10, 11, 12, 13, 14, 15, 16 and the CGPANN results are given in Figs. 17, 18, 19, 20, 21, 22, 23, 24, 25.

Fig. 8
figure 8

Fitnesses achieved from applying CNE to the Ball Throwing benchmark

Fig. 9
figure 9

Generations required from applying CNE to the Ball Throwing benchmark

Fig. 10
figure 10

Fitnesses achieved from applying CNE to the Full Adder benchmark

Fig. 11
figure 11

Generations required from applying CNE to the Full Adder benchmark

Fig. 12
figure 12

Fitnesses achieved from applying CNE to the monks problem 1 benchmark—training

Fig. 13
figure 13

Fitnesses achieved from applying CNE to the monks problem 1 benchmark—testing

Fig. 14
figure 14

Fitnesses achieved from applying CNE to the two spirals benchmark

Fig. 15
figure 15

Fitnesses achieved from applying CNE to the Proben Cancer1 benchmark—Training

Fig. 16
figure 16

Fitnesses achieved from applying CNE to the Proben Cancer1 benchmark—testing

Fig. 17
figure 17

Fitnesses achieved from applying CGPANN to the ball throwing benchmark

Fig. 18
figure 18

Generations required from applying CGPANN to the ball throwing benchmark

Fig. 19
figure 19

Fitnesses achieved from applying CGPANN to the full adder benchmark

Fig. 20
figure 20

Generations required from applying CGPANN to the full adder benchmark

Fig. 21
figure 21

Fitnesses achieved from applying CGPANN to the monks problem 1 benchmark—training

Fig. 22
figure 22

Fitnesses achieved from applying CGPANN to the monks problem 1 benchmark—testing

Fig. 23
figure 23

Fitnesses achieved from applying CGPANN to the two spirals benchmark

Fig. 24
figure 24

Fitnesses achieved from applying CGPANN to the Proben Cancer1 benchmark—training

Fig. 25
figure 25

Fitnesses achieved from applying CGPANN to the Proben Cancer1 benchmark—training

5 Discussion

This paper has empirically demonstrated that when using NE to create homogeneous ANNs the choice of TF has a large impact on the fitness of the solutions found. It was also shown that no single TF produced the best results in all cases. Therefore when using homogeneous ANNs one must accept possibly inferior results or repeat training with a range of TFs. This paper has also empirically demonstrated that, on average, evolving heterogeneous ANNs produces better results than the average of using each TF individually. Additionally it has been shown that optimising parameters associated with each neuron’s TF produces better results, on average, than using the typical fixed TFs. Interestingly however, a combination of evolving heterogeneous ANNs where each neuron’s TF can also be adapted was shown not to produce better results than simply evolving heterogeneous ANNs of static TFs.

The results presented in Sect. 4.1 demonstrated that the choice of TF has a large impact on the effectiveness of NE. This is an intuitive result as it is likely that particular TFs are more or less suited suited to given tasks; it appears to mirror the ‘No Free Lunch’ theorem [47] but concerning TFs. However, although intuitive, it is a significant result as a user is unlikely to know, in advance of training, which TF is most suited to a given task.

A further interesting and unexpected result from the first experiment is that, in many cases, the Heaviside step function was found to be the most effective TF; particularly for CGPANN. The step function was the original TF used by the McCulloch and Pitts neuron models [20]. The fact that the step function is incompatible with back propagation algorithms and is only suited to tasks with binary outputs is probably the reason other TFs have been favoured. Here, however, it has been shown that when using NE the Heaviside step function is still a suitable TF for contemporary ANNs provided the task is compatible with binary outputs.

The second experiment demonstrated that allowing NE to evolve heterogeneous ANNs produced better results, on average, than the average result obtained by evolving homogeneous ANNs of each TF. This is significant because, as the first experiment demonstrates, the choice of TF has a large impact on the effectiveness of the final ANNs. This, coupled with the fact there is no way of knowing which TF will be most suited to a given task before training begins, puts homogeneous ANN at a disadvantage. The importance of this result is heightened by the fact that the majority of NE methods are probably capable of evolving heterogeneous ANNs. The evolution of heterogeneous ANNs may even be further improved by the inclusion of additional TFs not considered here; and as NE places no restrictions on the types of TFs used, the range of possible TF is limitless. It may even be the case that certain TFs complement each other while others may not.

In the application of CNE to the Ball Throwing and Proben1:Cancer1 (training) benchmarks, evolving heterogeneous ANNs produced equal and better average fitnesses than for any of the homogeneous ANNs investigated. This indicates that evolution may be positively combining TFs to produce better results than could be achieved using each TF individually.

A further result from the second experiment concerns the percentage of neurons which used each type of TF in the evolved heterogeneous ANNs. Interestingly, it was never the case that one type of TF strongly dominated the networks. If this had occurred it would have indicated that evolution has found a particular TF to be the most suited toward the given task. There was, however, reasonable variation in the percentages of each type of TF used for CNE applied to the Full Adder and Two Spirals benchmarks and CGPANN applied to the Monks Problem1 benchmark. This shows that in curtain conditions evolution is providing some form of pressure to use a particular type of TF i.e. it is not simply random. The fact that a particular TF was not favoured in many cases may also indicate that evolution is combining the functionality of all the TFs.

The third experiment demonstrated that, in the majority of cases, using NE to optimise parameters associated with each neuron provided superior results compared to using using non-parametrised TFs. This is also an important result as the inclusion of an additional gene (or genes) which alter the characteristics of each neuron’s TF is again probably compatible with all NE methods.

It was also seen in the third experiment that the logistic TF benefited from being parametrised significantly more than the Gaussian TF. This was despite the non-parametrised Gaussian TF producing worse results than the logistic TF overall i.e. there was more room for improvement. It appears that the logistic TF strongly benefits from being parametrised. It could be the case however that this result is dependent on the range of values the TF parameters can take; this was not investigated in this paper. For instance the Gaussian TF may work best over a much smaller or much larger range of values.

In Sects. 4.2 and 4.3 evolving heterogeneous ANNs and evolving parameters associated with each neuron’s TF were individually shown to be beneficial for NE. However, in Sect. 4.4 it was show that when combined they produced no benefit, on average, than evolving heterogeneous ANNs with fixed TF parameters. It may be the case that using parameterised TFs and heterogeneous ANN produce a similar benefits, however, using them together increases the search space dimensionality without increasing the density of good solutions. It could also be possible that performance depends subtly on evolutionary parameter settings so that when these methods are combined new parameter settings are required for optimum performance. Further research would be required to determine this.

As mentioned in Sect. 2, homogeneous ANNs comprised of logistic or Gaussian TFs can be universal approximators. This means that with the correct connection weight values and combination of TFs, standard homogeneous ANNs are capable of implementing everything which heterogeneous ANNs can also implement. However, this fact says nothing about how efficiently standard homogeneous ANNs can implement certain configurations. Where here the term efficiently refers to both the number of neurons required and the time needed to configure the ANNs. Therefore the fact that ANNs are universal approximators is not enough to be considered useful. It is also necessary that the ANNs can also be configured towards a given task. To this end it appears that heterogeneous ANNs are, on average, more efficiently configurable.

Although this paper has demonstrated using NE to evolve ANNs it only used a limited set of TFs and optimised only a single parameter associated with each neuron over a small range of values. Further research should therefore investigate additional TFs and allow for more complex TFs described by multiple parameters; such as those described in [2]. Although certain TFs have been shown to be universal approximators this tells us nothing about how “trainable”/“evolvable” they are. For instance other TFs, such as the step function, were demonstrated to produce better results than other universal approximator TFs.

6 Conclusion

The use of NE to optimise the weights and topology of ANNs is well established and offers a number of advantages over traditional training methods; such as back propagation. However, the use of NE to create heterogeneous ANNs has so far been under researched and under utilised. This paper has demonstrated the use of two methods for allowing NE to create heterogeneous ANNs. That is, selecting each neuron’s TF from a predetermined list of TFs or by optimising parameters associated with each neurons TF. The paper has also shown that the effectiveness of using NE to train homogeneous ANNs is highly dependent on the selected TF. Using NE to optimise each neuron’s TF has been empirically demonstrated to alleviate this issue.

The results presented in this paper are significant as the methods described for creating heterogeneous ANNs are likely to be compatible with all NE methods. That is many NE method could benefit from evolving heterogeneous ANNs.