1 Introduction

Machine learning is an active research area involving the development of methods for automated data mining and analysis, which aims to uncover useful knowledge for decision making applications in real-world domains (Bishop 2006; Witten et al. 2010). Classification is a central data mining task concerned with predicting the class of a given pattern based on its input attributes, using a well-constructed model (classifier) (Han et al. 2011a; Tan et al. 2005). Bank credit scoring, financial fraud detection, churn analysis, and targeted advertising are examples of well-known classification problems, in addition to applications in numerous, diverse fields, including bioinformatics, healthcare, and engineering (Han et al. 2011a; Tan et al. 2005; Witten et al. 2010).

The process of building a classifier consists of two stages. The training stage utilizes a training set of labeled patterns—i.e., a set of patterns along with their correct class labels—that should be sufficiently representative of the domain of interest. A classification algorithm uses the training set to construct a model that captures the relationships between the attributes of the input patterns and their corresponding class labels. Then, during the subsequent operating stage, the model is used to predict the class of new unlabeled patterns that were not present during the training stage.

Inspired by biological systems, artificial neural networks (NN) (Haykin 2008) are one of the most widely studied and applied models for pattern discrimination (classification). NNs are generally presented as systems of inter-connected computational units (neurons), each of which takes inputs and produces an output, together with a set of real-valued weights associated with the inter-neuronal connections. The connectivity structure of a NN, the activation function of the neurons, and the connection weights determine the decision boundaries that separate patterns of different classes in the data space, and are used to determine the class of a new pattern. The most commonly used pattern discrimination NNs are feed-forward neural networks (FFNN) with a three-layer topology (structure). The structure consists of an input layer, a single hidden layer, and an output layer, with full connectivity between the neurons in one layer and the neurons in the following layer. In a FFNN, the sizes of the input and output layers are determined by characteristics of the dataset, while the number of neurons in the hidden layer is often manually determined by practitioners based on various heuristics involving the number of input and output neurons, the size of the training set, the expected number of training iterations, and the estimated difficulty of the problem at hand.

In general, much of the work in the NN literature has focused on studying and developing techniques for training (i.e., learning the connection weights of) a NN with a given user-defined structure. Allowing arbitrary feed-forward topologies and automatically optimizing the structure of a NN—based on a dataset at hand—can lead to more effective pattern classification models. The present paper is an extended version of the ANTS 2014 conference paper (Salama and Abdelbar 2014), where ANN-Miner, an Ant-based Neural Network structure learning algorithm, was introduced.

In this paper, we build on the work described by Salama and Abdelbar (2014) in six ways. First, we use the \(\hbox {ACO}_{\mathbb {R}}\) algorithm (Socha and Dorigo 2008) for training the NN structures produced by our ANN-Miner algorithm. \(\hbox {ACO}_{\mathbb {R}}\) is a state-of-the-art ant colony algorithm for continuous optimization problems, which has been applied to training NNs with the standard three-layer structure (Blum and Socha 2005; Socha and Blum 2007). We also compare ANN-Miner to standard \(\hbox {ACO}_{\mathbb {R}}\) with the three-layer structure. Second, we use the quadratic loss function as a more effective function for evaluating the quality of the candidate constructed NNs, compared to the simple accuracy quality evaluation function that was used by Salama and Abdelbar (2014). Third, we compare our ACO-based algorithms with a baseline greedy hill-climber (GHC). Fourth, the number of datasets used in the experimental evaluation is increased from 20 to 40. Fifth, we compare our proposed approach to three different state-of-the-art classifiers, namely the C4.5 decision tree induction algorithm, the Ripper classification rule induction algorithm, and support vector machines (SVM), as well as to two well-known baseline classifiers, namely one-nearest-neighbor and Naïve Bayes. Sixth, we compare our proposed approach to NEAT (Stanley and Miikkulainen 2002; Stanley et al. 2005b, 2009), a prominent evolutionary algorithm for evolving neural networks.

The remainder of the paper is organized as follows. An overview of the ACO meta-heuristic is given in Sect. 2. We discuss the ACO related work—generally in classification and specifically in NNs—in Sect. 3. A background on feed-forward neural networks, along with different techniques for NN weight learning, is provided in Sect. 4. In Sects. 5 and 6, we describe our ANN-Miner algorithm and its related variations. Section 7 presents a review of related neuroevolutionary methods. Sections 8 and 9 report our experimental methodology and results, respectively. We conclude with some general remarks and directions for future research in Sect. 10.

2 Ant colony optimization

Ant colony optimization (ACO) (Dorigo et al. 1999; Dorigo and Stützle 2004, 2010) is a meta-heuristic for combinatorial optimization problems, inspired by the behavior of natural ant colonies. The basic principle of ACO is that a population of artificial ants cooperate with each other to find the best path in a graph, analogously to the way that natural ants cooperate to find the shortest path between two points such as their nest and a food source (Dorigo and Stützle 2004, 2010; Dorigo et al. 1996).

In ACO, each artificial ant constructs a candidate solution to the target problem, represented by a combination of solution components in the search space. Ants cooperate via indirect communication, by depositing pheromone on the selected solution components for a candidate solution. The amount of pheromone deposited is proportional to the quality of that solution, which influences the probability with which other ants will use that solution’s components when constructing their solution. This contributes to the global search aspect of ACO algorithms. In addition, the probability of an ant choosing a solution component often also depends on the value of a heuristic function that measures the desirability of that component.

The global search aspect is also promoted by the fact that a population of ants will search for the best solution in parallel, thus exploring possibly different regions of the search space at each iteration of the algorithm. As a result of this global search, ACO is less likely to get trapped in local optima than conventional greedy algorithms, which increases the chances of finding a near-optimal solution in the search space. Note that in some widely used variations of the ACO procedure (e.g., \(\mathcal {MAX}\)–\(\mathcal {MIN}\) Ant System (Stützle and Hoos 2000)), multiple ant solutions are created in an iteration of the algorithm, and then the ant with the best created solution updates the pheromone trail. We use this behavior of the \(\mathcal {MAX}\)–\(\mathcal {MIN}\) Ant System in our ANN-Miner algorithm, as described in Sect. 5.2.

3 Related work on ACO for pattern classification

ACO has been successful in tackling the classification problem of data mining. A number of ACO-based algorithms have been introduced in the literature for learning different types of classification models, including classification rules, decision trees, Bayesian networks, and neural networks.

Ant-Miner (Parpinelli et al. 2002) is the first ant-based classification algorithm which discovers a classification model comprised of a list of IF–THEN classification rules. The algorithm has been followed by several extensions, such as Ant-Miner+ (Martens et al. 2007), FRANTIC-SRL (Galea and Shen 2006), cAnt-Miner (Otero et al. 2009), Multi-pheromone Ant-Miner (Salama et al. 2011, 2013), and recently cAnt-Miner\(_{\mathrm {PB}}\) (Otero et al. 2013; Otero and Freitas 2013).

ACDT (Boryczka and Kozak 2010, 2011) and Ant-Tree-Miner (Otero et al. 2012; Salama and Otero 2014) are two different ACO-based algorithms for inducing decision trees for classification. Salama and Freitas have recently employed ACO to optimize the dependency relationships in various types of Bayesian network classifiers, such as Bayesian network augmented naïve-Bayes (Salama and Freitas 2013b), Bayesian multinets (Salama and Freitas 2014b, 2015), and class Markov blankets (Salama and Freitas 2013, 2014a).

In the context of pattern classification neural networks, the ACO meta-heuristic was utilized for learning NN weights in two previous works. Liu et al. proposed ACO-BP, a hybrid of the ant colony and back propagation (BP) algorithms, to optimize NN weights (Liu et al. 2006). They use ACO to search a discretized set of weight values, and then use BP to fine-tune the discrete weights found by ACO. Blum and Socha applied \(\hbox {ACO}_{\mathbb {R}}\), an ACO algorithm for continuous optimization (Socha and Dorigo 2008; Liao et al. 2014), to train feed-forward neural networks (Blum and Socha 2005; Socha and Blum 2007). We revisit \(\hbox {ACO}_{\mathbb {R}}\) in more detail in Sect. 4.2 as we use it in our experiments with our ANN-Miner algorithm.

Note that, to the best of our knowledge, ACO has not been previously applied to learning the structure of neural networks prior to the introduction of our ANN-Miner algorithm. For a comprehensive review of ACO algorithms in data mining, the reader is referred to the survey by Martens et al. (2011).

4 Feed-forward neural networks

Feed-forward neural networks (FFNN), which are neural networks in which the pattern of connections between neurons is acyclic, are among the most popular and well-established methods for pattern classification. The most common FFNN topology is a three-layer structure in which neurons are arranged in an input layer, a hidden layer, and an output layer, with full connectivity between layers; i.e., the output of every neuron in a layer feeds in as an input to every neuron in the succeeding layer. The external input to the network feeds into the input layer, and the network’s external output is the output of the output layer.

Each neuron i is fairly simple and can be considered to be a simple circuit which receives r inputs \(o_1,\ldots ,o_r\) (these inputs may represent the outputs of neurons in the previous layer or may represent the network’s external inputs) and produces a single output \(o_i\):

$$\begin{aligned} net_i= & {} \sum _{j=1}^r w_{ij} o_j + \theta _i \end{aligned}$$
(1)
$$\begin{aligned} o_i= & {} f(net_i) \end{aligned}$$
(2)

where each input \(o_j\) is the output of a neuron in the previous layer, the weight \(w_{ij}\) represents a real-valued weight between neuron j and neuron i, \(\theta _i\) represents a weight associated with neuron i itself called the neuron’s self-bias, and f is a nonlinear activation function. Note that input neurons do not have self-biases.
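Equations (1) and (2) can be illustrated with a minimal sketch. The text does not fix a particular activation function f, so the logistic sigmoid used here is an assumption, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic activation; an assumed choice for f, not specified in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Compute o_i = f(net_i), where net_i = sum_j w_ij * o_j + theta_i."""
    net = sum(w * o for w, o in zip(weights, inputs)) + bias
    return activation(net)


# Two inputs whose weighted sum cancels out, with a zero self-bias.
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```

Here the net input is 0.5 - 0.5 + 0 = 0, so the sigmoid returns 0.5.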

After an input pattern x is presented to the network, the output of the network is observed and is referred to as the actual output vector \(y^{\prime }\). A discrepancy function E is used to compare the target output y to the actual output \(y^{\prime }\) resulting in a scalar error value. A common discrepancy function is the simple sum of squared error:

$$\begin{aligned} E = \sum _{p\in P} E_p \end{aligned}$$
(3)

where P is the set of training patterns and

$$\begin{aligned} E_p = \frac{1}{2} \sum _{i=1}^m (y_i - y_i^\prime )^2 \end{aligned}$$
(4)

where m is the number of classes.

In pattern classification applications, the target vector y is m-dimensional where m is the number of classes. For a pattern with class label \(\hat{c}\):

$$\begin{aligned} y_k = {\left\{ \begin{array}{ll} 1 &{} \text{ if } k=\hat{c}\\ 0 &{} \text{ otherwise }\\ \end{array}\right. } \end{aligned}$$
(5)
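As a small illustration of Eqs. (3) to (5), the following sketch builds the target vector for a class label and computes the squared-error terms. Class labels are assumed to be zero-based indices here, and the function names are hypothetical.

```python
def one_hot(true_class, m):
    """Target vector y of Eq. (5): 1 for the true class, 0 otherwise."""
    return [1.0 if k == true_class else 0.0 for k in range(m)]


def pattern_error(target, actual):
    """E_p of Eq. (4): half the sum of squared differences for one pattern."""
    return 0.5 * sum((y - yp) ** 2 for y, yp in zip(target, actual))


def total_error(targets, actuals):
    """E of Eq. (3): the sum of E_p over all training patterns."""
    return sum(pattern_error(y, yp) for y, yp in zip(targets, actuals))


y = one_hot(0, 3)
print(pattern_error(y, [0.5, 0.5, 0.0]))
```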

The weights and self-biases of a given FFNN are collectively referred to as the network’s weight vector w. For example, a FFNN with four neurons in the input layer, five neurons in the hidden layer, and three neurons in the output layer would have a weight vector of 43 real numbers. If the weight vector for a given network is fixed, then the output of the network is a function of its input, and the total error E of the network is a mathematical function of the training set. If the training set is fixed, then the error E is a function of the weight vector w. Our objective is to find the value of the weight vector w which minimizes the error E.
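The count of 43 in the example above follows from adding the connection weights to the self-biases of the hidden and output neurons. A short sketch of this arithmetic, with a hypothetical helper name:

```python
def weight_vector_size(n_in, n_hidden, n_out):
    """Number of real values in w for a fully connected three-layer FFNN."""
    connection_weights = n_in * n_hidden + n_hidden * n_out
    self_biases = n_hidden + n_out  # input neurons have no self-bias
    return connection_weights + self_biases


# 4*5 + 5*3 = 35 connection weights, plus 5 + 3 = 8 self-biases.
print(weight_vector_size(4, 5, 3))
```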

4.1 Backward error propagation

One of the earliest, and still most popular, approaches to neural network learning is based on gradient descent. For each element \(w_i\) of the weight vector, the partial derivative \(\frac{\partial E}{\partial w_i}\) represents \(w_i\)’s contribution to the network error. Therefore, the gradient-descent principle is that each \(w_i\) should be changed by an amount \(\Delta w_i\),

$$\begin{aligned} w_i = w_i + \Delta w_i \end{aligned}$$
(6)

such that:

$$\begin{aligned} \Delta w_i \propto - \frac{\partial E}{\partial w_i} \end{aligned}$$
(7)

This can be implemented as:

$$\begin{aligned} \Delta w_i = - \eta \frac{\partial E}{\partial w_i} \end{aligned}$$
(8)

where \(\eta \) is referred to as the learning rate.
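The update rule of Eqs. (6) and (8) can be sketched on a toy problem. The one-dimensional error function E(w) = (w - 3)^2 below is an illustrative assumption, not a neural network error, and the gradient is supplied analytically rather than by back propagation.

```python
def gradient_step(w, grad, eta):
    """Apply w_i <- w_i - eta * dE/dw_i to every element of the weight vector."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]


def toy_gradient(w):
    """Gradient of the assumed toy error E(w) = (w_0 - 3)^2."""
    return [2.0 * (w[0] - 3.0)]


w = [0.0]
for _ in range(100):
    w = gradient_step(w, toy_gradient(w), eta=0.1)
print(w)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, so w approaches 3.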

The back propagation (BP) algorithm (Werbos 1994) provides a mechanism for computing \(\frac{\partial E}{\partial w_i}\) for each element \(w_i\) in the weight vector by first computing the contribution of each neuron in the output layer to the error, and then propagating backwards through the network to compute the contribution of each neuron and weight in previous layers to the error.

4.2 Using \(\hbox {ACO}_{\mathbb {R}}\) to learn NN weights

The \(\hbox {ACO}_{\mathbb {R}}\) algorithm has recently been applied to learning the weights of a fixed-topology FFNN (Blum and Socha 2005; Socha and Blum 2007). In this application of \(\hbox {ACO}_{\mathbb {R}}\), an archive of L previous solutions is maintained, where a solution in this context refers to an instantiation of the weight vector w. In \(\hbox {ACO}_{\mathbb {R}}\), the solution archive plays the role that pheromone plays in other ACO algorithms.

In each iteration, each ant in the colony generates a candidate solution, again where each candidate solution is an instantiation of the weight vector. If there are m ants in the colony, then m solutions (weight vectors) are generated per iteration, and these m solutions are added to the archive, which is now temporarily of size \((L+m)\). The worst m solutions in the archive are then identified and discarded, thereby returning the archive to size L.

To evaluate a candidate solution (weight vector) w, the weight vector w is used to initialize the weights of a neural network. The training set is then applied to the network, and the total error over the training set is taken as a measure of the quality of the candidate solution—the lower the error, the better the solution.
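The archive update described above (add the m new solutions, then discard the worst m) can be sketched as follows. The `error_fn` argument stands in for the training-set error of a weight vector and is a hypothetical placeholder.

```python
def update_archive(archive, new_solutions, error_fn):
    """Merge the new solutions into the archive and keep the best L by error."""
    size = len(archive)
    merged = archive + new_solutions
    merged.sort(key=error_fn)  # lower error means better quality
    return merged[:size]


def toy_error(w):
    """Stand-in error function, assumed only for this example."""
    return sum(x * x for x in w)


archive = [[3.0], [1.0]]
archive = update_archive(archive, [[2.0], [0.0]], toy_error)
print(archive)
```

The returned list is sorted from best to worst, which matches the ranking used by the coefficients in Eq. (9).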

At the start of each iteration, the solutions in the archive are sorted by quality, with the best solution being given a rank of 1 and the worst a rank of L. Each solution \(s_i\) of rank i is given a coefficient \(\omega _i\) computed as:

$$\begin{aligned} \omega _i = g(i; 1, qL) \end{aligned}$$
(9)

where g denotes the Gaussian function:

$$\begin{aligned} g(x;\mu ,\sigma ) = \frac{1}{\sigma \sqrt{2\pi }}e^{-\frac{(x-\mu )^2}{2\sigma ^2}} \end{aligned}$$
(10)

This means that the coefficient \(\omega _i\) is assigned to be the value of the Gaussian function with argument i, mean 1.0, and standard deviation equal to qL, where q is a user-supplied parameter. Note that smaller values of q cause the better ranked solutions to have higher coefficients \(\omega \). Further, note that:

$$\begin{aligned} \omega _1 > \omega _2 > \cdots > \omega _L \end{aligned}$$
(11)
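The rank coefficients of Eqs. (9) and (10) can be computed directly. This is a minimal sketch with hypothetical function names. The values L = 10, q = 0.1, and q = 1.0 are chosen only for illustration.

```python
import math


def gaussian(x, mu, sigma):
    """Gaussian function g(x; mu, sigma) of Eq. (10)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def rank_weights(archive_size, q):
    """Coefficients omega_1..omega_L of Eq. (9), for ranks 1 (best) to L (worst)."""
    return [gaussian(i, 1.0, q * archive_size) for i in range(1, archive_size + 1)]


sharp = rank_weights(10, q=0.1)  # small q: strong preference for top ranks
flat = rank_weights(10, q=1.0)   # large q: nearly uniform coefficients
print(sharp[0] / sharp[-1], flat[0] / flat[-1])
```

Comparing the ratio of the best to the worst coefficient shows how a smaller q concentrates the selection on the better-ranked solutions.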

Next, let us consider how an ant constructs a solution. Let u denote the solution being constructed. Recall that u is a weight vector whose dimensionality depends on the topology of the network. The first step to constructing u is to select one of the L solutions in the archive by which to be influenced in the construction process. If the r-th solution in the (sorted) archive, of rank r, is denoted \(s_r\), then a solution is selected based on:

$$\begin{aligned} {\text {Pr}}(\mathrm{select\ }s_a) = \frac{\omega _a}{\sum _{r=1}^L \omega _r} \end{aligned}$$
(12)

Let \(s_a\) be the archive solution that is selected according to Eq. (12). Each element of u is then generated by sampling the Gaussian probability density function (PDF):

$$\begin{aligned} u_j \sim N(s_{aj},\sigma _{aj}) \end{aligned}$$
(13)

where \(N(\mu ,\sigma )\) denotes the Gaussian PDF with mean \(\mu \) and variance \(\sigma ^2\), \(s_{aj}\) represents the value of the j-th element of the solution \(s_a\) selected using Eq. (12), and \(\sigma _{aj}\) is computed as:

$$\begin{aligned} \sigma _{aj} = \xi \sum _{r=1}^L \frac{| s_{aj} - s_{rj} |}{L-1} \end{aligned}$$
(14)

where \(\xi \) is a user-supplied parameter of the algorithm which plays a role similar to evaporation rate in other ACO algorithms. The higher the value of \(\xi \), the slower the speed of convergence of the algorithm.
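The construction step of Eqs. (12) to (14) can be sketched as follows. This is a minimal illustration, not the reference implementation of \(\hbox {ACO}_{\mathbb {R}}\): the function name is hypothetical, the archive is assumed to hold at least two solutions, and Python's `random.Random.gauss` is used with \(\sigma _{aj}\) as the standard deviation.

```python
import random


def sample_solution(archive, omegas, xi, rng):
    """Build one candidate weight vector u from a rank-weighted archive."""
    size = len(archive)
    # Eq. (12): pick a guiding solution with probability proportional to omega.
    a = rng.choices(range(size), weights=omegas, k=1)[0]
    s_a = archive[a]
    u = []
    for j in range(len(s_a)):
        # Eq. (14): average distance from s_a to the archive in dimension j.
        sigma = xi * sum(abs(s_a[j] - s_r[j]) for s_r in archive) / (size - 1)
        # Eq. (13): sample around the guiding value.
        u.append(rng.gauss(s_a[j], sigma))
    return u


rng = random.Random(0)
archive = [[0.1, -0.4], [0.3, 0.2], [-0.5, 0.9]]
print(sample_solution(archive, [0.5, 0.3, 0.2], xi=0.85, rng=rng))
```

When all archive solutions coincide, every \(\sigma _{aj}\) is zero and the sampled vector equals the archive solution, which reflects convergence of the search.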

Once each ant has constructed its solution, the archive is updated as described above. The process repeats until the desired termination criteria are met.

5 A novel ACO algorithm for learning neural network structures

Unlike many neural network applications that use a simple three-layer network topology with full connectivity between layers (as discussed in Sect. 4), we allow our ACO-based technique to deviate from this, as follows. ANN-Miner allows connections to be generated between hidden neurons and other hidden neurons—under the restriction that the topology remain acyclic—as well as direct connections between input neurons and output neurons. This permits the production of networks with a variable number of layers, as well as arbitrary connections that skip over layers. The ACO elements of the ANN-Miner algorithm are defined in the following subsections.

5.1 ACO construction graph

In general, the core element of an ACO-based algorithm is the construction graph, which contains the solution components in the search space, and with which an ant constructs a candidate solution. In the case of the problem at hand, a candidate solution is a network structure, and the solution components are the selected connections between the neurons. The number of input neurons and output neurons depends of course on the dataset and the representation that is used for the attributes of the dataset, while the total number of hidden neurons is a user-supplied parameter. Suppose the total number of neurons is N, with \(N_i\) input neurons, \(N_o\) output neurons, and \(N_h\) hidden neurons. The set of available potential connections, denoted C, will then be comprised of four types of potential connections:

  1. Connections between input and hidden neurons (specifically \(N_i\times N_h\) connections),

  2. Connections between hidden and output neurons (specifically \(N_h\times N_o\) connections),

  3. Connections between input and output neurons (specifically \(N_i\times N_o\) connections),

  4. Connections between different hidden neurons.

The available connections between the \(N_h\) hidden neurons are defined as follows. In order to ensure that the network structure is acyclic, we impose the restriction that the connection \((n_i \rightarrow n_j)\) is not available if \(i \ge j\). In other words, each hidden neuron has a numeric index, and we only allow connections from a given hidden neuron \(n_i\) to a higher-numbered neuron \(n_j\). It is well known that any directed acyclic graph is isomorphic to a graph where the nodes are lexicographically ordered, and for all arcs (u, v) in the graph, u precedes v in the lexicographic order. Hence, the number of available connections between the \(N_h\) hidden neurons is

$$\begin{aligned} (N_h-1)+(N_h-2)+\cdots +1+0=N_h(N_h-1)/2 \end{aligned}$$

As previously mentioned, the numbers of input and output neurons, \(N_i\) and \(N_o\), are determined by the characteristics of the dataset. In the work described in this paper, we set the number of available hidden neurons to \(N_h=N_i+N_o\).
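The set C of potential connections can be enumerated with a short sketch. The string labels for neurons (`i0`, `h0`, `o0`, and so on) and the function name are assumptions made for illustration.

```python
def available_connections(n_in, n_out, n_hidden=None):
    """List every potential connection (source, destination) in the set C."""
    if n_hidden is None:
        n_hidden = n_in + n_out  # setting used in this paper
    inputs = [f"i{k}" for k in range(n_in)]
    hidden = [f"h{k}" for k in range(n_hidden)]
    outputs = [f"o{k}" for k in range(n_out)]
    conns = [(i, h) for i in inputs for h in hidden]    # input to hidden
    conns += [(h, o) for h in hidden for o in outputs]  # hidden to output
    conns += [(i, o) for i in inputs for o in outputs]  # input to output
    # Hidden to hidden: only from a lower to a higher index, to stay acyclic.
    conns += [(hidden[a], hidden[b])
              for a in range(n_hidden) for b in range(a + 1, n_hidden)]
    return conns


print(len(available_connections(4, 3)))
```

For 4 inputs and 3 outputs, \(N_h=7\), giving 28 + 21 + 12 + 21 = 82 potential connections.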

It is interesting that our ACO procedure can be viewed as pruning the maximally connected network structure that contains all |C| possible connections by selecting which connections to include in the network structure and which connections to exclude. Note that using the maximally connected NN structure may harm the generalization ability of the produced NN. That is, a NN model with a large number of connections can potentially overfit the training (in-sample) data during the training phase, by capturing noisy relationships related only to the training set. Consequently, this model might not perform well on new (out-of-sample) data.

Hence, each potential connection \(c = n_i \rightarrow n_j\), connecting neuron i to neuron j, has two solution components in the construction graph: \(D_c^{true}\), representing the decision to include connection \(n_i \rightarrow n_j\) in the current candidate NN structure being constructed by the ant, and \(D_c^{false}\), representing the decision not to include the connection. Each solution component \(D_c^a\) is associated with a pheromone amount (indirectly representing an estimate of the quality of this component in constructing effective candidate NN models). Therefore, the construction graph can be represented as a two-dimensional \(2 \times |C|\) array, consisting of an element \(D_c^a\) for every \(c=1,\ldots ,|C|\), and \(a\in \{false,true\}\).

5.2 A high-level view of the ANN-Miner algorithm

The pseudo-code of the ANN-Miner algorithm is given in Algorithm 1. In the initialization step of ANN-Miner (line 4), the amount of pheromone assigned to each solution component \(D_c^a\)—where a can be true or false—in the construction graph is initialized with the value 0.5. Hence, for each connection c, the probability of including \(i \rightarrow j\) (i.e., selecting \(D_c^{true}\)) in the topology equals the probability of not including \(i \rightarrow j\) (i.e., selecting \(D_c^{false}\)).

[Algorithm 1: pseudo-code of the ANN-Miner algorithm]

The pseudo-code of the ANN-Miner algorithm follows the approach of \(\mathcal {MAX}\)–\(\mathcal {MIN}\) Ant System (Stützle and Hoos 2000), where in each iteration, each ant in the colony constructs a solution, and the ant with the best constructed solution updates the pheromone trail. In the inner for-loop (lines 8–15), each \(ant_i\) in the colony creates a candidate solution \(NN_i\), i.e., a complete neural network (line 9). Then the quality of the constructed solution is evaluated (line 10). The best solution \(NN_{tbest}\) produced in the colony is selected to update the pheromone trail by an amount that is proportional to the quality of its solution \(Q_{tbest}\). After that, the algorithm compares the iteration-best solution \(NN_{tbest}\) with the best-so-far solution \(NN_{bsf}\) (the if statement in lines 17–20) to keep track of the best solution found so far during the algorithm’s execution.

This set of steps is considered an iteration of the outer repeat-until loop (lines 5–22) and is repeated until the same solution is generated for a number of consecutive iterations specified by the conv_iterations parameter (indicating convergence) or until max_iterations is reached. The values of conv_iterations, max_iterations and colony_size are user-specified. The parameter settings used in our experiments are shown in Sect. 8.2.

The best-so-far neural network undergoes an (optional) post-processing step (line 23) to produce the final neural network \(NN_{final}\) to be returned by the algorithm. Basically, the algorithm learns the final weights of the connections in the neural network \(NN_{bsf}\)—which uses the best NN structure found during the search process of the ACO algorithm. This is discussed in Sect. 6.2.
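The control flow described in this subsection can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' Algorithm 1: `construct`, `evaluate`, and `update_pheromone` are hypothetical callables, the post-processing step is omitted, and convergence is interpreted as the iteration-best solution being identical for `conv_iterations` consecutive iterations.

```python
def ann_miner_loop(construct, evaluate, update_pheromone,
                   colony_size, max_iterations, conv_iterations):
    """Return the best-so-far solution and its quality."""
    best_nn, best_q = None, float("-inf")
    previous, repeats = None, 0
    for _ in range(max_iterations):
        # Each ant builds and evaluates one candidate network.
        scored = []
        for _ant in range(colony_size):
            nn = construct()
            scored.append((evaluate(nn), nn))
        q_tbest, nn_tbest = max(scored, key=lambda pair: pair[0])
        # Only the iteration-best ant deposits pheromone.
        update_pheromone(nn_tbest, q_tbest)
        if q_tbest > best_q:
            best_nn, best_q = nn_tbest, q_tbest
        # Assumed convergence test: same iteration-best structure repeated.
        repeats = repeats + 1 if nn_tbest == previous else 1
        previous = nn_tbest
        if repeats >= conv_iterations:
            break
    return best_nn, best_q
```

A caller would supply a `construct` callable that samples a structure from the pheromone and an `evaluate` callable that trains the structure and scores it on the validation set.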

5.3 Creating a candidate solution

[Algorithm 2: pseudo-code for creating a candidate solution]

Algorithm 2 describes the process of creating a new candidate solution (neural network), which is called as a subroutine in line 9 of Algorithm 1. The procedure starts with an empty (edge-less) neural network (line 2). For each connection c in the available set of connections C, the ant selects \(D_c^a\) to decide whether to include this connection in the candidate network NN or not (line 3)—by either selecting solution component \(D_c^{true}\) or \(D_c^{false}\). The selection of the solution component at each step is based on the following probabilistic state transition formula:

$$\begin{aligned} p(D_c^a) = \frac{\tau \left( D_c^a\right) }{\tau \left( D_c^{true}\right) +\tau \left( D_c^{false}\right) } \end{aligned}$$
(15)

where \(p(D_c^a)\) is the probability of selecting decision \(D^a\) for connection c, and \(\tau (D_c^a)\) is the current amount of pheromone associated with \(D_c^a\). Note that, in this ACO algorithm, we do not use any heuristic information, that is, the probability of selecting a solution component is solely dependent on the current pheromone amounts associated with each component.
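The include-or-exclude decision of Eq. (15) can be sketched as follows. Storing the pheromone as a dictionary keyed by connection, with one entry per decision, is an assumed representation of the \(2 \times |C|\) construction graph. The function names are hypothetical.

```python
import random


def init_pheromone(connections):
    """Give both decisions of every connection the initial pheromone value 0.5."""
    return {c: {True: 0.5, False: 0.5} for c in connections}


def construct_structure(pheromone, rng):
    """Visit every connection once and decide whether to include it."""
    structure = []
    for c, tau in pheromone.items():
        # Eq. (15) with a = true; no heuristic information is used.
        p_include = tau[True] / (tau[True] + tau[False])
        if rng.random() < p_include:
            structure.append(c)
    return structure


pheromone = init_pheromone([("i0", "h0"), ("h0", "o0"), ("i0", "o0")])
print(construct_structure(pheromone, random.Random(42)))
```

With the initial values, each connection is included with probability 0.5, so early iterations explore structures of widely varying density.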

If the selected component is \( D_c^{true}\), that is, the ant selected the decision to include connection c in the NN structure, the corresponding connection \((n_i \rightarrow n_j)_c\) is appended to the candidate network NN (the if statement in lines 5–7). After the ant visits all the available connections in the construction graph and performs the include-or-not decision, the network structure of NN is now complete, and the weights of the neural network are ready to be learned. If a given neuron i either does not have an incoming path from any of the network inputs or does not have an outgoing path to any of the network outputs, then that neuron i is not included in the constructed network. If it happens that one of the network outputs does not have an incoming path from any of the network inputs, then the constructed network is assigned a poor quality evaluation without applying the BP weight-training process.
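The connectivity checks described above can be sketched with two graph traversals, one forward from the inputs and one backward from the outputs. This is one possible way to implement the checks, and the neuron labels and function names are hypothetical.

```python
def _reachable(starts, adjacency):
    """Return all nodes reachable from `starts` by following `adjacency`."""
    seen, stack = set(starts), list(starts)
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def prune_structure(connections, inputs, outputs):
    """Drop neurons lacking an input or output path; report output coverage."""
    forward, backward = {}, {}
    for src, dst in connections:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    from_inputs = _reachable(inputs, forward)
    to_outputs = _reachable(outputs, backward)
    useful = from_inputs & to_outputs
    kept = [(s, d) for s, d in connections if s in useful and d in useful]
    # False signals that the network should receive a poor quality evaluation.
    all_outputs_fed = all(o in from_inputs for o in outputs)
    return kept, all_outputs_fed


conns = [("i0", "h0"), ("h0", "o0"), ("i1", "h1"), ("h2", "o0")]
print(prune_structure(conns, ["i0", "i1"], ["o0"]))
```

In this example `h1` has no path to an output and `h2` has no path from an input, so both are dropped along with their connections.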

We train the neural network NN (line 9) using the back propagation (BP) procedure (described in Sect. 4.1), with some optimized parameter values (discussed in the following section), as a “quick and dirty” method to obtain a complete neural network and evaluate its pattern classification quality. We use BP for training the candidate neural network, not because it is the best weight optimization method, but because it is a fast procedure that is going to be repeated many times over the course of the computation. In addition, we are only interested in the relative quality difference between different NN structures trained by the same (even if not very efficient) BP procedure.

5.4 Evaluating the quality of a candidate solution

A key objective of a pattern classification algorithm is to learn models with good generalization capabilities, i.e., models that are able to accurately predict the class labels of new unknown data patterns. Overfitting occurs when the induced model has good classification performance (fit) on the training (in-sample) data used in the learning process, yet shows bad predictive performance (generalization) on new/testing data. Therefore, we split the training set \({\mathcal T}\) at the beginning of the algorithm into two mutually exclusive parts: (1) the learning set \({{\mathcal {T}}_l}\), which contains 80% of the training set and is used to learn the candidate NN structure and weights (line 9, Algorithm 2); and (2) the validation set \({{\mathcal {T}}_v}\), which contains 20% of the training set and is used to evaluate the quality of the model (line 10, Algorithm 1).

Let m denote the number of classes, \(\hat{c}\) denote the true (correct) class for a given pattern x, and c denote the class that is predicted by the neural network NN (i.e., \(c={{\mathrm{arg\,max}}}_{i}\{o_i\}\)). Recall from Sect. 4 that we use y to denote the m-dimensional target output vector of the network, and use \(y^\prime \) to denote the actual output vector, where \(y^\prime =(o_1,o_2,\ldots ,o_m)\). The output vector can be transformed into a vector p of class probability scores through a simple normalization:

$$\begin{aligned} p_k=\frac{o_k}{\sum _{j=1}^mo_j} \end{aligned}$$
(16)

A simple, and perhaps the most widely used, classification measure is accuracy. This is the quality measure that was used in previous work on ANN-Miner (Salama and Abdelbar 2014). For a given pattern x,

$$\begin{aligned} Acc(NN|x) = \left\{ \begin{array}{lll} 1\quad &{} \mathrm{if } &{} \hat{c}=c \\ 0 \quad &{} \mathrm{if } &{} \hat{c}\ne c \end{array} \right. \end{aligned}$$
(17)

For an entire validation set \({{\mathcal {T}}_v}\),

$$\begin{aligned} Q_{Acc}(NN|{{\mathcal {T}}_v})= \frac{1}{|{{\mathcal {T}}_v}|} \sum _{x\in {{\mathcal {T}}_v}} Acc(NN|x) \end{aligned}$$
(18)

where \(Q_{Acc}\) is the pheromone amount to be deposited in the pheromone update step (line 16, Algorithm 1); the higher the value of \(Q_{Acc}(NN|{{\mathcal {T}}_v})\), the better the quality of NN.

The accuracy measure has a deficiency, which can be illustrated using the following example. Suppose we have a pattern x where \(m=3\) and \(\hat{c}=1\). Consider three candidate NNs with the following probability score vectors given pattern x: \(NN_1(x)=(0.9, 0.1, 0)\), \(NN_2(x)=(0.6, 0.4, 0)\) and \(NN_3(x)=(0.6, 0.2, 0.2)\). For all three NNs, \(Acc(NN|x)\) equals 1, since each assigns its highest score to the true class. However, \(NN_1\) should clearly receive a better quality evaluation than \(NN_2\) and \(NN_3\), since it produces a higher probability for the true class.

In this work, we use the quadratic loss function (QLF), which is a widely used error measure, to evaluate the quality of constructed candidate NN models. For a given pattern x:

$$\begin{aligned} QLF (NN|x) = \sum _{k=1}^m \left[ y_k - y_k^{\prime }\right] ^2 \end{aligned}$$
(19)

Because the components of the y vector are equal to 1 for the correct class and to 0 for all other classes, and taking the normalized scores p of Eq. (16) as the network outputs, this equation can be rewritten as:

$$\begin{aligned} QLF(NN|x) = (1- p_{\hat{c}})^2 + \sum _{k\in [1,m]:k\ne \hat{c}} (p_k)^2 \end{aligned}$$
(20)

In the aforementioned example, the three probability vectors have QLF values of 0.02, 0.32, and 0.24, respectively. The QLF error measure would therefore prefer \(NN_1\), followed by \(NN_3\), followed by \(NN_2\). Thus, not only does QLF favor the models that produce a higher probability for the true class, but it also penalizes models that concentrate the remaining probability on a single incorrect class (as \(NN_2\) does).

For an entire validation set \({{\mathcal {T}}_v}\),

$$\begin{aligned} Q_{QLF}(NN|{{\mathcal {T}}_v}) = 1-\frac{1}{|{{\mathcal {T}}_v}|} \sum _{x\in {{\mathcal {T}}_v}} QLF(NN|x) \end{aligned}$$
(21)

where \(Q_{QLF}\) is the pheromone amount to be deposited in the pheromone update step.
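The QLF computation can be checked on the three-network example with a short sketch. Class indices are zero-based here, so the true class \(\hat{c}=1\) of the example corresponds to index 0. The function names are hypothetical.

```python
def normalize(outputs):
    """Eq. (16): turn raw network outputs into class probability scores."""
    total = sum(outputs)
    return [o / total for o in outputs]


def qlf(p, true_class):
    """Quadratic loss of one pattern, given scores p and the true class index."""
    return sum(((1.0 if k == true_class else 0.0) - pk) ** 2
               for k, pk in enumerate(p))


def quality_qlf(scored_patterns):
    """Eq. (21): one minus the mean QLF over (scores, true_class) pairs."""
    losses = [qlf(p, c) for p, c in scored_patterns]
    return 1.0 - sum(losses) / len(losses)


for scores in ([0.9, 0.1, 0.0], [0.6, 0.4, 0.0], [0.6, 0.2, 0.2]):
    print(round(qlf(scores, 0), 2))
```

The three printed values reproduce the losses 0.02, 0.32, and 0.24 quoted in the example.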

5.5 Updating pheromone trails

After the quality \(Q_i\) is computed for each candidate solution \(NN_i\) constructed by all the ants in the colony at iteration t, the iteration-best solution is identified and used to update the pheromone amounts on the construction graph. The pheromone amounts are increased on all the components \(D_c^a\) of the solution constructed by the iteration-best ant during its trail, where \(D_c^a\) represents the decision to include (\(a=true\)) or not to include (\(a=false\)) connection c in the NN structure. This influences the probability for the subsequent ants to include, or not to include connection c. The amount of pheromone deposited is based on \(Q_{tbest}\), the quality of the iteration-best solution \(NN_{tbest}\), as follows:

$$\begin{aligned} \tau (D_c^a)= \tau (D_c^a)+ [\tau (D_c^a) \times Q_{tbest}] \ \ \ \forall D_c^a \in NN_{tbest} \end{aligned}$$
(22)

To simulate pheromone evaporation, normalization is then applied on each pair of solution components associated with each connection c in the construction graph. This keeps the total pheromone amount on each pair \(\tau (D_c^{true})\) and \(\tau (D_c^{false})\) equal to 1, as follows:

$$\begin{aligned} \tau (D_c^a)= \frac{\tau (D_c^a)}{ \tau (D_c^{true})+\tau (D_c^{false})} \ \ \ \forall c \in C \end{aligned}$$
(23)
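The deposit and normalization steps described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the dictionary layout and the name `update_pheromone` are assumptions, and the deposit is applied to every decision (include or exclude) made by the iteration-best ant, as the prose describes.

```python
def update_pheromone(tau, best_solution, q_best):
    """Deposit pheromone on the iteration-best decisions, then normalize.

    tau:           dict mapping connection -> {True: float, False: float}
    best_solution: dict mapping connection -> bool (decision of the best ant)
    q_best:        quality of the iteration-best solution
    All three structures are hypothetical representations for this sketch.
    """
    # Deposit step: increase pheromone in proportion to its current value.
    for conn, decision in best_solution.items():
        tau[conn][decision] += tau[conn][decision] * q_best

    # Normalization step (Eq. 23): each true/false pair sums to 1 again,
    # which lowers the share of the component that was not reinforced.
    for pair in tau.values():
        total = pair[True] + pair[False]
        pair[True] /= total
        pair[False] /= total
    return tau


tau = {"c1": {True: 0.5, False: 0.5}, "c2": {True: 0.5, False: 0.5}}
update_pheromone(tau, {"c1": True, "c2": False}, q_best=0.8)
print(tau["c1"])  # the "include" decision for c1 now holds the larger share
```

Because the normalization keeps each pair summing to 1, reinforcing one decision implicitly reduces the pheromone of the opposite decision, which plays the role of evaporation.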

6 Variations of the ANN-Miner algorithm

6.1 Accumulated wisdom

We present two variations of the ANN-Miner algorithm: the first variation is the standard ANN-Miner algorithm described in Sect. 5, and the second is a variation called wANN-Miner that makes use of the optimized weights of the best-so-far network. In both variations, after each \(ant_i\) constructs a candidate network structure \(NN_i\), the neural network is trained using the BP procedure to learn the weights of its connections.

In the standard ANN-Miner algorithm, after \(ant_i\) constructs a candidate network structure \(NN_i\) (where \(NN_i\) consists of a set of inter-neuronal connections without associated weights), the weights of \(NN_i\) are randomly initialized. This means that the algorithm does not make use of the optimized weights of previously constructed neural networks. Such an approach performs a fair comparison between different candidate NN structures, since they all start weight optimization from the same point: random initialization of the weights. In ANN-Miner, we perform BP for each candidate \(NN_i\), for 20 epochs and with a learning rate of 0.1. Moreover, the weight-learning procedure in the post-processing step (described in the following subsection) also starts with randomly initialized weights.

In contrast, the second variation, called wANN-Miner, makes use of the optimized weights of the best-so-far neural network \(NN_{bsf}\) constructed in previous iterations. In other words, in wANN-Miner, the colony retains the weight optimization “wisdom,” and builds on it throughout the algorithm’s execution. Specifically, in this variation, \(NN_{bsf}\) consists of a set of inter-neuronal connections along with their associated weights. After each ant constructs a candidate network structure \(NN_i\), the weights of its connections are not randomly initialized but rather are initialized with the weights present in \(NN_{bsf}\). However, some connections in \(NN_i\) may not be present in \(NN_{bsf}\), in which case their weights will be randomly initialized. Furthermore, there may be some connections in \(NN_{bsf}\) that are not present at all in \(NN_i\). Such differences in the NN structures maintain the exploration aspect of the weight-learning process, in addition to the exploitation aspect that is realized by building on the best weights learned in previous iterations.

The back propagation procedure is then applied to \(NN_i\); if \(NN_i\) produces a better classification quality than \(NN_{bsf}\), \(NN_i\) will replace \(NN_{bsf}\), and its connection weights will be used as initial values in constructing subsequent candidate neural networks.

In wANN-Miner, we perform BP, for each candidate \(NN_i\), for only 10 epochs and with a lower learning rate of 0.05, making use of the accumulated weight optimization wisdom. Moreover, the BP weight-learning procedure in the post-processing step also starts with the weights of \(NN_{bsf}\).
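The weight inheritance step of wANN-Miner can be illustrated with the following Python sketch. The representation of a structure as a collection of (source, destination) pairs, the name `init_weights`, and the random initialization range of [-0.5, 0.5] are all assumptions made for illustration; the text does not specify them.

```python
import random


def init_weights(structure, bsf_weights, rng, low=-0.5, high=0.5):
    """Initialize the weights of a candidate structure from the best-so-far network.

    structure:   iterable of connections, e.g. (source, destination) pairs
    bsf_weights: dict mapping connection -> weight in the best-so-far network
    Connections shared with the best-so-far network inherit its weight;
    connections that are new to the candidate get a random weight.
    Connections that exist only in bsf_weights are simply ignored.
    """
    return {
        conn: bsf_weights[conn] if conn in bsf_weights else rng.uniform(low, high)
        for conn in structure
    }


bsf = {("i1", "h1"): 0.42, ("h1", "o1"): -0.17, ("i2", "h1"): 0.90}
candidate = [("i1", "h1"), ("h1", "o1"), ("i1", "o1")]  # ("i1", "o1") is new
weights = init_weights(candidate, bsf, random.Random(0))
print(weights)
```

Back propagation would then start from `weights` instead of from a fully random initialization.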

6.2 Post-processing procedure

The ANN-Miner algorithm performs a final step to learn the connection weights of the \(NN_{bsf}\) optimized structure produced by the ACO procedure. We use the two NN weight-learning algorithms discussed in Sect. 4:

  1. the standard gradient-descent-based back propagation algorithm,

  2. the ant-based \(\hbox {ACO}_{\mathbb {R}}\) algorithm for continuous optimization.

Furthermore, we also evaluate baseline variations in which no post-processing is applied, and the network \(NN_{bsf}\) is returned without any further weight optimization. The purpose is to test the hypothesis that wANN-Miner may not benefit from the weight-learning post-processing step to the same extent as the first variation, ANN-Miner. This is demonstrated in the results in Sect. 9.

7 Review of related neuroevolutionary methods

Evolving neural network topologies and weights has been of substantial interest to the evolutionary computation community since the 1990s. There are several useful surveys of this area: (Floreano et al. 2008) is a broad survey; (Schliebs and Kasabov 2013) focuses on evolving spiking neural networks; and (Risi and Togelius 2014) focuses on neuroevolution in games.

Some neuroevolutionary methods, including our own ANN-Miner, use a direct topology representation, in which the candidate solution representation includes decision variables to control the existence of every potential connection, and possibly every potential node. Other approaches use an indirect representation, for example an evolvable set of rules or a grammar, to indirectly specify the network topology. Numerous works have employed indirect encodings (Stanley et al. 2009; Hornby and Pollack 2002; Cangelosi et al. 1994; Kodjabachian and Meyer 1998; Stanley 2007; Clune et al. 2009; Valsalam and Miikkulainen 2011; Valsalam et al. 2012). We focus on direct encoding-based methods in this review, since they are more similar to our work.

We suggest a four-category classification of neuroevolutionary methods. The first two categories (which we will call Type I and II) are methods that evolve both the network topology and the weights, without the use of gradient descent. Type I methods are ones that employ some type of crossover operator, and Type II methods are ones that do not. Type III methods evolve the network topology, but use some type of gradient-descent approach to optimize the weights, in order to evaluate the fitness of each constructed topology. Type IV methods evolve the weights for a fixed-topology network. We will consider each of these four types in turn in the following subsections.

7.1 Type I methods

Because Type I methods employ some type of crossover operator, they face challenges not faced by Type II methods. One such challenge is what has been called the “competing conventions” problem (Whitley et al. 1993; Stanley and Miikkulainen 2002). This problem refers to there being multiple ways to express what is intuitively the same network. For example, the two networks in Fig. 1 are logically equivalent, but appear to be different. Although both have the same fitness, crossover between them could result in a very poor fitness solution.

Fig. 1 A practical illustration of the competing conventions problem. The two networks shown are logically the same, but appear to be different

Another problem, shared by Type I and II methods, is that when the network structure is changed through crossover or mutation, by adding or removing an edge or a node, the new network often has initially low fitness. After the weights have had some time to adapt to the new structure, the “true” fitness of the new structure can reveal itself. However, evolutionary pressure will often force the newly created structure to be removed from the population before its weights have had a chance to “catch up.”

This problem is not faced by Type III methods, because they use gradient descent to quickly adapt the weights while computing the fitness of a structure. For Type I and II methods, there is a need to protect a newly created or newly modified structure until its weights have had a chance to adapt.

Prominent in the Type I family of methods is Stanley et al.’s NeuroEvolution through Augmenting Topologies (NEAT) algorithm (Stanley and Miikkulainen 2002; Stanley et al. 2005b), and its variations (Stanley 2007; Stanley et al. 2009; Whiteson et al. 2005). NEAT protects newly created structures through an elaborate mechanism for speciation or niching (Potter and De Jong 1995). NEAT also includes an interesting approach to the competing conventions problem (see below).

NEAT’s solution representation, or genome, includes two types of genes: node genes and connection genes. Each node or potential node in the network has a corresponding node gene. NEAT starts off with a small network consisting only of the input neurons and output neurons, with no hidden neurons. Hidden neurons are then added gradually over the course of the computation. A connection gene specifies a single connection; the gene representation includes: the source node, the destination node, the weight, an “enable bit” that indicates whether or not that connection exists in the network, and an innovation number that we describe further below. NEAT’s genome includes a list of node genes, and a list of connection genes.

Connection weights are adapted by mutation as in most evolutionary algorithms, with each connection being perturbed with some probability in each generation. Structural mutations, which affect the network topology, expand the genome by adding genes. There are two types of structural mutation. In one type, a single new connection gene with a random weight is added connecting two previously unconnected nodes. In another type, an existing connection between two nodes a and b is split into two connections, with a new node c being inserted in between. More specifically, the single connection from a to b with weight w is replaced with a connection from a to the new node c with weight 1, and a second connection from c to b with weight w.
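The node-splitting mutation can be sketched in Python as follows. The `ConnectionGene` class mirrors the gene fields listed in the text, but the class itself, the function name, and the simple global innovation counter are illustrative assumptions. Here the original connection is kept in the genome with its enable bit cleared, which is one way to realize the "replacement" described above.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class ConnectionGene:
    src: int
    dst: int
    weight: float
    enabled: bool
    innovation: int


def add_node_mutation(genome, gene_index, new_node, innovations):
    """Split the connection a -> b at genome[gene_index] with a new node c.

    The old gene is disabled, and two genes are appended:
    a -> c with weight 1, and c -> b with the old weight w.
    `innovations` is an iterator of fresh innovation numbers (an assumption).
    """
    old = genome[gene_index]
    old.enabled = False
    genome.append(ConnectionGene(old.src, new_node, 1.0, True, next(innovations)))
    genome.append(ConnectionGene(new_node, old.dst, old.weight, True, next(innovations)))


innovations = count(2)  # innovation number 1 is already used below
genome = [ConnectionGene(src=1, dst=2, weight=0.7, enabled=True, innovation=1)]
add_node_mutation(genome, 0, new_node=3, innovations=innovations)
for gene in genome:
    print(gene)
```

Using weight 1 on the incoming connection and the old weight on the outgoing one keeps the initial effect of the new path close to that of the original connection.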

Whenever a new gene is created through structural mutation, it is assigned a unique serial ID called the innovation number, which provides a mechanism for tracking the historical origin of a gene. The innovation number is immutable: it is not changed by mutation, and in the case of crossover, a gene crosses over with its innovation number intact.

The historical information captured by the innovation numbers provides a way to implement crossover in a way that minimizes the impact of the competing conventions problem. When crossover is to be performed on two parent genomes P and Q, the genes in both genomes with the same innovation numbers are identified and are called matching genes. In constructing the offspring, genes are randomly chosen from either parent at matching genes, while nonmatching genes are always taken from the more fit parent.
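A minimal Python sketch of this crossover rule is given below. Genomes are represented as dictionaries keyed by innovation number, and genes as plain tuples; both choices, and the name `crossover`, are assumptions for illustration only.

```python
import random


def crossover(fitter, other, rng):
    """Combine two parent genomes using innovation numbers.

    fitter, other: dicts mapping innovation number -> gene; `fitter` is the
    more fit parent. Matching genes (same innovation number in both parents)
    are inherited from a randomly chosen parent; nonmatching genes are
    always inherited from the more fit parent.
    """
    child = {}
    for innovation, gene in fitter.items():
        if innovation in other:
            child[innovation] = rng.choice((gene, other[innovation]))
        else:
            child[innovation] = gene
    return child


# Genes are (source, destination, weight) tuples in this toy example.
parent_p = {1: (1, 3, 0.5), 2: (2, 3, -0.2), 4: (1, 4, 0.9)}   # more fit
parent_q = {1: (1, 3, 0.1), 2: (2, 3, 0.7), 3: (2, 4, 0.3)}
child = crossover(parent_p, parent_q, random.Random(0))
print(child)
```

Because genes are aligned by historical origin rather than by position, two parents that encode the same connection in different places still exchange that connection consistently.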

The percentage of matching genes between two genomes P and Q in the population can be used to produce a distance measure \(\delta \) that measures the degree of similarity between P and Q. Genomes whose distance from one another is less than some compatibility threshold \(\delta _t\) are taken to be members of the same species. To prevent overlap between species, in each generation, a genome is placed in the first species that it is found to be compatible with. In this way, the population can be partitioned in each generation into species, with crossover taking place almost exclusively within a species, although interspecies crossover is allowed with a very low probability.

To prevent a high-fitness species from dominating the entire population, a fitness sharing (Goldberg and Richardson 1987) mechanism is applied: If a genome P has an actual fitness of x, then its adjusted fitness is set to x / n where n is the number of members of P’s species in the current generation. The number of offspring each species S is allocated in generation \((t+1)\) is determined in proportion to the sum of the adjusted fitness of the members of S in generation t. Each species S then first eliminates the least-fit members of S, then the surviving members reproduce to create S’s assigned-share of the population of generation \((t+1)\).
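The fitness sharing and offspring allocation rules can be illustrated with a small Python sketch. The function names and the toy fitness values are hypothetical, and the sketch returns real-valued shares; how a full implementation rounds them to whole offspring is not covered here.

```python
def adjusted_fitness(fitness_by_species):
    """Divide each genome's fitness by the size of its species (x / n)."""
    return {
        species: [f / len(values) for f in values]
        for species, values in fitness_by_species.items()
    }


def offspring_shares(fitness_by_species, population_size):
    """Allocate offspring in proportion to each species' summed adjusted fitness."""
    adjusted = adjusted_fitness(fitness_by_species)
    totals = {species: sum(values) for species, values in adjusted.items()}
    grand_total = sum(totals.values())
    return {
        species: population_size * total / grand_total
        for species, total in totals.items()
    }


# Toy example: a large species of average genomes and a small species
# containing one strong genome.
fitness = {"A": [4.0, 4.0, 4.0, 4.0], "B": [6.0, 2.0]}
shares = offspring_shares(fitness, population_size=150)
print(shares)  # {'A': 75.0, 'B': 75.0}
```

Since the summed adjusted fitness of a species equals its mean raw fitness, a species cannot gain a larger share merely by having more members.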

NEAT has been applied primarily to reinforcement learning problems, including its use in games (Stanley and Miikkulainen 2004; Stanley et al. 2005, 2005b)—although recent work (Sohangir et al. 2014) has applied it to classification. Extensions to NEAT include a real-time version (Stanley et al. 2005b), a variant called HyperNEAT which uses an indirect hypercube representation to specify the topology (Stanley et al. 2009), and a variation for evolving gene regulatory networks (Cussat-Blanc et al. 2015).

Also within the Type I family is a hybrid approach (Yu et al. 2007) which used a PSO variation that includes crossover to optimize the weights and topology of single-hidden-layer feed-forward networks, and the work of others who used conventional genetic algorithms (Castillo et al. 2000; Whitley et al. 1993).

7.2 Type II methods

Type II methods avoid the competing conventions problem entirely by not employing crossover at all, relying instead on mutation/perturbation operators. Evolutionary computation approaches within this family include evolutionary programming approaches (which use mutation alone without crossover) (Palmes et al. 2005; McDonnel and Waagen 1993; Gutiérrez et al. 2011; Fogel 1993; Ang et al. 2008; Fang and Xi 1997). Some used evolutionary programming approaches where the mutation parameters are controlled by an annealing temperature (Angeline et al. 1994; Leung et al. 2003), or by other heuristic measures (Oong and Isa 2011).

Several researchers (Chan et al. 2013; Yu et al. 2008) have used PSO approaches, without crossover, to optimize the weights and structure of single-hidden-layer feed-forward networks.

7.3 Type III methods

Type III methods use evolutionary computation (broadly defined) to evolve the network topology, but use some type of gradient descent to optimize the weights. This is the approach that we follow in the ANN-Miner algorithm. In such methods, fitness evaluation requires running a gradient-descent algorithm (such as back propagation) within the fitness function, typically starting from randomly initialized weights. Examples of this approach include several works (Yao and Liu 1997; Whitley et al. 1990; Martínez-Estudillo et al. 2005). Note that we include in this category methods which optimize weights using a combination of a gradient-based method and another method. For example, one approach (Yao and Liu 1997) used a combination of gradient descent and simulated annealing. Another approach (Martínez-Estudillo et al. 2005) used an elaborate scheme where evolutionary computation operators are employed to obtain an initial set of weights, which are then clustered, and the Levenberg–Marquardt (LM) gradient-based method is applied to the best of each cluster.

7.4 Type IV methods

Type IV methods evolve the weights of a fixed-topology neural network. The \(\hbox {ACO}_{\mathbb {R}}\) (Socha and Dorigo 2008) algorithm can be considered to fall in this category. Other examples include numerous PSO-based approaches (Yeh 2013; Cai et al. 2010; Salerno 1997; Lu et al. 2003; Juang 2004; Settles et al. 2003; Dutta et al. 2013; Dehuri et al. 2012; Okada 2014; Song et al. 2007; Yeh et al. 2011; Han et al. 2011b; Lin et al. 2009), as well as methods that combine PSO with simulated annealing (Da and Xiurun 2005), in addition to approaches based on differential evolution (Ilonen et al. 2003; Garro et al. 2011) and cuckoo search (Valian et al. 2011; Nawi et al. 2013). Numerous conventional genetic algorithm approaches for evolving the weights of a fixed-topology network have also been explored (Gomez and Miikkulainen 1999; Saravanan and Fogel 1995; Yang and Kao 2001; Coshall 2009; Kang et al. 2010).

8 Experimental methodology

8.1 Comparative evaluation

In our experiments, we compare the predictive performance of several NN learning algorithms. First, as a baseline, we use the standard three-layer topology with back propagation (BP) for weight learning. This is referred to as 3L-BP. In addition, we also use the standard three-layer structure, but with \(\hbox {ACO}_{\mathbb {R}}\) for weight optimization, and refer to it as 3L-\(\hbox {ACO}_{\mathbb {R}}\). Furthermore, we use five variations of our proposed ant-based algorithm for optimizing NN structure (ANN-Miner, ANN-Miner-BP, wANN-Miner, wANN-Miner-BP, and wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\)). Each of our ANN-Miner variations is defined by: 1) whether it randomly initializes the connection weights of each candidate network or carries over the optimized weights of the best-so-far network throughout the algorithm, and 2) whether it uses a weight-learning post-processing step and, if so, which algorithm is used in that step. Table 1 summarizes the NN learning algorithms used in the experiments.

Table 1 Neural network learning algorithms used in the experiments

Moreover, we implemented a Greedy Hill-Climbing (GHC) approach to learn NN structures, using back propagation (BP) as a subroutine to learn NN weights. This is referred to as GHC-BP. The algorithm starts with a maximally connected multilayer NN structure containing all the possible connections between the network neurons. Then the algorithm attempts to prune the NN structure using the first-improvement approach, as follows. Iteratively, GHC-BP temporarily removes one connection from the NN structure, learns the weights of the new NN using BP, and evaluates its quality. If the quality improves (or does not change), this connection is removed permanently from the NN structure; otherwise the connection is returned back to the structure. Such an algorithm allows us to examine the effect of using the ACO meta-heuristic as a global search for optimizing the NN structures in comparison with using a greedy local search. The pseudo-code of GHC-BP is presented in Algorithm 3.
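The pruning loop described above can be sketched in Python as follows. This is a reading of the prose description, not a transcription of Algorithm 3. The `evaluate` callback is a hypothetical stand-in for "learn the weights with BP and compute the quality", and whether the initial evaluation of the full structure counts toward `max_evaluations` is an assumption (here it does not).

```python
def greedy_prune(connections, evaluate, max_evaluations):
    """First-improvement greedy pruning of a maximally connected structure.

    connections:     ordered list of all possible connections
    evaluate:        callable taking a set of connections, returning a quality
                     (hypothetical stand-in for BP training plus evaluation)
    max_evaluations: budget on the number of candidate evaluations
    """
    current = set(connections)
    best_quality = evaluate(current)
    evaluations = 0
    for conn in connections:
        if evaluations >= max_evaluations:
            break
        candidate = current - {conn}          # temporarily remove one connection
        quality = evaluate(candidate)
        evaluations += 1
        if quality >= best_quality:           # improved or unchanged: remove permanently
            current, best_quality = candidate, quality
        # otherwise the connection stays in the structure
    return current, best_quality


# Toy quality function: connections "a" and "b" are useful, the rest add cost.
def toy_quality(structure):
    useful = len(structure & {"a", "b"})
    return useful - 0.1 * (len(structure) - useful)


pruned, quality = greedy_prune(["a", "b", "c", "d"], toy_quality, max_evaluations=10)
print(sorted(pruned), quality)  # ['a', 'b'] 2.0
```

The loop makes a single pass over the connections, so it stops early when the number of connections is smaller than the evaluation budget.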

8.2 Experimental setup

The experiments were carried out using the stratified 10-times 10-fold cross-validation procedure. In essence, a dataset is divided into 10 mutually exclusive partitions (folds), with approximately the same number of patterns in each partition. Then each classification algorithm is run 10 times, where each time a different partition is used as the test set and the other nine partitions are used as the training set. The results are then averaged and reported as the accuracy rate of the NN classifier. Since we are evaluating stochastic algorithms, we run each 10 times—using a different random seed to initialize the search each time—for each of the 10 iterations of the cross-validation procedure.
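The evaluation protocol can be sketched in Python as follows. The function names, the toy labels, and the round-robin way of dealing patterns into folds are assumptions for illustration; the text only states that the folds are stratified and of approximately equal size.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split pattern indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    dealt = 0  # running counter keeps the fold sizes balanced across classes
    for indices in by_class.values():
        rng.shuffle(indices)
        for idx in indices:
            folds[dealt % k].append(idx)
            dealt += 1
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Return one (train, test) pair of index lists per fold."""
    folds = stratified_folds(labels, k, random.Random(seed))
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


labels = [0] * 30 + [1] * 20          # toy class labels, not from the paper
splits = cross_validation_splits(labels)

# Each stochastic algorithm is run with 10 different seeds on each of the 10 folds.
runs = [(fold, seed) for fold in range(len(splits)) for seed in range(10)]
print(len(runs))  # 100
```

Each entry of `runs` corresponds to one training and testing run, and the reported accuracy is the average over all of them.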

The parameter configuration used in our experiments is shown in Table 2. For the sake of fairness of comparison, we limit each algorithm to the same fixed number of solution evaluations to construct a final NN classifier. In GHC-BP (Algorithm 3, line 6), the external parameter max_evaluations represents the maximum number of solution evaluations that the algorithm performs during the hill-climbing search. As can be seen in Table 2, it is set equal to max_iterations multiplied by colony_size, which is the maximum number of evaluations for ANN-Miner. However, note that the maximum number of evaluations might not be utilized completely. The ACO-based algorithms might use a smaller number of iterations if they converge earlier, and the greedy algorithm might also stop earlier if the total number of connections in the fully connected NN structure is less than max_evaluations.

Table 2 Parameter settings used in experiments

The performance of ANN-Miner was evaluated using 40 public-domain datasets from the well-known UCI (University of California at Irvine) dataset repository (Asuncion and Newman 2007). The main characteristics of the datasets are shown in Table 3.

Table 3 Characteristics of the datasets used in the experiments

In follow-up experiments, described in Sects. 9.4 and 9.5, we compare our approach to a number of well-known state-of-the-art and baseline classifiers, and to NEAT, a prominent neuroevolutionary technique.

9 Computational results

9.1 Predictive accuracy

Predictive accuracy results are reported in Table 4 for each of the algorithms under evaluation. These results represent the average predictive accuracy over 100 runs of the 10-times 10-fold cross-validation procedure described in Sect. 8, for each of the 40 datasets. For each dataset, the highest accuracy is shown in boldface. In addition, the last row of the table reports the average rank of each algorithm. For each algorithm g, the rank of g is first obtained for each dataset individually, and then the individual dataset ranks are averaged across the 40 datasets for each algorithm. In case two or more algorithms are tied for a given dataset, then the tied algorithms are given the average of the ranks that they span.
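The ranking scheme, including the treatment of ties, can be illustrated with the following Python sketch. The function names and the toy accuracy values are hypothetical and are not taken from Table 4.

```python
def rank_with_ties(scores):
    """Rank algorithms on one dataset; rank 1 is the highest accuracy.

    Tied scores receive the average of the ranks that they span.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average = ((pos + 1) + (end + 1)) / 2   # mean of the spanned 1-based ranks
        for j in range(pos, end + 1):
            ranks[order[j]] = average
        pos = end + 1
    return ranks


def average_ranks(accuracy_table):
    """accuracy_table: one row per dataset, one column per algorithm."""
    per_dataset = [rank_with_ties(row) for row in accuracy_table]
    return [sum(column) / len(per_dataset) for column in zip(*per_dataset)]


# Toy table: two datasets, three algorithms.
table = [
    [90.0, 85.0, 90.0],   # first and third algorithm tie for ranks 1 and 2
    [80.0, 82.0, 70.0],
]
print(rank_with_ties(table[0]))  # [1.5, 3.0, 1.5]
print(average_ranks(table))      # [1.75, 2.0, 2.25]
```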

Table 4 Predictive accuracy (%) results for the algorithms under evaluation

From the table, we note that the best-ranking algorithm is wANN-\(\hbox {ACO}_{\mathbb {R}}\) with a rank of 1.88, followed closely by wANN-BP with a rank of 2.28. These are followed by ANN-BP in third place with a rank of 3.54, then by wANN in fourth place with a rank of 4.15. In fifth place is 3L-\(\hbox {ACO}_{\mathbb {R}}\) with a rank of 5.63. Finally, in the last three places, respectively, are ANN with a rank of 6.06, 3L-BP with a rank of 6.23, and the greedy GHC-BP with a rank of 6.25.

wANN-\(\hbox {ACO}_{\mathbb {R}}\) had the highest predictive accuracy in 22 of the 40 datasets, and wANN-BP had the highest accuracy in 20 datasets. wANN and ANN-BP had the highest accuracy in 6 and 5 datasets, respectively. 3L-BP and 3L-\(\hbox {ACO}_{\mathbb {R}}\) each had the highest accuracy in a single dataset, while ANN and GHC-BP did not have the highest accuracy in any datasets.

Table 5 reports the results of applying a nonparametric Friedman test with the Holm post-hoc test (Derrac et al. 2011), at the conventional 0.05 threshold, to compare all pairings of the eight algorithms under evaluation. The Friedman statistic \(\chi ^2_F\) is found to be 150.9 with seven degrees of freedom, corresponding to a p value of 9E\(-\)11. Thus, we can reject the null hypothesis and proceed with the post-hoc tests. For each pairing, we report the computed p value, and the corresponding Holm critical value. The difference between the two algorithms is statistically significant if the p value is less than or equal to the corresponding Holm threshold. Statistically significant p values are shown in boldface. We observe the following:

  • wANN-\(\hbox {ACO}_{\mathbb {R}}\) is significantly better than all the other algorithms, except for wANN-BP—which differs from it only in the type of post-processing that is employed.

  • The use of post-processing always results in a statistically significant improvement: wANN-\(\hbox {ACO}_{\mathbb {R}}\) and wANN-BP are both significantly better than wANN; ANN-BP is significantly better than ANN.

  • The use of “wisdom” results in a statistically significant improvement in accuracy when combined with \(\hbox {ACO}_{\mathbb {R}}\) post-processing, but not when combined with BP post-processing: wANN-\(\hbox {ACO}_{\mathbb {R}}\) is significantly better than ANN-\(\hbox {ACO}_{\mathbb {R}}\), but no statistically significant improvement was detected for wANN-BP over ANN-BP. Without post-processing: wANN is significantly better than ANN.

Table 5 Results of the Friedman test with the Holm post-hoc test, at the 0.05 significance threshold, for the predictive accuracy results reported in Table 4
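For readers unfamiliar with the Holm procedure, the following Python sketch shows how the critical values are obtained, assuming the standard step-down form of the test. The function name and the p values in the example are hypothetical and are not taken from Table 5.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns, for each hypothesis in the original order, a pair
    (threshold, significant). The i-th smallest p value (0-based step i)
    is compared with alpha / (m - i); once one comparison fails, all
    remaining hypotheses are treated as not significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    results = [None] * m
    still_rejecting = True
    for step, i in enumerate(order):
        threshold = alpha / (m - step)
        still_rejecting = still_rejecting and p_values[i] <= threshold
        results[i] = (threshold, still_rejecting)
    return results


# Hypothetical p values for four pairwise comparisons.
for p, (threshold, significant) in zip([0.001, 0.04, 0.03, 0.2],
                                       holm([0.001, 0.04, 0.03, 0.2])):
    print(f"p={p:<6} threshold={threshold:.4f} significant={significant}")
```

The thresholds grow from alpha / m for the smallest p value up to alpha for the largest, which makes the procedure less conservative than a plain Bonferroni correction.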

9.2 Model size

It is also interesting to consider the model size, expressed as the number of connections in the neural network, produced by each of the algorithms under evaluation. The model size of the baseline 3L-BP and 3L-\(\hbox {ACO}_{\mathbb {R}}\) algorithms is of course fixed, and is shown in the second column of Table 6 (under the heading 3L). For each of the other algorithms, the table reports the ratio of the average number of connections (averaged over the 100 runs of the 10-times 10-fold cross-validation procedure) to the number of connections reported for 3L. The final row reports the average ratio for each algorithm. The network size, of course, does not depend on the type of post-processing that is employed, if any. Therefore, we report the network size results for ANN-Miner, which will be the same for ANN-Miner-BP. Similarly, we report the size results for wANN-Miner, which will be the same for wANN-Miner-BP and wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\).

Table 6 Model size (expressed as number of inter-neuronal connections) results for the algorithms under evaluation

From the table, we note that the average size ratio for wANN and ANN is only slightly larger than the baseline 3L—a ratio of 1.17 for wANN and 1.14 for ANN. The largest ratio (1.34) is obtained with the greedy GHC-BP.

9.3 Discussion

In this subsection, we perform a more detailed analysis of the results by comparing the effectiveness of different aspects of the proposed algorithm in improving the predictive quality of the produced NN classification models, as follows:

  • The “wisdom”-based variation: We note that the “wisdom”-based versions of ANN-Miner produce better accuracy than the corresponding standard versions. wANN-BP has better accuracy than ANN-BP in 28 out of the 40 datasets, and wANN has better accuracy than ANN in 31 out of the 40 datasets. The two versions of wANN-Miner with post-processing (either with BP or \(\hbox {ACO}_{\mathbb {R}}\)) have better predictive accuracy average ranks than the other algorithms under comparison. wANN-\(\hbox {ACO}_{\mathbb {R}}\) is significantly better, in predictive accuracy, than each of the other algorithms, except for wANN-BP.

  • Weight-learning post-processing: It is interesting to consider whether wANN-Miner benefits from weight-learning post-processing to the same extent as ANN-Miner. wANN-BP has better accuracy than wANN in 32 datasets, worse in zero datasets, and the same in eight datasets. On the other hand, ANN-BP has better accuracy than ANN in all 40 datasets, without any ties. Thus, both variations benefit from post-processing (which is hardly surprising), but the accumulative wisdom variation benefits slightly less. Regarding the weight-learning algorithm used in the post-processing step: wANN-\(\hbox {ACO}_{\mathbb {R}}\) has better accuracy than wANN-BP in 20 datasets, worse in 13 datasets, and the same in 7 datasets.

  • ACO versus greedy search: Comparing wANN-BP to the greedy GHC-BP, we find that wANN-BP has better accuracy in 39 datasets, and worse in a single dataset. All the variations of ANN-Miner that include post-processing (i.e., wANN-\(\hbox {ACO}_{\mathbb {R}}\), wANN-BP, and ANN-BP) had better accuracy than GHC-BP to a statistically significant extent. Even without post-processing, wANN had significantly better accuracy than GHC-BP. The only variation for which a statistically significant improvement over GHC-BP was not detected was ANN. This is in spite of GHC-BP producing larger networks than all of the ANN-Miner variations in 39 out of the 40 datasets.

Of course, all versions of ANN-Miner require much more time in the training phase than a simple fixed-topology neural network. However, this affects only the training phase, which usually takes place off-line before an application is deployed, and does not affect the operating phase. In many applications, the time consumed by the training phase is not important compared to the predictive accuracy of the operating phase.

The time consumed by a neural network in the operating phase is a function of the number of connections in the network. Table 6 indicates that the difference in network size between, for example, wANN and the fixed three-layer topology is somewhat modest—ranging from 8 % larger to 24 % larger, with an average of 17 % larger.

9.4 Comparison to state-of-the-art classifiers

As a follow-up experiment, we compare the best two approaches in Table 4, namely wANN-BP and wANN-\(\hbox {ACO}_{\mathbb {R}}\), to several well-established strong classifiers:

  • the Ripper classification rule induction algorithm using its Weka (Witten et al. 2010) implementation JRip;

  • the C4.5 decision tree induction algorithm using its Weka implementation J48;

  • two versions of the support vector machine (SVM) classifier: the quadratic-kernel-based Weka SVM implementation SMO, and the Gaussian kernel-based C-language LibSVM (Chang and Lin 2011) implementation;

as well as to two well-known baseline classifiers:

  • the one-nearest-neighbor algorithm using its Weka implementation IB1;

  • the Naïve-Bayes classifier using its Weka implementation NB.

A recent large-scale empirical study (Fernández-Delgado et al. 2014) used 121 datasets to compare 179 classifiers, representing 17 classifier families and concluded that one of the most effective families was support vector machines, and that the LibSVM implementation with Gaussian kernels in particular was one of the most effective classifiers in general.

We applied each of these six algorithms (LibSVM, SMO, JRip, J48, IB1, NB) to the 40 datasets used in our experiments, with stratified 10-fold cross-validation, as described in Sect. 8, using the same fold partitioning used in the other experiments in this paper. We used Weka’s default parameters for the five Weka implementations and used LibSVM’s default parameters for LibSVM.

The results are shown in Table 7; note that the results for wANN-BP and wANN-\(\hbox {ACO}_{\mathbb {R}}\) are repeated for convenience from Table 4. The last row of the table indicates the average rank of each algorithm. We observe that the best average rank was obtained, not surprisingly, by LibSVM (the SVM implementation with Gaussian kernels), followed in second place by SMO (the SVM implementation with quadratic kernels). These were followed by wANN-\(\hbox {ACO}_{\mathbb {R}}\) in third place, J48 in fourth place, and wANN-BP in fifth place.

Table 7 Predictive accuracy (%) results for two of our proposed algorithms (repeated from Table 4) along with several well-known classifiers

Table 8 reports the results of applying a nonparametric Friedman test with the Holm post-hoc test, at the conventional 0.05 threshold, to compare wANN-\(\hbox {ACO}_{\mathbb {R}}\) (which is treated as the control algorithm) to each of the other algorithms. The Friedman statistic \(\chi _F^2\) is determined to be 43.6 with seven degrees of freedom, corresponding to a p value of 2E\(-\)7. Thus, we can reject the null hypothesis and proceed with the post-hoc tests. For each comparison, we report the computed p value and the corresponding Holm critical value. Statistical significance is detected if the p value is less than or equal to the corresponding Holm threshold. Statistically significant p values are shown in boldface. We observe that:

  • LibSVM is significantly better than wANN-\(\hbox {ACO}_{\mathbb {R}}\);

  • No statistically significant difference is detected between wANN-\(\hbox {ACO}_{\mathbb {R}}\) and any of the other algorithms.

The reader should note that the motivation behind the experiment described in this subsection is only to show that the arbitrary-topology feed-forward neural networks evolved by ANN-Miner have strong predictive accuracy performance (i.e., performance that is similar to that of widely used classifiers). We recognize that what we are comparing is a feed-forward neural network whose topology has been specifically optimized for each dataset, to classifiers such as J48 and SMO with default parameter values that have not been specifically optimized for each dataset. (However, of course, note that the parameters of the ANN-Miner algorithm itself (and its variations) are also general default parameters, and have not been specifically optimized for each dataset.) Each run of ANN-Miner (or its variations) constructs and evaluates up to 5000 neural networks (see Table 2). In the case of the better-performing wANN-Miner variation, each constructed network is trained for 10 epochs, for a total of up to 50,000 epochs. Run-time will therefore generally be much greater than any of the classifiers that we consider in this experiment. Again, the purpose of this experiment is only to show that ANN-Miner is an effective method for evolving neural networks—i.e., that it is capable of evolving neural networks whose predictive accuracy is competitive with established techniques.

Table 8 Results of a Friedman test with the Holm post-hoc test applied to the results of Table 7, using wANN-\(\hbox {ACO}_{\mathbb {R}}\) as the control method

9.5 Comparison with NEAT

As a second follow-up experiment, we compare the results of our approach to the NEAT (Stanley and Miikkulainen 2002) neuroevolutionary algorithm. As described in Sect. 7, NEAT is a sophisticated algorithm that includes the idea of dividing the population into species, with most recombination (or “breeding”) occurring among members of the same species. NEAT is also one of the few neuroevolutionary methods whose source code is publicly available. In fact, NEAT has several publicly available implementations (which can be found at K. Stanley’s NEAT website (Stanley 2015)), including at least four in C++ and at least two in Java, as well as implementations in Matlab, C#, and Python. We used Ugo Vierucci’s Java implementation, which was based on K. Stanley’s original C++ implementation.

With the exception of population size and number of generations (discussed further below), we used the default parameter settings that were included in NEAT’s source code distribution, which are consistent with the parameter settings used by Stanley and Miikkulainen (2002). For the comparison between NEAT and ANN-Miner to be fair, we wanted both techniques to have the same computational budget; in other words, we wanted the total number of constructed networks to be the same for each method. For ANN-Miner and its variations, the total number of constructed networks is the colony size multiplied by the maximum number of iterations. As Table 2 indicates, we used a colony size of 10 and a maximum of 500 iterations, which means that the total number of constructed networks is limited to no more than 5000. Stanley and Miikkulainen (2002) used a population size of 150 and 24 generations, for a total of 3600 fitness evaluations. To bring the computational budgets of the two methods into line, we kept NEAT’s population size at 150 and increased the number of generations to 34, for a total of 5100 fitness evaluations (34 is the smallest number of generations for which the total reaches 5000). For each dataset, we also set NEAT’s maximum allowable number of hidden neurons to be the same as the maximum number of hidden neurons for ANN-Miner.
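The budget arithmetic in this paragraph can be checked with a short sketch. The helper name `generations_for_budget` is hypothetical, and rounding up to the next whole generation is only one plausible reading of how the value 34 was chosen.

```python
import math


def generations_for_budget(budget, population_size):
    """Smallest number of generations whose evaluations reach the budget."""
    return math.ceil(budget / population_size)


ann_miner_budget = 10 * 500   # colony size x maximum iterations (Table 2)
neat_original = 150 * 24      # population size x generations in the original NEAT setup
neat_generations = generations_for_budget(ann_miner_budget, 150)
neat_budget = 150 * neat_generations

print(ann_miner_budget, neat_original, neat_generations, neat_budget)
# prints: 5000 3600 34 5100
```

With 33 generations NEAT would perform only 4950 evaluations, which is why the matched setting slightly exceeds the ANN-Miner limit of 5000 rather than falling short of it.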

Further, for the sake of fairness of comparison, we added a BP post-processing step to NEAT. This means that the final network produced by NEAT underwent 1000 epochs of BP using the BP post-processing parameter settings shown in Table 2. The BP post-processing step in NEAT-BP is identical to the post-processing that is applied in ANN-BP and wANN-BP.

NEAT-BP was applied to the 40 datasets used in our experiments, with stratified 10-times 10-fold cross-validation, as described in Sect. 8, using the same fold partitioning used in the other experiments in this paper. This means that NEAT-BP was run a total of 100 times for each dataset.
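As an illustration of this protocol, the following sketch generates the 100 train/test splits of a stratified 10-times 10-fold cross-validation. It is not the partitioning code used in our experiments: the function names, the round-robin assignment of patterns to folds, and the toy labels are assumptions made for the example. Fixing the seed is what allows the same fold partitioning to be reused across algorithms.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split pattern indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:
            folds[position % k].append(index)  # deal indices round-robin
            position += 1
    return folds


def repeated_stratified_cv(labels, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for every split."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for fold, test in enumerate(folds):
            train = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, train, test


# Toy labels for illustration only (not one of the 40 datasets).
labels = ["a"] * 60 + ["b"] * 40
splits = list(repeated_stratified_cv(labels))
print(len(splits))  # prints: 100
```

Each of the 100 splits corresponds to one run of the algorithm under evaluation, which matches the total of 100 runs per dataset stated above.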

Table 9 reports the predictive accuracy results for NEAT-BP; for convenience, we also repeat the predictive accuracy results for two ANN variations (ANN-BP and ANN), and for two baseline classifiers (1NN and NB). For NEAT-BP and ANN-BP, the table also reports for each dataset the number of inter-neuronal connections, expressed as a multiple of the number of connections in the baseline BP fixed-topology three-layer network (reported in Table 6 in the column labeled 3L). For example, for the annealing dataset, NEAT had an average size (number of inter-neuronal connections) that was 0.43 times the size (i.e., slightly less than half the size) of the baseline fixed-topology three-layer network, while ANN-BP had an average size that was 1.12 times the size of (i.e., slightly larger than) the baseline three-layer topology. The size for ANN is of course always the same as for ANN-BP, and is therefore not shown. The last row reports the average rank for the predictive accuracy columns, and the average size ratio for the model size columns.
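The two summary quantities in the last row can be computed as in the following sketch. We assume here that, on each dataset, the most accurate algorithm receives rank 1 and that tied algorithms share the average of the ranks they span, as is usual for the Friedman test. The function names and all numbers in the example are hypothetical placeholders and are not taken from Table 9.

```python
def ranks_descending(scores):
    """Rank scores so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def average_ranks(accuracy_table):
    """accuracy_table: one row per dataset, one accuracy per algorithm."""
    per_dataset = [ranks_descending(row) for row in accuracy_table]
    return [sum(column) / len(per_dataset) for column in zip(*per_dataset)]


def size_ratio(connections, baseline_connections):
    """Model size as a multiple of the baseline three-layer network size."""
    return connections / baseline_connections


# Placeholder accuracies: two datasets (rows), three algorithms (columns).
table = [
    [90.0, 85.0, 85.0],
    [70.0, 75.0, 60.0],
]
print(average_ranks(table))  # prints: [1.5, 1.75, 2.75]
print(size_ratio(43, 100))   # prints: 0.43
```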

Table 9 Predictive accuracy (%) results for NEAT-BP, along with two ANN-Miner variations (repeated from Table 4) and two baseline classifiers (repeated from Table 7), as well as model size results for NEAT-BP and for ANN-BP (repeated from Table 6)

We can make several observations regarding Table 9. We observe that the average number of connections is generally much smaller for NEAT than for ANN-Miner. For example, the number of connections for NEAT is less than a quarter of the connections for ANN for the cylinder dataset, but is almost equal for the car dataset; on average, over all datasets, the ratio of the number of connections for ANN to NEAT is 1.70.

However, NEAT’s smaller model size came at the expense of predictive accuracy. ANN-BP had better accuracy than NEAT-BP on 37 datasets, and worse on three datasets. It is interesting to also compare ANN to NEAT-BP, although this is not a fair comparison since ANN does not employ BP post-processing while NEAT-BP does. We find that ANN had better accuracy than NEAT-BP on 31 datasets, and worse on nine datasets.

Comparing NEAT-BP to the two baseline classifiers, we find the following: compared to one-nearest-neighbor, NEAT-BP had better accuracy on 14 datasets and worse on 26 datasets; compared to NB, NEAT-BP had better accuracy on 13 datasets and worse on 27 datasets.

As Table 9 indicates, there are five datasets for which NEAT-BP’s performance is particularly poor: annealing, chess, ecoli, nursery, and soybean. Several intersecting factors may explain this performance. For four of those datasets (the exception being chess), the number of class labels is large: five classes for nursery, six for annealing, eight for ecoli, and a very large 19 for soybean. Those five datasets also stand out for their large number of attributes (particularly categorical attributes): 38 attributes (including 29 categorical) for annealing, 36 attributes (all categorical) for chess, seven attributes for ecoli, eight attributes (all categorical) for nursery, and 35 attributes (all categorical) for soybean. Two of those five datasets stand out for their large number of instances: 3,192 instances for chess and 12,960 instances for nursery. The soybean dataset stands out for its small number of instances (307 instances) relative to its large number of attributes (35 attributes) and large number of classes (19 classes). For three of those datasets, specifically annealing, chess, and soybean, it is noteworthy that the average model size evolved by NEAT is particularly small (less than half the number of connections in the baseline 3L topology in all three cases).

10 Conclusions and future work directions

The results reported in this paper, using 40 UCI benchmark datasets, indicate that ANN-Miner is an effective algorithm for optimizing the structure of a feed-forward neural network to a specific dataset, producing improved predictive accuracy compared to the standard three-layer topology and compared to a greedy algorithm for neural network structure optimization.

We have investigated several versions of ANN-Miner, and found the best performing version, in terms of predictive accuracy, to be wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\). In this version, each newly created neural network is initialized with the weights of the best-encountered-so-far network, thus accumulating “wisdom” as the algorithm execution progresses. Furthermore, wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\) uses BP to train each created neural network during the algorithm’s execution, and then uses \(\hbox {ACO}_{\mathbb {R}}\) as a post-processor to optimize the weights of the final topology. In terms of model size, wANN-Miner produced model sizes that were only slightly larger (specifically 17% larger on average) than the baseline three-layer topology.

The wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\) model was then compared to several state-of-the-art classifiers (SVM with Gaussian kernels and with quadratic kernels, Ripper, and C4.5) and to two widely used baseline classifiers (Nearest Neighbor and Naïve Bayes). In this comparison, wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\) ranked behind the two SVM variations, in terms of test set predictive accuracy, and ahead of the other four classifiers. Only the Gaussian-kernel SVM was significantly better than wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\); no statistically significant difference was detected between wANN-Miner-\(\hbox {ACO}_{\mathbb {R}}\) and any of the other five classifiers. In addition, wANN-Miner-BP was compared to the NEAT neuroevolutionary algorithm and found to have significantly better test set predictive accuracy, when both algorithms were given comparable CPU resources.

In future work, we would like to extend our ACO approach to optimize the structure of adaptive neurofuzzy inference systems (ANFIS) (Jang et al. 1997). Specifically, we would like to optimize the number of fuzzy rules, the number of fuzzy membership functions for each input, and the type of membership functions deployed in the fuzzification layer. This can later be further extended to optimizing the structure of Type-2 Fuzzy Systems (Karnik et al. 1999), for which manual tuning can be more challenging than in conventional fuzzy systems.

There are also several variations to the ANN-Miner algorithm that we would like to explore:

  • We would like to apply the greedy network pruning approach of the GHC algorithm (Algorithm 4) as an additional post-processing step in ANN-Miner. This could potentially result in a reduction in network size without affecting accuracy.

  • Other variations of BP can be used in place of standard BP inside the ANN-Miner algorithm. For example, resilient propagation (RP) often performs better than BP without consuming more CPU time.

  • It is possible to adjust the number of epochs for which BP is allowed to run, inside ANN-Miner, based on the number of connections in the constructed network. The product of the number of epochs and the number of connections can be required to be constant. This means that constructed networks with a larger number of connections would be allowed a smaller number of epochs, while networks with a smaller number of connections would be allowed a larger number of epochs.

  • In the \(\hbox {ACO}_{\mathbb {R}}\) algorithm, we would like to explore using the Cauchy probability distribution in place of the Gaussian probability distribution. The Cauchy distribution has much heavier tails, and therefore has the potential to promote greater search diversity and help avoid local minima traps.
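Two of the variations listed above can be sketched briefly. The first function allots BP epochs so that the product of epochs and connections stays approximately constant; the second draws a Cauchy-distributed value by inverting the Cauchy cumulative distribution function, which could stand in for the Gaussian draw in \(\hbox {ACO}_{\mathbb {R}}\). Both function names, the reference budget of 10 epochs for a 200-connection network, and the unit location and scale values are assumptions made for illustration and are not settings used in this paper.

```python
import math
import random


def epochs_for(connections, budget):
    """Epochs allowed so that epochs * connections stays close to budget."""
    return max(1, budget // connections)


def sample_cauchy(location, scale, rng):
    """Draw from a Cauchy distribution by inverting its CDF."""
    u = rng.random()
    return location + scale * math.tan(math.pi * (u - 0.5))


# Hypothetical budget: 10 epochs for a reference network of 200 connections.
budget = 10 * 200
for connections in (100, 200, 400):
    print(connections, epochs_for(connections, budget))
# prints: 100 20, then 200 10, then 400 5

# Compare the most extreme draws of the two distributions (same centre and scale).
rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(10000)]
cauchy = [sample_cauchy(0.0, 1.0, rng) for _ in range(10000)]
print(max(abs(x) for x in gaussian), max(abs(x) for x in cauchy))
```

The largest Cauchy draw is typically orders of magnitude further from the centre than the largest Gaussian draw, which is the property that may help the search escape local minima.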