1 Introduction

A typical artificial neural network (ANN) has two types of basic components: neurons, which are processing elements, and links, which are interconnections between neurons. Each link has a weighting parameter. Each neuron receives stimuli from other neurons, processes the information, and produces an output. Neurons are categorized as input, output and hidden neurons. The first and last layers are called the input and output layers, respectively, and the remaining layers are called hidden layers (Zhang and Gupta 2003).

Consider \( N_k \) as the number of neurons in the kth layer. Let \( w_{ij}^{k} \) represent the weight of the link between the jth neuron of the (k − 1)th layer and the ith neuron of the kth layer. Suppose \( w_{i0}^{k} \) is an additional weighting parameter for each neuron, representing the bias of the ith neuron of the kth layer. The weighting parameters are initialized before training the neural network. During training, they are updated iteratively in a systematic manner (Wang et al. 1999). Once the neural network training is completed, the weighting parameters remain fixed while the neural network is used as a model.
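For illustration only, the following NumPy sketch shows how the weights \( w_{ij}^{k} \) and biases \( w_{i0}^{k} \) are applied during a forward pass through a feedforward network; the 4-8-3 layer sizes and the sigmoid transfer function are assumptions made here, not specifications from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_pass(x, weights, biases):
    """Propagate input x through the network.

    weights[k] has shape (N_k, N_{k-1}): entry [i, j] is w^k_ij, the weight of
    the link between neuron j of layer k-1 and neuron i of layer k.
    biases[k][i] plays the role of w^k_i0.
    """
    activation = x
    for W, b in zip(weights, biases):
        activation = sigmoid(W @ activation + b)  # assumed sigmoid transfer function
    return activation

# Hypothetical 4-8-3 network (four inputs, eight hidden neurons, three outputs).
rng = np.random.default_rng(0)
weights = [rng.uniform(-1, 1, (8, 4)), rng.uniform(-1, 1, (3, 8))]
biases  = [rng.uniform(-1, 1, 8),      rng.uniform(-1, 1, 3)]
output  = forward_pass(rng.uniform(size=4), weights, biases)
```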

Training an ANN is the process of adjusting its weights and biases. Back-propagation (BP) learning has become the most popular method for training feedforward ANNs in many domains (Chronopoulos and Sarangapani 2002). However, as a gradient-descent technique, BP requires a differentiable neuron transfer function. Moreover, since neural networks generate complex error surfaces with multiple local minima, BP tends to converge to a local minimum instead of the global minimum (Gupta and Sexton 1999).

In recent years, many improved learning algorithms have been proposed to overcome the handicaps of gradient-based techniques. These algorithms include a direct optimization method using a polytope algorithm (Curry and Morgan 1997) and global search techniques such as evolutionary programming (Salchenberger et al. 1992), the genetic algorithm (GA) (Sexton et al. 1998; Kim et al. 2005), ant colony optimization (ACO) (Socha and Blum 2007), particle swarm optimization (PSO) (Yu et al. 2008; Garro et al. 2009; Zamani and Sadeghian 2010), differential evolution (DE) (Slowik and Bialko 2008; Garro et al. 2010, 2011), and the artificial bee colony algorithm (ABC) (Karaboga and Ozturk 2009). Unlike the standard gradient-descent BP, which is trajectory-driven, these improved learning algorithms are population-driven and have explorative search features. Consequently, they are expected to avoid local minima more often by promoting exploration of the search space.

Harmony search (HS) is a meta-heuristic optimization method imitating the music improvisation process, in which musicians improvise the pitches of their instruments searching for a perfect state of harmony (Lee and Geem 2004, 2005a; Lee et al. 2005b). HS algorithms have been successfully applied to a wide range of practical optimization problems, such as structural optimization (Lee and Geem 2004, 2005a; Lee et al. 2005b), parameter estimation of the nonlinear Muskingum model (Kim et al. 2001), design optimization of water distribution networks (Geem et al. 2002; Geem 2006, 2009a, b), vehicle routing (Geem et al. 2005), combined heat and power economic dispatch (Vasebi et al. 2007), design of steel frames (Degertekin 2008a, b), bandwidth-delay-constrained least-cost multicast routing (Forsati et al. 2008), transport energy modeling (Ceylan et al. 2008) and quadratic programming (Fesanghary et al. 2008). In addition, several algorithms, such as the novel global harmony search (NGHS) (Zou et al. 2010), have recently been developed by adapting the search behavior of the HS. In this paper, a swarm intelligence technique is employed to enhance the performance of the NGHS.

The paper is organized as follows. In Sect. 2, the procedures of the HS and NGHS algorithms are briefly presented. The improved NGHS algorithm is presented in Sect. 3. In Sect. 4, this algorithm is employed to train a feedforward ANN. We end the paper with some conclusions in Sect. 5.

2 Harmony search and novel global harmony search algorithms

In this section, harmony search and NGHS algorithms are reviewed.

2.1 HS algorithm

In the basic HS algorithm (Lee and Geem 2005a), each solution, called a harmony, is represented by an n-dimensional real vector. First, an initial population of harmony vectors is randomly generated and stored in a harmony memory (HM). Then, a new candidate harmony is generated from the solutions in the HM by applying a memory consideration rule, a pitch adjustment rule and random re-initialization. Next, the HM is updated by comparing the new candidate harmony with the worst harmony vector in the HM: the worst harmony vector is replaced by the new candidate vector if the latter is better. This process is repeated until a termination criterion is met. The basic HS consists of three phases: initialization, improvisation of a harmony vector, and updating of the HM. These phases are described as follows.

2.1.1 Initialization of the algorithm parameters

In general, the global optimization problem can be written as

$$ \begin{gathered} \min \; f(X) \\ \text{s.t.:}\quad x(j) \in \left[ LB(j),\, UB(j) \right], \quad j = 1, 2, \ldots, n \end{gathered} $$
(1)

where f(X) is the objective function, n is the number of decision variables, \( X = \left[ x(1), x(2), \ldots, x(n) \right] \) is the set of decision variables, and LB(j) and UB(j) are the lower and upper bounds of the decision variable x(j), respectively.
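As a concrete, hypothetical instance of the problem in Eq. (1), the snippet below defines an objective function together with its bound vectors; the sphere function and the bounds are placeholders only.

```python
import numpy as np

n = 5                              # number of decision variables
LB = np.full(n, -1.0)              # lower bounds LB(j)
UB = np.full(n,  1.0)              # upper bounds UB(j)

def f(X):
    """Placeholder objective to be minimized (sphere function)."""
    return float(np.sum(X ** 2))
```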

The parameters of the HS algorithm are harmony memory size (HMS), i.e., the number of solution vectors in harmony memory, harmony memory consideration rate (HMCR), pitch adjusting rate (PAR), distance bandwidth (BW), and number of improvisations (NI), i.e., the total number of function evaluations. Obviously, a good set of parameters can enhance the algorithm’s ability to search for the global optimum or near optimum region with a high convergence rate.

2.1.2 Initialization of the harmony memory

The number of harmony vectors in the HM is HMS. Let \( X_i = \left[ x_i(1), x_i(2), \ldots, x_i(n) \right] \) represent the ith harmony vector, which is randomly generated as follows

$$ \begin{gathered} x_i(j) = LB(j) + \left( UB(j) - LB(j) \right) \times r \\ i = 1, 2, \ldots, HMS, \quad j = 1, 2, \ldots, n \end{gathered} $$
(2)

where r is a uniform random number between 0 and 1. The HM matrix is then filled with harmony vectors as follows.

$$ HM = \left[ X_1, X_2, \ldots, X_{HMS} \right]^{T}. $$
(3)
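A minimal sketch of the initialization in Eqs. (2) and (3), assuming NumPy and placeholder bounds:

```python
import numpy as np

def initialize_harmony_memory(hms, lb, ub, rng):
    """Fill the HM with HMS random harmony vectors, Eqs. (2)-(3).

    Each row of the returned (HMS x n) matrix is one harmony vector X_i,
    with x_i(j) drawn uniformly from [LB(j), UB(j)].
    """
    n = len(lb)
    r = rng.uniform(size=(hms, n))
    return lb + (ub - lb) * r

rng = np.random.default_rng(0)
lb, ub = np.full(5, -1.0), np.full(5, 1.0)   # hypothetical bounds
HM = initialize_harmony_memory(hms=10, lb=lb, ub=ub, rng=rng)
```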

2.1.3 Improvise a new harmony

A new decision variable xnew(j) is improvised by applying one of three rules: memory consideration, pitch adjustment, or random selection. First, a uniform random number r1 is generated in the range [0, 1]. If r1 is less than HMCR, the decision variable xnew(j) is generated by memory consideration; otherwise, xnew(j) is obtained by random selection, i.e., random re-initialization between the search bounds. In memory consideration, xnew(j) is set to the jth decision variable of a harmony vector selected at random from the HM. Next, if the decision variable xnew(j) has been updated by memory consideration, it undergoes a pitch adjustment with probability PAR. The pitch adjustment rule is given by

$$ x_{new}(j) = x_{new}(j) \pm r \times BW $$
(4)

where r is a uniform random number between 0 and 1.
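The improvisation rules of this subsection can be sketched as follows; this is a non-authoritative NumPy sketch in which the random choice of sign in Eq. (4) and the clipping of adjusted values to the search bounds are assumptions.

```python
import numpy as np

def improvise(HM, hmcr, par, bw, lb, ub, rng):
    """Improvise one new harmony from the harmony memory HM (HMS x n)."""
    hms, n = HM.shape
    x_new = np.empty(n)
    for j in range(n):
        if rng.uniform() < hmcr:                      # memory consideration
            x_new[j] = HM[rng.integers(hms), j]       # jth variable of a random harmony
            if rng.uniform() < par:                   # pitch adjustment, Eq. (4)
                x_new[j] += np.sign(rng.uniform() - 0.5) * rng.uniform() * bw
                x_new[j] = np.clip(x_new[j], lb[j], ub[j])  # keep within bounds (assumption)
        else:                                         # random selection
            x_new[j] = lb[j] + (ub[j] - lb[j]) * rng.uniform()
    return x_new
```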

2.1.4 Updating the harmony memory

Once a new harmony vector \( X_{new} \) is generated, the HM must be updated through a survival-of-the-fittest competition between \( X_{new} \) and the worst harmony vector \( X_{worst} \). If the fitness value of \( X_{new} \) is better than that of \( X_{worst} \), the latter is replaced by \( X_{new} \) and becomes a new member of the HM.

2.1.5 Computational procedure

The computational procedure of the basic HS algorithm for a minimization problem can be summarized as follows (Lee and Geem 2005a).

Step 1: Set the parameters HMS, HMCR, PAR, BW, and NI.

Step 2: Initialize the HM and calculate the objective function value of each harmony vector.

Step 3: Improvise a new harmony \( X_{new} \) using memory consideration, pitch adjustment and random selection (Sect. 2.1.3).

Step 4: Update the HM as \( X_{worst} = X_{new} \) if \( f(X_{new}) < f(X_{worst}) \).

Step 5: If NI is completed, return the best harmony vector \( X_{best} \) from the HM; otherwise go back to Step 3.
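Putting Steps 1–5 together, a compact, self-contained sketch of the basic HS loop might look like the following; the parameter values and the sphere objective are placeholders, not the settings used in this paper.

```python
import numpy as np

def harmony_search(f, lb, ub, hms=10, hmcr=0.9, par=0.3, bw=0.01, ni=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(lb)
    HM = lb + (ub - lb) * rng.uniform(size=(hms, n))          # Step 2: initialize the HM
    fitness = np.array([f(x) for x in HM])
    for _ in range(ni):                                       # Steps 3-5: main loop
        x_new = np.empty(n)
        for j in range(n):
            if rng.uniform() < hmcr:                          # memory consideration
                x_new[j] = HM[rng.integers(hms), j]
                if rng.uniform() < par:                       # pitch adjustment
                    x_new[j] += np.sign(rng.uniform() - 0.5) * rng.uniform() * bw
            else:                                             # random selection
                x_new[j] = lb[j] + (ub[j] - lb[j]) * rng.uniform()
        x_new = np.clip(x_new, lb, ub)
        f_new = f(x_new)
        worst = int(np.argmax(fitness))                       # Step 4: replace the worst if better
        if f_new < fitness[worst]:
            HM[worst], fitness[worst] = x_new, f_new
    best = int(np.argmin(fitness))                            # Step 5: return the best harmony
    return HM[best], fitness[best]

# Usage with a placeholder objective.
lb, ub = np.full(5, -1.0), np.full(5, 1.0)
x_best, f_best = harmony_search(lambda x: float(np.sum(x ** 2)), lb, ub)
```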

2.2 Novel global harmony search algorithm

A recently proposed approach, called NGHS (Zou et al. 2010), modifies the improvisation step of the HS by replacing HMCR and PAR with a genetic mutation probability \( p_m \). The NGHS improvisation combines a position-updating rule with this genetic mutation (a sketch is given at the end of this subsection), where \( x_j^{L} \) and \( x_j^{U} \) refer to the lower and upper bounds of the jth decision variable, respectively, and 'best' and 'worst' are the indexes of the best and worst harmonies in the HM. Figure 1 illustrates the principle of position updating. The adaptive step for the jth decision variable is defined as \( \text{Step}_j = \left| x_j^{best} - x_j^{worst} \right| \). The region between P and R, which is a region near the global-best harmony, is defined as the trust region for the jth decision variable.

Fig. 1 The schematic diagram of position updating

A balance between global search and local search is maintained by the dynamically adjusted \( \text{Step}_j \). To enhance the algorithm's ability to escape from local optima, a genetic mutation operation with a small probability is applied. The worst harmony \( X_{worst} \) in the HM is always replaced with the new harmony \( X_{new} \).
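Based on the description above and on Zou et al. (2010), the NGHS improvisation can be sketched as follows. This is a hedged reconstruction, since the original pseudocode is not reproduced here; `pm` denotes the genetic mutation probability.

```python
import numpy as np

def nghs_improvise(HM, fitness, pm, lb, ub, rng):
    """One NGHS improvisation: position updating toward the trust region
    near the best harmony, plus genetic mutation with probability pm."""
    best = HM[np.argmin(fitness)]
    worst = HM[np.argmax(fitness)]
    n = HM.shape[1]
    x_new = np.empty(n)
    for j in range(n):
        # Position updating: point R = 2*best - worst, clipped to the bounds,
        # then a uniform draw between the worst harmony and R (the trust region).
        x_r = np.clip(2.0 * best[j] - worst[j], lb[j], ub[j])
        x_new[j] = worst[j] + rng.uniform() * (x_r - worst[j])
        if rng.uniform() <= pm:                      # genetic mutation
            x_new[j] = lb[j] + rng.uniform() * (ub[j] - lb[j])
    return x_new
```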

3 Intelligent global harmony search (IGHS) algorithm

To solve continuous optimization problems, this paper presents a new optimization algorithm, named IGHS. The main motivation for presenting the IGHS is to enhance the ability of HS algorithms to deal with such optimization problems. In this method, the concept of swarm intelligence, as proposed in the PSO technique (Kennedy and Eberhart 1995), is employed to improve the performance of the third step. In the PSO technique, a swarm of particles flies through the search space. Each particle represents a candidate solution to the optimization problem. The position of a particle is influenced by the best position visited by the particle itself, i.e., its own experience, and by the position of the best particle in the swarm, i.e., the experience of the swarm.

To enhance the performance of the NGHS, this idea is used in the IGHS to modify the improvisation step of the NGHS. The modification is made in such a way that the new harmony imitates one dimension of the best harmony in the HM; a speculative sketch of one possible implementation is given after the list of differences below.

In comparison with the HS algorithms described above, the IGHS has the following differences.

1. The new harmony is chosen from the range between the values of the worst and best harmonies.
2. To improve the performance of the algorithm, the best harmony in the HM is used in the third step of the IGHS.
3. To maintain suitable population diversity and good exploration of the search space, the worst harmony in the HM is always replaced with the new harmony.
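Because the IGHS pseudocode itself is not reproduced here, the following is only a speculative sketch of an improvisation step consistent with the textual description and the three differences listed above; the exact rule used by the authors may differ.

```python
import numpy as np

def ighs_improvise(HM, fitness, hmcr, par, lb, ub, rng):
    """Speculative sketch of an IGHS improvisation step (not the authors' exact rule)."""
    best = HM[np.argmin(fitness)]
    worst = HM[np.argmax(fitness)]
    n = HM.shape[1]
    x_new = np.empty(n)
    for j in range(n):
        if rng.uniform() < hmcr:
            # The new harmony imitates one (randomly chosen) dimension of the best harmony.
            x_new[j] = best[rng.integers(n)]
            if rng.uniform() < par:
                # Draw the value between the worst and best harmonies on dimension j.
                x_new[j] = worst[j] + rng.uniform() * (best[j] - worst[j])
        else:
            # Random re-initialization between the search bounds.
            x_new[j] = lb[j] + rng.uniform() * (ub[j] - lb[j])
    return x_new
# The worst harmony in the HM is then always replaced by x_new (difference 3 above).
```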

The computational procedure of the basic IGHS algorithm for a minimization problem can be summarized as follows.

Step 1: Set the parameters HMS, HMCR, PAR and NI.

Step 2: Initialize the HM and calculate the objective function value of each harmony vector.

Step 3: Improvise a new harmony \( X_{new} \) using the modified improvisation step described above.

Step 4: Update the HM as \( X_{worst} = X_{new} \).

Step 5: If NI is completed, return the best harmony vector \( X_{best} \) from the HM; otherwise go back to Step 3.

4 Simulation results, analysis and discussion

When ANN training is initiated, the iterative process of presenting the training patterns of the dataset to the network's inputs continues until a given termination condition is satisfied. The termination condition is usually a criterion indicating that the currently achieved solution is good enough to stop training. For instance, one common termination criterion in the BP is the difference between the current value of the sum of squared errors (SSE) and that obtained in the previous iteration (Fausett 1994).

Figure 2 illustrates a small-scale sample ANN. A harmony vector in the HM represents the decision variables in the HS; each vector represents a complete set of ANN weights, including biases. The objective is to minimize the SSE (Hassoun 1995). The squared difference between the target output and the actual output determines the amount of error; this is represented by \( (t - z)^{2} \) for each pattern and each output unit, as shown in Fig. 2. Since the values of ANN weights are usually within the same range, we simplify the HS model by setting \( x^{L} = -1 \) and \( x^{U} = 1 \).

Fig. 2 A small scale sample ANN
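To make the encoding concrete, the sketch below decodes a flat harmony vector into the weights and biases of a single-hidden-layer network and evaluates the SSE objective; the 4-8-3 layer sizes, the sigmoid activations and the random data are assumptions used only for illustration.

```python
import numpy as np

def sse_objective(harmony, X, T, n_in, n_hidden, n_out):
    """Decode a flat harmony vector into ANN weights/biases and return the SSE.

    harmony : flat vector of length n_hidden*(n_in+1) + n_out*(n_hidden+1)
    X, T    : training inputs (patterns x n_in) and targets (patterns x n_out)
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    k = n_hidden * n_in
    W1 = harmony[:k].reshape(n_hidden, n_in)          # input-to-hidden weights
    b1 = harmony[k:k + n_hidden]                      # hidden biases
    k += n_hidden
    W2 = harmony[k:k + n_out * n_hidden].reshape(n_out, n_hidden)  # hidden-to-output weights
    b2 = harmony[k + n_out * n_hidden:]               # output biases
    Z = sigmoid(sigmoid(X @ W1.T + b1) @ W2.T + b2)   # actual outputs z
    return float(np.sum((T - Z) ** 2))                # SSE = sum of (t - z)^2

# Hypothetical usage: 4 inputs, 8 hidden neurons, 3 outputs, with random data.
rng = np.random.default_rng(0)
dim = 8 * (4 + 1) + 3 * (8 + 1)
harmony = rng.uniform(-1, 1, dim)                     # weights bounded in [-1, 1]
X = rng.uniform(size=(10, 4))
T = rng.integers(0, 2, size=(10, 3)).astype(float)
print(sse_objective(harmony, X, T, n_in=4, n_hidden=8, n_out=3))
```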

In this study, three benchmark classification problems are studied: the classification of Iris, Breast Cancer, and Glass. In the Iris problem, the category of a flower is identified according to four of its properties. This problem consists of 150 examples, which can be classified into three categories, each accounting for one-third of the examples; 101 examples are used for training and the rest are used for testing. In the Breast Cancer problem, breast lumps are diagnosed as either benign or malignant based on data from the examination of cells. This problem consists of 699 examples, each with nine inputs and two outputs; 500 examples are used for training and the rest are used for testing. In the Glass problem, the type of glass, either window glass or non-window glass, is predicted based on measurements of the chemical content. This problem consists of 214 examples, each with nine inputs and two outputs; 154 examples are used for training and 60 examples are used for testing. To evaluate the performance of the proposed algorithm, we compare its computational results with those provided by the BP, HS, and NGHS.

A benchmark dataset is partitioned into training, validation and testing sets. To prevent over-fitting and, therefore, to improve classification performance on the test data, separate training and validation sets may be used (Hassoun 1995). However, since evolutionary-based techniques search globally for a solution, a validation set is not required for them (Sexton and Dorsey 2000; Kiranyaz et al. 2009).

In the BP, the learning rate η is 0.7 and 5% of the training data is used for validation checks. Table 1 shows the parameter settings of all HS algorithms, including the IGHS algorithm. Here, 'NI' represents the number of iterations. The parameters have been chosen according to the recommendations given in related references (for example, see Lee and Geem 2005a; Zou et al. 2010) and also based on the results of simulations.

Table 1 Parameter settings of HS algorithms

For all benchmark problems, the number of hidden neurons for all algorithms is 8. The number of hidden neurons has been chosen to be less than the number of inputs, which increases the speed of the algorithms. Also, the number of tuning parameters (i.e., weights and biases) is set to be less than the number of training examples, which results in good generalization for the network.

The classification results after 20 iterations are presented in Table 2. In this table, 'Std' represents the standard deviation, and the values in bold refer to the best results. The results show that the NGHS and IGHS perform better than the BP and HS in terms of classification accuracy and computational expense. Furthermore, Table 2 shows that the IGHS has better overall performance than the NGHS. Moreover, the simulation studies reveal that the IGHS has a higher chance than the other algorithms of finding a good solution for a given number of iterations.

Table 2 Simulation results for benchmark problems

The lowest SSE values are found by the BP for the Iris dataset and by the IGHS for the Breast Cancer and Glass datasets. It can be seen from Table 2 that the proposed IGHS algorithm achieves higher testing accuracies than the BP and the other HS algorithms. This means that the IGHS algorithm can classify these datasets better than the other algorithms considered in this paper.

5 Conclusions

Using the concept of swarm intelligence, this paper presented an intelligent global harmony search algorithm, called IGHS. It was used to train feedforward neural networks on three benchmark classification problems, namely the Iris, Breast Cancer, and Glass problems. The performance of the proposed algorithm was compared with that of the BP, HS and NGHS.

Taking classification accuracy and computational expense into account, the simulation results revealed that the NGHS and IGHS algorithms performed better than the BP and HS. It was also shown that the IGHS had better overall performance than the NGHS. Moreover, the simulation studies indicated that the IGHS had a higher chance than the other algorithms of finding a good solution for a given number of iterations. The lowest SSE values were found by the BP for the Iris dataset and by the IGHS for the Breast Cancer and Glass datasets. Furthermore, the IGHS achieved higher testing accuracies than the BP and the other HS algorithms. Therefore, it was concluded that the IGHS classified the datasets better than the other algorithms considered in this paper.