
1 Introduction

In many applications, we have to deal with complex optimization problems with complicated constraints. Such problems can be challenging to solve due to their complexity, nonlinearity and potentially high dimensionality. These problems can even be NP-hard, and thus require alternative methods because conventional methods usually cannot cope with such complex problems. In recent years, nature-inspired metaheuristic algorithms have gained huge popularity because they have demonstrated promising results in solving tough optimization problems. These metaheuristic algorithms include ant colony optimization, particle swarm optimization, cuckoo search, the firefly algorithm, the bat algorithm, bee algorithms and others [14]. There are many reasons for such popularity. From the algorithm analysis point of view, these algorithms tend to be flexible, efficient and highly adaptable, and yet easy to implement. The high efficiency of these algorithms makes it possible to apply them to a wide range of problems in diverse applications.

Swarm intelligence is quite a general concept in which multiple agents interact and exchange information, following simple rules. Rather surprisingly, such simple systems can show complex, self-organized behaviour. Though the characteristics of agent interactions may be drawn from different sources of inspiration in nature [4], the algorithmic procedures can be quite simple and flexible, and yet efficient in practice. On the other hand, evolutionary computation is traditionally considered as part of computational intelligence, which concerns optimization with continuous, combinatorial or mixed problems. Algorithms such as genetic algorithms and evolution strategies are good examples of evolutionary computation. However, evolutionary computation has broadened its scope and extended to include many areas. Loosely speaking, swarm intelligence is part of the evolutionary computation paradigm, but the interest in swarm intelligence is so overwhelming that it has almost become a field in itself. Here, we will not debate what the right terminology or fields should be. We will discuss both swarm intelligence and evolutionary computation in the broadest sense.

The main purpose of this chapter is to provide an overview of recent advances in swarm intelligence and evolutionary computation. Therefore, the chapter is organized as follows. Section 2 outlines some recent nature-inspired algorithms, followed by the analysis and discussion of adaptation and diversity in these algorithms in Sect. 3. Section 4 discusses the self-tuning framework for parameter tuning and control, while Sect. 5 outlines the convergence analysis of the cuckoo search and the bat algorithm. Finally, some discussions and open problems are presented in Sect. 6.

2 Swarm Intelligence, Adaptation and Diversity

Extensive research activities have resulted in significant developments in swarm intelligence (SI) in recent years. It is not possible to cover even a good fraction of the extensive literature in this brief review and thus we have to focus on the most recent algorithms, especially those in the last few years.

2.1 The Essence of an Algorithm

Different disciplines may view an algorithm from different perspectives. In essence, an algorithm A is an iterative process, which aims to generate a new and better solution \( \varvec{x}^{t + 1} \) to a given problem from the current solution \( \varvec{x}^{t} \) at iteration or (pseudo)time t. It can be written as

$$ \varvec{x}^{t + 1} = A(\varvec{x}^{t} ,p), $$
(1)

where p is an algorithm-dependent parameter. For example, the Newton-Raphson method to find the optimal value of \( f(\varvec{x}) \) is equivalent to finding the critical points or roots of \( f^{\prime}(\varvec{x}) = 0 \) in a d-dimensional space. That is,

$$ \varvec{x}^{t + 1} = \varvec{x}^{t} - \frac{{f^{\prime}(\varvec{x}^{t} )}}{{f^{\prime\prime}(\varvec{x}^{t} )}} = A(\varvec{x}^{t} ). $$
(2)

Obviously, the convergence rate may become very slow near the optimal point where \( f^{\prime}(x) \to 0 \), and the true convergence rate may not be as quick as expected. A simple way to improve the convergence is to modify the above formula slightly by introducing a parameter p as follows:

$$ \varvec{x}^{t + 1} = \varvec{x}^{t} - p\frac{{f^{\prime}(\varvec{x}^{t} )}}{{f^{\prime\prime}(\varvec{x}^{t} )}},\quad p = \frac{1}{{1 - A^{\prime}(\varvec{x}_{*} )}}. $$
(3)

Here, \( \varvec{x}_{*} \) is the optimal solution, or a fixed point of the iterative formula. For simplicity, we can treat p as a step size and this essentially becomes the modified Newton-Raphson method.

The above formula is for a trajectory-based, single agent system. For population-based algorithms with a swarm of n solutions \( (\varvec{x}_{1} ,\varvec{x}_{2} , \ldots ,\varvec{x}_{n} ) \), we can extend the above iterative formula to a more general form

$$ \left( {\begin{array}{*{20}c} {\varvec{x}_{1} } \\ {\varvec{x}_{2} } \\ \vdots \\ {\varvec{x}_{n} } \\ \end{array} } \right)^{t + 1} = A\left( {(\varvec{x}_{1}^{t} ,\;\varvec{x}_{2}^{t} , \ldots ,\varvec{x}_{n}^{t} );(p_{1} ,p_{2} , \ldots ,p_{k} );({\epsilon }_{1} ,{\epsilon }_{2} , \ldots ,{\epsilon }_{m} )} \right)\left( {\begin{array}{*{20}c} {\varvec{x}_{1} } \\ {\varvec{x}_{2} } \\ \vdots \\ {\varvec{x}_{n} } \\ \end{array} } \right)^{t} , $$
(4)

where \( p_{1} , \ldots ,p_{k} \) are k algorithm-dependent parameters and \( \epsilon_{1} , \ldots ,\epsilon_{m} \) are m random variables. An algorithm can be viewed as a dynamical system, a Markov chain or an iterative map [4], and it can also be viewed as a self-organized system [5]. Here, the introduction of m random variables captures the essence of all contemporary evolutionary algorithms because they all use some sort of randomization technique.

Whatever the perspective may be, the aim of such an iterative process is to let the system evolve and converge into some stable optimality. In this case, it has strong similarity to a self-organizing system. Such an iterative, self-organizing system can evolve, according to a set of rules or mathematical equations. As a result, such a complex system can interact and self-organize into certain converged states, showing some emergent characteristics of self-organization. In this sense, the proper design of an efficient optimization algorithm is equivalent to finding efficient ways to mimic the evolution of a self-organizing system [5]. In practice, all nature-inspired algorithms try to mimic some successful characteristics of biological, physical or chemical systems in nature [1, 4, 6].

Among all evolutionary algorithms, algorithms based on swarm intelligence (SI) dominate the landscape. There are many reasons for this dominance, though three obvious ones are: (1) swarm intelligence uses multiple agents as an evolving, interacting population, and thus provides a good way to mimic natural systems; (2) population-based approaches allow parallel and vectorized implementations in practice, and are thus straightforward to implement; (3) SI-based algorithms are simple and easy to implement, and they are flexible and yet sufficiently efficient. As a result, these algorithms can deal with a relatively wide range of problems in applications.

2.2 Particle Swarm Optimization

Particle swarm optimization (PSO) is among the first of the swarm intelligence (SI) based algorithms. PSO was developed by Kennedy and Eberhart in 1995 [1], based on the swarm behaviour of fish schooling and birds flocking in nature. Each particle updates its position \( \varvec{x}_{i} \) and velocity \( \varvec{\upsilon}_{i} \) by

$$ \varvec{\upsilon}_{i}^{t + 1} =\varvec{\upsilon}_{i}^{t} + \alpha \,\varvec{\epsilon}_{1} [\varvec{g}^{*} - \varvec{x}_{i}^{t} ] + \beta\varvec{\epsilon}_{2} [\varvec{x}_{i}^{*} - \varvec{x}_{i}^{t} ], $$
(5)
$$ \varvec{x}_{i}^{t + 1} = \varvec{x}_{i}^{t} +\varvec{\upsilon}_{i}^{t + 1} , $$
(6)

where \( \varvec{\epsilon}_{1} \) and \( \varvec{\epsilon}_{2} \) are two random vectors, drawn from a uniform distribution between 0 and 1. Here, α and β are the learning parameters with typical values of \( \alpha \approx \beta \approx 2 \).
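
To make these update equations concrete, the following is a minimal Python sketch of PSO based on Eqs. (5) and (6); the sphere test function, the swarm size, the bounds and the bound handling are illustrative assumptions rather than part of the standard description.

```python
import numpy as np

# A minimal sketch of PSO based on Eqs. (5) and (6); the sphere function,
# swarm size and bounds are illustrative assumptions.
def pso(f, dim=5, n=20, alpha=2.0, beta=2.0, iters=200, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n, dim))          # positions
    v = np.zeros((n, dim))                     # velocities
    p_best = x.copy()                          # personal best positions x_i^*
    p_val = np.array([f(xi) for xi in x])
    g_best = p_best[p_val.argmin()].copy()     # global best g^*
    for _ in range(iters):
        eps1, eps2 = rng.random((n, dim)), rng.random((n, dim))
        v = v + alpha * eps1 * (g_best - x) + beta * eps2 * (p_best - x)  # Eq. (5)
        x = np.clip(x + v, lb, ub)                                        # Eq. (6)
        vals = np.array([f(xi) for xi in x])
        improved = vals < p_val
        p_best[improved], p_val[improved] = x[improved], vals[improved]
        g_best = p_best[p_val.argmin()].copy()
    return g_best, p_val.min()

best, fmin = pso(lambda z: float(np.sum(z ** 2)))   # sphere function
print(best, fmin)
```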

The literature of PSO is vast, with thousands of papers and dozens of books. Therefore, we will not provide a detailed literature review here, and readers can find such literature quite easily.

2.3 Some Recent SI-Based Algorithms

In this section, we will focus on SI-based optimization algorithms that have been developed in recent years and have attracted much attention.

2.3.1 Firefly Algorithm

The firefly algorithm (FA) is simple, flexible and easy to implement. It was developed by Yang in 2008 [2], based on the flashing patterns and behaviour of tropical fireflies. FA can naturally deal with nonlinear, multimodal optimization problems.

The movement of a firefly i towards another, more attractive (brighter) firefly j is determined by

$$ \varvec{x}_{i}^{t + 1} = \varvec{x}_{i}^{t} + \beta_{0} e^{{ - \gamma r_{ij}^{2} }} (\varvec{x}_{j}^{t} - \varvec{x}_{i}^{t} ) + \alpha \,\varvec{\epsilon}_{i}^{t} , $$
(7)

where the second term is due to the attraction between fireflies, and \( \beta_{0} \) is the attractiveness at \( r = 0 \). The third term is randomization, with α being the randomization parameter and \( \varvec{\epsilon}_{i}^{t} \) a vector of random numbers drawn from a Gaussian distribution at time t. Other studies also use randomization in terms of \( \varvec{\epsilon}_{i}^{t} \), which can easily be extended to other distributions such as Lévy flights. A comprehensive review of the firefly algorithm and its variants has been carried out by Fister et al. [7–9].

One novel feature of FA is its use of distance-related attraction, the first of its kind among SI-based algorithms. Since local attraction is stronger than long-distance attraction, the population in FA can automatically subdivide into multiple subgroups, and each group can potentially swarm around a local mode. Among all the local modes, there is always a global best solution, which is the true optimum of the problem. Thus, FA can deal with multimodal problems naturally and efficiently [3].
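
The following is a minimal Python sketch of the firefly move in Eq. (7); the parameter values, the pairwise sweep over the population, the sphere test function and the bounds are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the firefly move in Eq. (7); parameter values,
# the sphere test function and the bounds are illustrative assumptions.
def firefly(f, dim=5, n=20, beta0=1.0, gamma=1.0, alpha=0.2,
            iters=200, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n, dim))
    light = np.array([f(xi) for xi in x])      # lower f means a brighter firefly
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if light[j] < light[i]:        # move i towards the brighter j
                    r2 = np.sum((x[i] - x[j]) ** 2)
                    step = (beta0 * np.exp(-gamma * r2) * (x[j] - x[i])
                            + alpha * rng.normal(size=dim))          # Eq. (7)
                    x[i] = np.clip(x[i] + step, lb, ub)
                    light[i] = f(x[i])
    best = light.argmin()
    return x[best], light[best]

best, fmin = firefly(lambda z: float(np.sum(z ** 2)))
print(best, fmin)
```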

2.3.2 Cuckoo Search

The cuckoo search (CS) was developed in 2009 by Yang and Deb [10]. CS is based on the brood parasitism of some cuckoo species. In addition, this algorithm is enhanced by so-called Lévy flights [11], rather than by simple isotropic random walks. Recent studies show that CS is potentially far more efficient than PSO and genetic algorithms [12–14].

In essence, CS uses a balanced combination of a local random walk and the global explorative random walk, controlled by a switching parameter \( p_{a} \). The local random walk can be written as

$$ \varvec{x}_{i}^{t + 1} = \varvec{x}_{i}^{t} + \alpha s \otimes H(p_{a} - \epsilon ) \otimes (\varvec{x}_{j}^{t} - \varvec{x}_{k}^{t} ), $$
(8)

where \( \varvec{x}_{j}^{t} \) and \( \varvec{x}_{k}^{t} \) are two different solutions selected randomly by random permutation, \( H(u) \) is a Heaviside function, ϵ is a random number drawn from a uniform distribution, and s is the step size. On the other hand, the global random walk is carried out by using Lévy flights

$$ \varvec{x}_{i}^{t + 1} = \varvec{x}_{i}^{t} + \alpha L(s,\uplambda ), $$
(9)

where

$$ L(s,\uplambda ) = \frac{\uplambda \varGamma (\uplambda )\sin (\pi \uplambda /2)}{\pi }\frac{1}{{s^{1 + \uplambda } }},\quad (s{ \gg }s_{0} > 0). $$
(10)

Here \( \alpha > 0 \) is the step size scaling factor, which should be related to the scales of the problem of interest.

CS has two distinct advantages over other algorithms such as GA and SA: efficient random walks and balanced mixing. Since Lévy flights are usually far more efficient than other random-walk-based randomization techniques, CS can be efficient in global search [3, 11]. In fact, recent studies show that CS can have guaranteed global convergence [4]. In addition, the similarity between eggs can produce better new solutions, which is essentially fitness-proportional generation with a good mixing ability. In other words, CS has varying mutation realized by Lévy flights, while the fitness-proportional generation of new solutions based on similarity provides a subtle form of crossover. In addition, simulations also show that CS can have an auto-zooming ability in the sense that new solutions can automatically zoom into the region where the promising global optimum is located.

In addition, Eq. (9) is essentially simulated annealing in the framework of Markov chains. In Eq. (8), if \( p_{a} = 1 \) and \( \alpha s \in [0,1] \), CS can degenerate into a variant of differential evolution. Furthermore, if we replace \( \varvec{x}_{j}^{t} \) by the current best solution \( \varvec{g}^{*} \), then (8) can further degenerate into accelerated particle swarm optimization (APSO) [15]. This means that SA, DE and APSO are special cases of CS, and this explains why CS is so efficient [3].
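
A minimal Python sketch of cuckoo search is given below, combining the global Lévy-flight move of Eq. (9) with the local random walk of Eq. (8); the Mantegna-style generator for Lévy steps, the greedy acceptance, the parameter values and the sphere test function are illustrative assumptions.

```python
import numpy as np
from math import gamma, sin, pi

# A minimal sketch of cuckoo search: the global Levy-flight move of Eq. (9)
# plus the local random walk of Eq. (8). The Mantegna-style Levy step
# generator, greedy acceptance, parameter values and the sphere test
# function are illustrative assumptions.
def levy_step(lam, size, rng):
    sigma = (gamma(1 + lam) * sin(pi * lam / 2)
             / (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / lam)

def cuckoo_search(f, dim=5, n=20, pa=0.25, alpha=0.01, lam=1.5,
                  iters=500, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    nests = rng.uniform(lb, ub, (n, dim))
    fit = np.array([f(x) for x in nests])
    for _ in range(iters):
        # Global branch: Levy flights as in Eq. (9), with greedy acceptance.
        new = np.clip(nests + alpha * levy_step(lam, (n, dim), rng), lb, ub)
        new_fit = np.array([f(x) for x in new])
        better = new_fit < fit
        nests[better], fit[better] = new[better], new_fit[better]
        # Local branch: random walk as in Eq. (8); H(pa - eps) keeps roughly a
        # fraction pa of the components, and x_j, x_k come from permutations.
        j, k = rng.permutation(n), rng.permutation(n)
        mask = rng.random((n, dim)) < pa
        new = np.clip(nests + rng.random((n, dim)) * mask * (nests[j] - nests[k]),
                      lb, ub)
        new_fit = np.array([f(x) for x in new])
        better = new_fit < fit
        nests[better], fit[better] = new[better], new_fit[better]
    best = fit.argmin()
    return nests[best], fit[best]

best, fmin = cuckoo_search(lambda z: float(np.sum(z ** 2)))
print(best, fmin)
```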

2.3.3 Bat Algorithm

The bat algorithm (BA) was developed by Yang in 2010 [16]. It was inspired by the echolocation behavior of microbats. It is the first algorithm of its kind to use frequency tuning. Each bat is associated with a velocity \( \varvec{\upsilon}_{i}^{t} \) and a location \( \varvec{x}_{i}^{t} \), at iteration t, in a d-dimensional search or solution space. Among all the bats, there exists a current best solution \( \varvec{x}_{*} \). Therefore, the updating equations for \( \varvec{x}_{i}^{t} \) and velocities \( \varvec{\upsilon}_{i}^{t} \) can be written as

$$ f_{i} = f_{\hbox{min} } + (f_{\hbox{max} } - f_{\hbox{min} } )\beta , $$
(11)
$$ \varvec{\upsilon}_{i}^{t} =\varvec{\upsilon}_{i}^{t - 1} + (\varvec{x}_{i}^{t - 1} - \varvec{x}_{*} )f_{i} , $$
(12)
$$ \varvec{x}_{i}^{t} = \varvec{x}_{i}^{t - 1} +\varvec{\upsilon}_{i}^{t} , $$
(13)

where \( \beta \in [0,1] \) is a random vector drawn from a uniform distribution.

The loudness and pulse emission rates are regulated by the following equations:

$$ A_{i}^{t + 1} = \alpha A_{i}^{t} , $$
(14)

and

$$ r_{i}^{t + 1} = r_{i}^{0} [1 - \exp ( - \gamma t)], $$
(15)

where \( 0 < \alpha < 1 \) and \( \gamma > 0 \) are constants. In essence, here α is similar to the cooling factor of a cooling schedule in simulated annealing.
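
The following is a minimal Python sketch of the bat algorithm based on Eqs. (11)–(15); the local search around the current best, the acceptance rule, the parameter values and the sphere test function are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the bat algorithm based on Eqs. (11)-(15); the local
# search step, acceptance rule, parameter values and the sphere test
# function are illustrative assumptions.
def bat_algorithm(f, dim=5, n=20, fmin_=0.0, fmax_=2.0, alpha=0.97, gamma_=0.1,
                  A0=1.0, r0=0.5, iters=500, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n, dim))
    v = np.zeros((n, dim))
    fit = np.array([f(xi) for xi in x])
    best = x[fit.argmin()].copy()
    A, r = A0, 0.0                                       # loudness and pulse rate
    for t in range(1, iters + 1):
        beta = rng.random((n, 1))
        freq = fmin_ + (fmax_ - fmin_) * beta            # Eq. (11)
        v = v + (x - best) * freq                        # Eq. (12)
        xn = np.clip(x + v, lb, ub)                      # Eq. (13)
        # Local search: a small random walk around the current best solution.
        local = rng.random(n) > r
        xn[local] = np.clip(best + 0.01 * A * rng.normal(size=(local.sum(), dim)),
                            lb, ub)
        new_fit = np.array([f(xi) for xi in xn])
        accept = (new_fit < fit) & (rng.random(n) < A)   # accept if better and loud
        x[accept], fit[accept] = xn[accept], new_fit[accept]
        best = x[fit.argmin()].copy()
        A = alpha * A                                    # Eq. (14)
        r = r0 * (1 - np.exp(-gamma_ * t))               # Eq. (15)
    return best, fit.min()

best, fmin = bat_algorithm(lambda z: float(np.sum(z ** 2)))
print(best, fmin)
```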

BA has been extended to the multiobjective bat algorithm (MOBA) by Yang [17], and Fister et al. have extended it to a hybrid bat algorithm [18]. The preliminary results suggest that these variants are very efficient [19, 20].

2.3.4 Flower Algorithm

The flower pollination algorithm (FPA) was developed by Yang in 2012 [21], inspired by the pollination process of flowering plants. It has been extended to multiobjective optimization problems and found to be very efficient [22, 23]. For simplicity, we use the following four rules:

  1. Biotic and cross-pollination can be considered as a process of global pollination, and pollen-carrying pollinators move in a way which obeys Lévy flights (Rule 1).
  2. For local pollination, abiotic pollination and self-pollination are used (Rule 2).
  3. Pollinators such as insects can develop flower constancy, which is equivalent to a reproduction probability that is proportional to the similarity of the two flowers involved (Rule 3).
  4. The interaction or switching of local pollination and global pollination can be controlled by a switch probability \( p \in [0,1] \), with a slight bias towards local pollination (Rule 4).

In order to formulate the updating formulae, we have to convert the above rules into updating equations. For example, in the global pollination step, flower pollen gametes are carried by pollinators such as insects, and pollen can travel over a long distance because insects can often fly over a much longer range. Therefore, Rule 1 and flower constancy can be represented mathematically as

$$ \varvec{x}_{i}^{t + 1} = \varvec{x}_{i}^{t} + \gamma L(\uplambda )(\varvec{g}_{*} - \varvec{x}_{i}^{t} ), $$
(16)

where \( \varvec{x}_{i}^{t} \) is the pollen i or solution vector \( \varvec{x}_{i} \) at iteration t, and \( \varvec{g}_{*} \) is the current best solution found among all solutions at the current generation/iteration. Here γ is a scaling factor to control the step size.

Here \( L(\uplambda ) \) is the parameter that corresponds to the strength of the pollination, which essentially is also a step size. Since insects may move over a long distance with various distance steps, we can use a Lévy flight to mimic this characteristic efficiently. That is, we draw \( L > 0 \) from a Levy distribution

$$ L{ \sim }\frac{\uplambda \varGamma (\uplambda )\sin (\pi \uplambda /2)}{\pi }\frac{1}{{s^{1 + \uplambda } }},\quad (s{ \gg }s_{0} > 0). $$
(17)

Here \( \varGamma (\uplambda ) \) is the standard gamma function, and this distribution is valid for large steps \( s > 0 \). This step is essentially a global mutation step, which enables the algorithm to explore the search space more efficiently.

For the local pollination, both Rule 2 and Rule 3 can be represented as

$$ \varvec{x}_{i}^{t + 1} = \varvec{x}_{i}^{t} + \epsilon (\varvec{x}_{j}^{t} - \varvec{x}_{k}^{t} ), $$
(18)

where \( \varvec{x}_{j}^{t} \) and \( \varvec{x}_{k}^{t} \) are pollen from different flowers of the same plant species. This essentially mimics flower constancy in a limited neighborhood. Mathematically, if \( \varvec{x}_{j}^{t} \) and \( \varvec{x}_{k}^{t} \) come from the same species or are selected from the same population, this equivalently becomes a local random walk if we draw ϵ from a uniform distribution in [0, 1]. In essence, this is a local mutation and mixing step, which can help the algorithm to converge in a subspace.

In principle, flower pollination activities can occur at all scales, both local and global. But in reality, adjacent flower patches or flowers in the not-so-far-away neighborhood are more likely to be pollinated by local flower pollen than those far away. In order to mimic this feature, we can effectively use a switch probability (Rule 4), or proximity probability p, to switch between common global pollination and intensive local pollination. To start with, we can use a naive value of \( p = 0.5 \) as an initial value. A preliminary parametric study showed that \( p = 0.8 \) may work better for most applications.
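
The following is a minimal Python sketch of the flower pollination algorithm combining Eq. (16) and Eq. (18) with a switch probability p; the Mantegna-style Lévy step generator, the value p = 0.8, the scaling factor and the sphere test function are illustrative assumptions.

```python
import numpy as np
from math import gamma, sin, pi

# A minimal sketch of the flower pollination algorithm based on Eqs. (16)
# and (18); the Levy step generator, p = 0.8, the scaling factor and the
# sphere test function are illustrative assumptions.
def levy(lam, size, rng):
    sigma = (gamma(1 + lam) * sin(pi * lam / 2)
             / (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return rng.normal(0.0, sigma, size) / np.abs(rng.normal(0.0, 1.0, size)) ** (1 / lam)

def fpa(f, dim=5, n=20, p=0.8, gamma_=0.1, lam=1.5, iters=500,
        lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n, dim))
    fit = np.array([f(xi) for xi in x])
    g = x[fit.argmin()].copy()                # current best g_*
    for _ in range(iters):
        for i in range(n):
            if rng.random() < p:              # global pollination, Eq. (16)
                xn = x[i] + gamma_ * levy(lam, dim, rng) * (g - x[i])
            else:                             # local pollination, Eq. (18)
                j, k = rng.choice(n, 2, replace=False)
                xn = x[i] + rng.random() * (x[j] - x[k])
            xn = np.clip(xn, lb, ub)
            fn = f(xn)
            if fn < fit[i]:
                x[i], fit[i] = xn, fn
        g = x[fit.argmin()].copy()
    return g, fit.min()

best, fmin = fpa(lambda z: float(np.sum(z ** 2)))
print(best, fmin)
```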

Recent studies suggest that the flower pollination algorithm is very efficient for multiobjective optimization [23, 24].

2.4 Other Evolutionary Algorithms

There are quite a few other evolutionary algorithms which are not swarm intelligence based, so we also briefly introduce them here.

2.4.1 Differential Evolution

Differential evolution (DE) was developed by Storn and Price in 1996 and 1997 [25, 26]. In fact, modern differential evolution has strong similarity to the mutation operator in traditional pattern search. In essence, the mutation in DE can be viewed as a generalized pattern search in any random direction \( (\varvec{x}_{p} - \varvec{x}_{q} ) \) by

$$ \varvec{x}_{i} = \varvec{x}_{r} + F(\varvec{x}_{p} - \varvec{x}_{q} ), $$
(19)

where F is the differential weight in the range of [0,2]. Here, r, p, q, i are four different integers generated by random permutation.

In addition, DE also has a crossover operator, which is controlled by a crossover probability \( C_{r} \in [0,\;1] \), and the actual crossover can be carried out in two ways: binomial and exponential. Selection is essentially the same as that used in genetic algorithms: the fittest solution is selected, which for a minimization problem is the one with the minimum objective value. Therefore, we have

$$ \varvec{x}_{i}^{t + 1} = \left\{ {\begin{array}{*{20}l} {\varvec{u}_{i}^{t + 1} } \hfill & {{\text{if}}\,\,f(\varvec{u}_{i}^{t + 1} ) \le f(\varvec{x}_{i}^{t} ),} \hfill \\ {\varvec{x}_{i}^{t} } \hfill & {{\text{otherwise}}.} \hfill \\ \end{array} } \right. $$
(20)

Most studies have focused on the choice of F, \( C_{r} \) and the population size n, as well as modifications of the mutation scheme. In addition, it can be clearly seen that selection is also used when the condition in the above equation is checked. Almost all variants of DE use crossover, mutation and selection, and the main differences are in the mutation and crossover steps. For example, DE/Rand/1/Bin uses three vectors for mutation and binomial crossover. There are more than 10 different variants [27].
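
As a concrete illustration, the following is a minimal Python sketch of DE/Rand/1/Bin, with mutation as in Eq. (19), binomial crossover with probability \( C_{r} \) and selection as in Eq. (20); the values of F and \( C_{r} \), the population size and the sphere test function are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of DE/Rand/1/Bin: mutation as in Eq. (19), binomial
# crossover with probability Cr, and selection as in Eq. (20). The values
# of F, Cr and the sphere test function are illustrative assumptions.
def de_rand_1_bin(f, dim=5, n=20, F=0.8, Cr=0.9, iters=500,
                  lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n, dim))
    fit = np.array([f(xi) for xi in x])
    for _ in range(iters):
        for i in range(n):
            r, p, q = rng.choice([k for k in range(n) if k != i], 3, replace=False)
            v = np.clip(x[r] + F * (x[p] - x[q]), lb, ub)        # mutation, Eq. (19)
            mask = rng.random(dim) < Cr
            mask[rng.integers(dim)] = True                       # keep at least one gene
            u = np.where(mask, v, x[i])                          # binomial crossover
            fu = f(u)
            if fu <= fit[i]:                                     # selection, Eq. (20)
                x[i], fit[i] = u, fu
    best = fit.argmin()
    return x[best], fit[best]

best, fmin = de_rand_1_bin(lambda z: float(np.sum(z ** 2)))
print(best, fmin)
```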

2.4.2 Harmony Search

Harmony search (HS) is a music-inspired algorithm, developed by Geem et al. in 2001 [28]. It is not swarm-intelligence-based, but it is a metaheuristic algorithm. In the standard HS, solutions are represented in terms of a population of harmonies, using the following three choices/rules (of a musician playing a piece of music): (1) play any famous piece of music from memory, (2) play something similar to a known piece (pitch adjustment), and (3) compose new or random notes. HS uses mainly mutation and selection, while crossover is not explicitly used. The first rule corresponds to selection or elitism, and the second and third rules are mutation.

Mutation can be local and global in HS. For example, the pitch adjustment (the second rule) uses the following equation

$$ \varvec{x}_{new} = \varvec{x}_{old} + b_{w} \varepsilon , $$
(21)

where \( b_{w} \) is the bandwidth of the pitch adjustment, while ε is a random number drawn from \( [ - 1,1] \). This is a local random walk, and the distance of the random walk is controlled by the bandwidth. This part can be considered as a local mutation action with an equivalent mutation rate of 0.1–0.3.

The third rule is essentially mutation on a larger scale, which is essentially equivalent to random walks. Selection is controlled by the probability of choosing a harmony from the harmony memory. Similar to genetic algorithms, the probability of choosing harmonies from the population is high, with a typical value of 0.9, which enables the system to converge in a subspace. However, this may be at the expense of a reduced probability of finding the global optimum in some highly nonlinear problems.
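
The following is a minimal Python sketch of harmony search with memory consideration, pitch adjustment as in Eq. (21) and random notes; the harmony memory considering rate, the pitch adjustment rate, the bandwidth and the sphere test function are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of harmony search: memory consideration, pitch adjustment
# as in Eq. (21), and random notes. The rates (hmcr, par), the bandwidth bw
# and the sphere test function are illustrative assumptions.
def harmony_search(f, dim=5, hms=20, hmcr=0.9, par=0.3, bw=0.2,
                   iters=2000, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    hm = rng.uniform(lb, ub, (hms, dim))               # harmony memory
    fit = np.array([f(h) for h in hm])
    for _ in range(iters):
        new = np.empty(dim)
        for d in range(dim):
            if rng.random() < hmcr:                    # rule 1: pick from memory
                new[d] = hm[rng.integers(hms), d]
                if rng.random() < par:                 # rule 2: pitch adjustment, Eq. (21)
                    new[d] += bw * rng.uniform(-1.0, 1.0)
            else:                                      # rule 3: random new note
                new[d] = rng.uniform(lb, ub)
        new = np.clip(new, lb, ub)
        fn = f(new)
        worst = fit.argmax()
        if fn < fit[worst]:                            # replace the worst harmony
            hm[worst], fit[worst] = new, fn
    best = fit.argmin()
    return hm[best], fit[best]

best, fmin = harmony_search(lambda z: float(np.sum(z ** 2)))
print(best, fmin)
```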

2.4.3 Other Algorithms

Many other algorithms have appeared in the literature, which may require more extensive analysis and comparisons [4]. However, as this is not the main focus of this chapter, we will not go into more details about these algorithms.

One thing we may notice by analyzing these algorithms is that mutation and selection are always used, while crossover is not used in most of them. This raises the question of what exactly the role of crossover is, and calls for further analysis. Therefore, there is a strong need to investigate further how the two-stage eagle strategy and co-evolutionary methods can work better. In addition, systematic tuning of parameters in algorithms and careful control of these algorithm-dependent parameters may be very useful for understanding how these algorithms behave and how to improve them in practice.

3 Adaptation and Diversity

The effectiveness of these algorithms can be attributed to two important characteristics: adaptation and diversity of nature-inspired optimization algorithms.

Adaptation in nature-inspired algorithms can take many forms. For example, the way an algorithm balances exploration and exploitation is a key form of adaptation [29]. As diversity can be intrinsically linked with adaptation, it is better not to discuss these two features separately. If exploitation is strong, the search process will use problem-specific information (or landscape-specific information) obtained during the iterative process to guide new search moves. This may lead to a more focused search and thus reduce the diversity of the population, which may help to speed up the convergence of the search process. However, if exploitation is too strong, it can result in a quick loss of diversity in the population and thus may lead to premature convergence. On the other hand, if new search moves are not guided by local landscape information, the exploration capability typically increases and new solutions are generated with higher diversity. However, too much diversity and exploration may result in meandering search paths and thus lead to slow convergence. Therefore, the adaptation of search moves so as to balance exploration and exploitation is crucial. Consequently, maintaining a balanced diversity in the population is also important.

Adaptation can also be in terms of the representations of solutions of a problem. In genetic algorithms, solutions are usually represented as binary or real-valued strings [29, 30], while swarm-intelligence-based algorithms mostly use real-valued solution vectors. Another example is the population size used in an algorithm, which can be fixed or varying. Adaptation in this case may mean varying the population size so as to maximize the overall performance.

For a given algorithm, adaptation can also occur by adjusting its algorithm-dependent parameters. As the performance of an algorithm can largely depend on its parameters, the choice of these parameter values can be very important. Values can be varied so as to adapt to the landscape type of the problem and thus may lead to better search efficiency. Such parameter tuning is in essence parameter adaptation. Once parameters are tuned, they can remain fixed. However, there is no particular reason why parameters should be fixed. In fact, adaptation in parameters can be extended to parameter control; that is, to control the parameter values in such a way that they vary during the iterations so that the optimal performance of the algorithm can be achieved.

Diversity in metaheuristic algorithms can also take many forms. The simplest form of diversity is to allow variations of solutions in the population by randomization. For example, solution diversity in genetic algorithms is mainly controlled by mutation rates and crossover mechanisms, while in simulated annealing, diversity is achieved by random walks. In most swarm-intelligence-based algorithms, new solutions are generated according to a set of deterministic equations, which also include some random variables. Diversity is represented by the variation, often in terms of the population variance. As the population variance gets smaller (approaching zero), diversity also decreases, leading to converged solution sets. However, if diversity is reduced too quickly, premature convergence may occur. Therefore, the right amount of randomness and the right form of randomization can be crucial.

In addition, adaptation and diversity can also be related to the selection of solutions from the population and the replacement of the old population. If selection is based on fitness, parent solutions with higher fitness will be more likely to pass on to the next generation. In the extreme case, only the best solutions are selected, which is a kind of elitism. If the worst solutions are replaced by new (hopefully better) solutions, this will ensure that better solutions remain in the population. The balance of what to replace and what to pass on can be tricky, which requires good adaptation so as to maintain good diversity in the population.

4 Self-Tuning Algorithms

From a mathematical point of view, an algorithm A tends to generate a new and better solution \( \varvec{x}^{t + 1} \) to a given problem from the current solution \( \varvec{x}^{t} \) at iteration or time t. In modern metaheuristic algorithms, randomization is often used, and in many cases it appears in the form of a set of m random variables \( \varepsilon = (\varepsilon_{1} , \ldots ,\varepsilon_{m} ) \). For example, in simulated annealing, there is one random variable, while in particle swarm optimization, there are two random variables. In addition, there is often a set of k parameters in an algorithm. For example, in particle swarm optimization, there are four parameters (two learning parameters, one inertia weight, and the population size). In general, we can have a vector of parameters \( \varvec{p} = (p_{1} , \ldots ,p_{k} ) \). Mathematically speaking, we can write an algorithm with k parameters and m random variables as

$$ \varvec{x}^{t + 1} = \varvec{A}(\varvec{x}^{t} ,\varvec{p}(t),\varepsilon {\kern 1pt} (t)), $$
(22)

where A is a nonlinear mapping from a given solution (a d-dimensional vector \( \varvec{x}^{t} \)) to a new solution vector \( \varvec{x}^{t + 1} \).

Representation (22) gives rise to two types of optimality: optimality of a problem and optimality of an algorithm [4, 31]. For an optimization problem such as \( \hbox{min} \;f(\varvec{x}) \), there is a global optimal solution whatever the algorithmic tool we may use to find this optimality. This is the optimality for the optimization problem. On the other hand, for a given problem Φ with an objective function \( f(\varvec{x}{\kern 1pt} ) \), there are many algorithms that can solve it. Some algorithms may require less computational effort than others. There may be the best algorithm with the least computing cost, though this may not be unique. However, this is not our concern here. Once we have chosen an algorithm A to solve a problem Φ, there is an optimal parameter setting for this algorithm so that it can achieve the best performance. This optimality depends on both the algorithm itself and the problem it solves. In the rest of this chapter, we will focus on this type of optimality.

That is, the optimality to be achieved is

$$ {\text{Maximize}}\;{\text{the}}\;{\text{performance}}\;{\text{of}}\;\xi = A(\varPhi ,\varvec{p},\varepsilon ), $$
(23)

for a given problem Φ and a chosen algorithm \( A(.,\varvec{p},\varepsilon {\kern 1pt} ) \). We will denote this optimality as \( \xi_{*} = A_{*} (\varPhi ,{\kern 1pt} \varvec{p}_{*} ) = \xi (\varPhi ,\varvec{p}_{*} ) \), where \( \varvec{p}_{*} \) is the optimal parameter setting for this algorithm so that its performance is the best. Here, we have used the fact that ε is a random vector that can be drawn from some known probability distributions, and thus the randomness vector should not be related to the optimality of the algorithm.

It is worth pointing out that there is another potential optimality. That is, for a given problem, a chosen algorithm with the best parameter setting \( \varvec{p}_{*} \), we can still use different random numbers drawn from various probability distributions and even chaotic maps, so that even better performance may be achieved. Strictly speaking, if an algorithm \( A(.,.,\varepsilon ) \) has a random vector ε that is drawn from a uniform distribution \( \varepsilon_{1} { \sim }U(0,1) \) or from a Gaussian \( \varepsilon_{2} { \sim }N(0,1) \), it becomes two algorithms \( A_{1} = A(.,.,\;\varepsilon_{1} ) \) and \( A_{2} = A(.,.,\;\varepsilon_{2} ) \). Technically speaking, we should treat them as different algorithms. Since our emphasis here is about parameter tuning so as to find the optimal setting of parameters, we will omit the effect of randomness vectors, and thus focus on

$$ {\text{Maximize}}\;\xi = A(\varPhi ,\varvec{p}). $$
(24)

In essence, tuning an algorithm involves tuning its algorithm-dependent parameters. Therefore, parameter tuning is equivalent to algorithm tuning in the present context.

4.1 Parameter Tuning

In order to tune \( A(\varPhi ,\varvec{p}) \) so as to achieve its best performance, a parameter-tuning tool, i.e., a tuner, is needed. Like tuning a high-precision machine, sophisticated tools are required. For tuning parameters in an algorithm, what tool can we use? One way is to use a better, existing tool (say, algorithm B) to tune algorithm A. Now the question becomes: how do we know B is better? Is B well-tuned? If yes, how was B tuned in the first place? Naively, we could use another tool (say, algorithm C) to tune B. Now again the question becomes how algorithm C has been tuned. This can go on and on, until the end of a long chain, say, algorithm Q. In the end, we still need some tool/algorithm to tune Q, which brings us back to the original question: how do we tune algorithm A so that it performs best?

It is worth pointing out that even if we have good tools to tune an algorithm, the best parameter setting and thus the performance depend on the performance measures used in the tuning. Ideally, the tuned parameters should be robust to minor parameter changes, random seeds, and even the problem instance. However, in practice, this may not be achievable. According to Eiben [32], parameter tuning can be divided into iterative and non-iterative tuners, and single-stage and multi-stage tuners. The meaning of these terminologies is self-explanatory. In terms of the actual tuning methods, existing methods include sampling methods, screening methods, model-based methods, and metaheuristic methods. Their success and effectiveness can vary, and thus there are no well-established methods for universal parameter tuning.

4.2 Framework for Self-Tuning Algorithms

From our earlier observations and discussions, it is clear that parameter tuning is the process of optimizing the optimization algorithm; therefore, it is a hyper-optimization problem. In essence, a tuner is a meta-optimization tool for tuning algorithms [31].

For a standard unconstrained optimization problem, the aim is to find the global minimum \( f_{*} \) of a function \( f(\varvec{x}) \) in a d-dimensional space. That is,

$$ {\text{Minimize}}\;f(\varvec{x}{\kern 1pt} ),\quad \varvec{x} = (x_{1} ,x_{2} , \ldots ,x_{d} ). $$
(25)

Once we choose an algorithm A to solve this optimization problem, the algorithm will find a minimum solution \( f_{\hbox{min} } \) which may be close to the true global minimum \( f_{*} \). For a given tolerance δ, this may require \( t_{\delta } \) iterations to achieve \( |f_{\hbox{min} } - f_{*} | \le \delta \). Obviously, the actual \( t_{\delta } \) will largely depend on both the problem objective \( f(\varvec{x}{\kern 1pt} ) \) and the parameters p of the algorithm used.

The main aim of algorithm-tuning is to find the best parameter setting \( \varvec{p}_{*} \) so that the computational cost or the number of iterations \( t_{\delta } \) is the minimum. Thus, parameter tuning as a hyper-optimization problem can be written as

$$ {\text{Minimize}}\;t_{\delta } = A(f(\varvec{x}{\kern 1pt} ),\varvec{p}), $$
(26)

whose optimality is \( \varvec{p}_{*} \).

Ideally, the parameter vector \( \varvec{p}_{*} \) should be sufficiently robust. For different types of problems, any slight variation in \( \varvec{p}_{*} \) should not affect the performance of A much, which means that \( \varvec{p}_{*} \) should lie in a flat range, rather than at a sharp peak in the parameter landscape.
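
To make the hyper-optimization view in Eq. (26) concrete, the following Python sketch measures \( t_{\delta } \) for a small grid of candidate values of p and keeps the value with the smallest cost; the tuned algorithm (the modified Newton-Raphson iteration of Eq. (3) on a quartic test function), the grid of candidates and the tolerance handling are illustrative assumptions.

```python
# A minimal sketch of Eq. (26): treat the number of iterations t_delta needed
# to reach |f_min - f_*| <= delta as the cost of a parameter setting p, and
# pick the p with the smallest cost from a small grid. The tuned algorithm is
# the modified Newton-Raphson iteration of Eq. (3) on the quartic
# f(x) = (x - 2)^4; both the grid and the test function are illustrative.
def t_delta(p, delta=1e-5, x0=5.0, max_iter=1000):
    x, f_star = x0, 0.0
    for t in range(1, max_iter + 1):
        x = x - p * (4 * (x - 2) ** 3) / (12 * (x - 2) ** 2)   # Eq. (3)
        if abs((x - 2) ** 4 - f_star) <= delta:
            return t
    return max_iter

candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
costs = {p: t_delta(p) for p in candidates}
p_star = min(costs, key=costs.get)
print(costs, "best p:", p_star)
```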

4.3 A Multiobjective View

If we look at the algorithm tuning process from a different perspective, it is possible to formulate it as a multiobjective optimization problem with two objectives: one objective \( f(\varvec{x}) \) for the problem Φ and one objective \( t_{\delta } \) for the algorithm. That is,

$$ {\text{Minimize}}\;f(\varvec{x})\;{\text{and}}\;{\text{Minimize}}\;t_{\delta } = A(f(\varvec{x}),\varvec{p}), $$
(27)

where \( t_{\delta } \) is the (average) number of iterations needed to achieve a given tolerance δ so that the found minimum \( f_{\hbox{min} } \) is close enough to the true global minimum \( f_{*} \), satisfying \( |f_{\hbox{min} } - f_{*} | \le \delta \).

This means that for a given tolerance δ, there will be a set of best parameter settings with a minimum \( t_{\delta } \). As a result, the bi-objectives will form a Pareto front. In principle, this bi-objective optimization problem (27) can be solved by any methods that are suitable for multiobjective optimization. But as δ is usually given, a natural way to solve this problem is to use the so-called ϵ-constraint or δ-constraint methods. The naming may be dependent on the notations; however, we will use δ-constraints.

For a given \( \delta \ge 0 \), we change one of the objectives (i.e., \( f(\varvec{x}) \)) into a constraint, and thus the above problem (27) becomes a single-objective optimization problem with a constraint. That is

$$ {\text{Minimize}}\;t_{\delta } = A(f(\varvec{x}{\kern 1pt} ),\varvec{p}), $$
(28)

subject to

$$ f(\varvec{x}{\kern 1pt} ) \le \delta . $$
(29)

In the rest of this chapter, we will set \( \delta = 10^{ - 5} \).

The important thing is that we still need an algorithm to solve this optimization problem. However, the main difference from a common single-objective problem is that the present problem contains an algorithm A. Ideally, an algorithm should be independent of the problem, treating the objective to be solved as a black box. Thus we have \( A(.,\varvec{p},\varepsilon {\kern 1pt} ) \); however, in reality, an algorithm will be used to solve a particular problem Φ with an objective \( f(\varvec{x}) \). Therefore, both notations \( A(.,\varvec{p}) \) and \( A(f(\varvec{x}{\kern 1pt} ),\varvec{p}) \) will be used here.

4.4 Self-Tuning Framework

This framework has been proposed by Yang et al. in 2013 [31]. In principle, we can solve (28) by any efficient or well-tuned algorithm. Now a natural question is: Can we solve this algorithm-tuning problem by the algorithm A itself? There is no reason why we cannot. In fact, if we solve (28) by using A, we have a self-tuning algorithm. That is, the algorithm automatically tunes itself for a given problem objective to be optimized. This essentially provides a framework for a self-tuning algorithm as shown in Fig. 1.

Fig. 1 A framework for a self-tuning algorithm

This framework is generic in the sense that any algorithm can be tuned this way, and any problem can be solved within this framework. This essentially achieves two goals simultaneously: parameter tuning and optimality finding.

4.5 Self-Tuning Firefly Algorithm

Now let us use the framework outlined earlier to tune the firefly algorithm (FA). As we have seen earlier, FA has the following updating equation:

$$ \varvec{x}_{i}^{t + 1} = \varvec{x}_{i}^{t} + \beta_{0} e^{{ - \gamma r_{ij}^{2} }} (\varvec{x}_{j}^{t} - \varvec{x}_{i}^{t} ) + \alpha\varvec{\epsilon}_{i}^{t} , $$
(30)

which contains four parameters: \( \alpha ,\beta_{0} ,\gamma \) and the population size n. For simplicity of parameter tuning, we set \( \beta_{0} = 1 \) and \( n = 20 \), and therefore the two parameters to be tuned are \( \gamma > 0 \) and \( \alpha > 0 \). It is worth pointing out that γ controls the scaling, while α controls the randomness. For this algorithm to converge properly, randomness should be gradually reduced, and one way to achieve such a reduction is to use

$$ \alpha = \alpha_{0} \theta^{t} ,\quad \theta \in (0,1), $$
(31)

where t is the index of iterations/generations. Here \( \alpha_{0} \) is the initial randomness factor, and we can set \( \alpha_{0} = 1 \) without loss of generality. Therefore, the two parameters to be tuned become γ and θ. This framework works well, as shown by Yang et al. [31].
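
The following Python sketch illustrates one way to realize the self-tuning framework of Fig. 1: each firefly carries its own copies of (γ, θ) appended to its solution, so the problem is solved and the parameters are tuned simultaneously. The parameter bounds, the budget, the way the parameter copies are moved and the sphere test function are illustrative assumptions, not the exact setup of [31].

```python
import numpy as np

# A minimal sketch of a self-tuning firefly algorithm: each firefly carries
# its own (gamma, theta), so solving the problem and tuning the parameters
# happen simultaneously. Bounds, budget and the sphere test function are
# illustrative assumptions.
def self_tuning_fa(f, dim=5, n=20, beta0=1.0, iters=200, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n, dim))                 # problem variables
    gamma_ = rng.uniform(0.01, 2.0, n)                # per-firefly gamma
    theta = rng.uniform(0.5, 0.99, n)                 # per-firefly theta
    light = np.array([f(xi) for xi in x])
    alpha = np.ones(n)                                # per-firefly alpha_0 = 1
    for _ in range(iters):
        alpha *= theta                                # Eq. (31) with each firefly's theta
        for i in range(n):
            for j in range(n):
                if light[j] < light[i]:
                    r2 = np.sum((x[i] - x[j]) ** 2)
                    attract = beta0 * np.exp(-gamma_[i] * r2)
                    # Move both the solution and the parameter copies, Eq. (30).
                    x[i] = np.clip(x[i] + attract * (x[j] - x[i])
                                   + alpha[i] * rng.normal(size=dim), lb, ub)
                    gamma_[i] = np.clip(gamma_[i] + attract * (gamma_[j] - gamma_[i])
                                        + 0.1 * alpha[i] * rng.normal(), 0.01, 2.0)
                    theta[i] = np.clip(theta[i] + attract * (theta[j] - theta[i])
                                       + 0.1 * alpha[i] * rng.normal(), 0.5, 0.99)
                    light[i] = f(x[i])
    best = light.argmin()
    return x[best], (gamma_[best], theta[best]), light[best]

sol, tuned, fmin = self_tuning_fa(lambda z: float(np.sum(z ** 2)))
print("tuned (gamma, theta):", tuned, "f_min:", fmin)
```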

5 Convergence Analysis

There is some solid mathematical analysis of nature-inspired algorithms such as differential evolution, genetic algorithms [33], simulated annealing and particle swarm optimization. In recent years, more theoretical results have appeared, and in the rest of this section, we summarize some of the recent advances.

5.1 Global Convergence of Cuckoo Search

Wang et al. provided a mathematical proof of global convergence for the standard cuckoo search, and their approach is based on Markov chain theory [34]. Their proof can be outlined as follows:

As there are two branches in the updating formulas, the local search step contributes mainly to local refinement, while the main mobility or exploration is carried out by the global search step. In order to simplify the analysis and also to emphasize the global search capability, we now use a simplified version of cuckoo search. That is, we use only the global branch with a random number \( r \in [0,1] \), compared with a discovery/switching probability \( p_{a} \). Now we have

$$ \left\{ {\begin{array}{*{20}l} {\varvec{x}_{i}^{(t + 1)} \leftarrow \varvec{x}_{i}^{(t)} } \hfill & {{\text{if}}\;r < p_{a} ,} \hfill \\ {\varvec{x}_{i}^{(t + 1)} \leftarrow \varvec{x}_{i}^{(t)} + \alpha \otimes L(\uplambda )} \hfill & {{\text{if}}\;r > p_{a} .} \hfill \\ \end{array} } \right. $$
(32)

As our cuckoo search algorithm is a stochastic search algorithm, we can summarize it as the following key steps:

  1. Randomly generate an initial population of n nests at positions \( \varvec{X} = \{ \varvec{x}_{1}^{0} ,\varvec{x}_{2}^{0} , \ldots ,\varvec{x}_{n}^{0} \} \), then evaluate their objective values so as to find the current global best \( \varvec{g}_{t}^{0} \).

  2. Update the new solutions/positions by

     $$ \varvec{x}_{i}^{(t + 1)} = \varvec{x}_{i}^{(t)} + \alpha \otimes L(\uplambda ). $$
     (33)

  3. Draw a random number r from a uniform distribution [0,1]. Update \( \varvec{x}_{i}^{(t + 1)} \) if \( r > p_{a} \). Then, evaluate the new solutions so as to find the new global best \( \varvec{g}_{t}^{*} \).

  4. If the stopping criterion is met, then \( \varvec{g}_{t}^{*} \) is the best global solution found so far. Otherwise, return to step 2.

The global convergence of an algorithm. If f is measurable and the feasible solution space Ω is a measurable subset of \( \Re^{n} \), and algorithm A satisfies the two convergence conditions (given in [34]) with the search sequence \( \{ x_{k} \}_{k = 0}^{\infty } \), then

$$ \mathop {\lim }\limits_{k \to \infty } P(x_{k} \in R_{\epsilon ,M} ) = 1. $$
(34)

That is, algorithm A converges globally with a probability of one. Here \( P(x_{k} \in R_{\epsilon ,M} ) \) is the probability measure, at the kth iteration, of the kth solution being in the optimality region \( R_{\epsilon ,M} \).

The state and state space. The position of a cuckoo/nest and its global best solution g in the search history form the state of a cuckoo: \( y = (x,g) \), where \( x,g \in \varOmega \) and \( f(g) \le f(x) \). The set of all possible states forms the state space, denoted by

$$ Y = \{ y = (x,g)|x,g \in \varOmega ,f(g) \le f(x)\} . $$
(35)

The states and state space of the cuckoo group/population. The states of all n cuckoos/nests form the states of the group, denoted by \( q = (y_{1} ,y_{2} , \ldots ,y_{n} ) \). All the states of all the cuckoos form a state space for the group, denoted by

$$ Q = \{ q = (y_{1} ,y_{2} , \ldots ,y_{n} ),y_{i} \in Y,1 \le i \le n\} . $$
(36)

Obviously, Q contains the historical global best solution \( g^{*} \) for the whole population and all individual best solutions \( g_{i} (1 \le i \le n) \) in history. In addition, the global best solution of the whole population is the best among all \( g_{i} \), so that \( f(g^{*} ) = \hbox{min} (f(g_{i} )),\;1 \le i \le n \).

The transition probability from state \( y_{1} \) to \( y_{2} \) in cuckoo search is

$$ P(T_{y} (y_{1} ) = y_{2} ) = P(x_{1} \to x^{\prime}_{{1}} )P(g_{1} \to g^{\prime}_{{1}} )P(x^{\prime}_{{1}} \to x_{2} )P(g^{\prime}_{{1}} \to g_{2} ), $$
(37)

where \( P(x_{1} \to x^{\prime}_{{1}} ) \) is the transition probability at Step 2 in cuckoo search, and \( P(g_{1} \to g^{\prime}_{{1}} ) \) is the transition probability for the historical global best at this step. \( P(x^{\prime}_{{1}} \to x_{2} ) \) is the transition probability at Step 3, while \( P(g^{\prime}_{{1}} \to g_{2} ) \) is the transition probability of the historical global best.

For the globally optimal solution \( g_{b} \) of an optimization problem \( < \varOmega ,f > \), the optimal state set is defined as \( R = \{ y = (x,g)|f(g) = f(g_{b} ),y \in Y\} \).

For the globally optimal solution \( g_{b} \) to an optimization problem \( < \varOmega ,f > \), the optimal group state set can be defined as

$$ H = \{ q = (y_{1} ,y_{2} , \ldots ,y_{n} )|\exists y_{i} \in R,1 \le i \le n\} . $$
(38)

All these ensure that the convergence conditions are met. Further detailed mathematical analysis proves that, as the number of iterations becomes sufficiently large [34], the group state sequence will converge to the optimal state/solution set H. Therefore, cuckoo search has guaranteed global convergence.

5.2 Convergence of the Bat Algorithm

Huang et al. have carried out a detailed convergence analysis for the bat algorithm using finite Markov process theory [35].

In theory, for an algorithm whose transition matrix is an order-m reducible stochastic matrix P, the matrix can be rewritten as

$$ P = \left( {\begin{array}{*{20}c} S & 0 \\ R & T \\ \end{array} } \right), $$
(39)

where \( R \ne 0 \), \( T \ne 0 \), and S is an order-q stochastic matrix (with \( q < m \)). Then, we have

$$ P^{\infty } = \mathop {\lim }\limits_{k \to \infty } P^{k} = \mathop {\lim }\limits_{k \to \infty } \left( {\begin{array}{*{20}c} {S^{k} } & 0 \\ {\sum\limits_{i = 1}^{k - 1} {T^{i} RS^{k - i} } } & {T^{k} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {S^{\infty } } & 0 \\ {R^{\infty } } & 0 \\ \end{array} } \right), $$
(40)

which is a stable stochastic matrix and independent of the initial distribution. In addition, we also have

$$ P^{\infty } = [p_{ij} ]_{m \times m} ,\quad \left\{ {\begin{array}{*{20}l} {p_{ij} > 0,} \hfill & {(1 \le i \le m,1 \le j \le q),} \hfill \\ {p_{ij} = 0,} \hfill & {(1 \le i \le m,q < j \le m).} \hfill \\ \end{array} } \right. $$
(41)

The search algorithm will converge to the global optimum with probability one, starting from any initial random state, if the transition probability p to a better solution/state satisfies \( p > 0 \). Conversely, if the transition probability to a worse state dominates, the algorithm will not converge.

With this main result, it has been proved that PSO will not converge to the global optimality [36], while the bat algorithm will converge to the true global optimality [35].

Huang et al. concluded that, for unconstrained function optimization, the bat algorithm satisfies all the conditions for guaranteed global convergence. For nonlinear constrained problems, the bat algorithm with an additional initialization using orthogonal Latin squares will converge, with guaranteed global convergence to the true global optimum. They further concluded that

$$ S^{\infty } = (1),\quad R^{\infty } = (1,1, \ldots ,1)^{T} , $$
(42)

and

$$ P^{\infty } = \left( {\begin{array}{*{20}l} 1 \hfill & 0 \hfill & \ldots \hfill & 0 \hfill \\ 1 \hfill & 0 \hfill & \ldots \hfill & 0 \hfill \\ \vdots \hfill & \vdots \hfill & {} \hfill & \vdots \hfill \\ 1 \hfill & 0 \hfill & \ldots \hfill & 0 \hfill \\ \end{array} } \right), $$
(43)

which leads to

$$ \mathop {\lim }\limits_{t \to \infty } p\{ f(\varvec{x}{\kern 1pt} ) \to f(\varvec{x}_{*} )\} = 1. $$
(44)

That is, the global convergence is guaranteed.

Huang et al. also proposed a BA variant, called modified bat algorithm (MBA) [35], which can further improve the convergence rate with guaranteed global optimality. They also showed that this variant is suitable for large-scale, global optimization.

For the moment, most convergence studies provide results only about the long-term behaviour of an algorithm during iterations; however, there are not enough results about convergence rates to indicate how quickly an algorithm can converge and under what conditions. Obviously, more theoretical results are needed to analyze these algorithms further.

6 Discussions and Open Problems

Despite the huge success of nature-inspired algorithms, there are still some challenging, open problems that need to be addressed. These open problems include the balance of exploration and exploitation, selection mechanisms, right amount of randomization, parameter tuning as well as parameter control, scalability and others.

  • Mathematical Framework: There is still a lack of a general mathematical framework for analyzing the convergence and stability of metaheuristic algorithms. There are some good results using Markov chains, dynamical systems and self-organization theory, but a systematic framework is yet to be developed.

  • Exploration and exploitation: A key problem is how to balance exploration and exploitation in an algorithm so that it can deal with a vast range of problems efficiently [37]. In reality, the amount of exploration and exploitation may depend on the type of problem, and therefore some a priori knowledge of the problem to be solved can help to determine such a balance. However, it is not known how to incorporate such knowledge effectively. For example, gradient/derivative information obtained from the objective function can be very useful for exploitation, but if such exploitation is too strong, it can cause the system to be trapped in a local optimum, thus sacrificing the possibility of finding the true global optimum. There may not exist such an optimal balance for all problems [38].

  • Selection Mechanism: The selection mechanism is also very important, and it is not known what form of selection is most effective. A proper selection pressure is crucial to maintain a healthy population. For example, when many solutions have similar fitness, their fitness values may be numerically almost the same, and how to select certain solutions becomes tricky. Typical approaches include re-scaled fitness values, ranking of solutions, and adaptive elitism. However, it is not clear if these can work for all algorithms or if there are other, better ways to handle selection.

  • Right Amount of Randomness: In order to balance exploration and exploitation, a right amount of randomness is needed. However, no one knows what amount is the right amount. At one extreme, if there is no randomness, an algorithm becomes a deterministic algorithm, and thus loses the ability to explore. At the other extreme, if the search is dominated by high randomness, the algorithm becomes a random search, and thus significantly reduces its ability to exploit the landscape information. In fact, it is not known how to control randomness properly so as to balance exploration and exploitation most effectively.

  • Parameter Tuning and Control: As the performance of almost any algorithm will depend on its parameter settings, how to tune these parameters to achieve the best performance is a higher level optimization problem. In fact, this is the optimization of an optimization algorithm. It is still an open question. Similarly, how to control the parameters by varying their values to achieve the best overall performance is also a key challenging issue.

  • Dynamic Landscape: The problems that have been solved in the current literature usually have a fixed landscape. That is, once the problem is defined, its landscape in the search space remains unchanged. However, for dynamic problems and problems with noise, the search landscape can change with time. In such cases, adaptation can be more sophisticated and challenging. It is not clear if most current methods can still work well in such time-dependent, noisy environments.

  • Scalability: How to solve high-dimensional, large-scale problems effectively? At the moment, most case studies using metaheuristic algorithms are small-scale problems. It is not clear if these algorithms are scalable to deal with large-scale problems effectively.

Therefore, in-depth understanding and theoretical results are needed. Possible research routes may require a combination of mathematical analysis, numerical simulations and empirical observations, as well as other tools such as dynamical systems theory, Markov theory, self-organization theory and probability. It may even require a paradigm shift in analyzing metaheuristic algorithms. There is no doubt that any theoretical results will provide tremendous insight into understanding metaheuristic algorithms.

All these challenges can present golden opportunities for further research in analyzing adaptation and diversity in metaheuristic algorithms. It can be expected that more efficient tools may be developed to solve more complex, real-world problems with a diverse range of applications.