1 Introduction

Let us consider the following questions:

  • “Why are the peacock’s feathers so incredibly beautiful?”

  • “Why did the giraffe’s neck become so long?”

  • “If a worker bee cannot have any offspring of its own, why does it work so hard to serve the queen bee?”

If we make a serious effort to answer these mysteries, we realize that each amounts to solving an optimization problem, namely one posed by the process of evolution of a species. The objective of bio-inspired methods is to exploit this observation to build effective computing systems. Evolutionary computation (EC) and meta-heuristics attempt to “borrow” Nature’s methods of problem solving. They have been widely applied to find solutions to optimization problems, to automatically synthesize programs, and to accomplish other AI (artificial intelligence) tasks, such as effective learning and the formation of hypotheses. These methods imitate the evolutionary and biological mechanisms of living organisms to create, combine, and select data structures. They are widely applied in the field of deep neural evolution.

In the following sections, we will explain these methodologies in detail.

2 Evolutionary Algorithms: From Bullet Trains to Finance and Robots

Evolutionary computation is an engineering method that imitates the mechanisms of evolution in living organisms and applies them to the transformation, synthesis, and selection of data structures. Using this method, we aim to solve optimization problems and to generate beneficial structures. Representative examples are the computational algorithms known as genetic algorithms (GA) and genetic programming (GP).

The basic data structures in evolutionary computation are based on knowledge of genetics. Hereafter, we shall provide an explanation of these.

The information used in evolutionary computation has a two-layer structure of PTYPE and GTYPE. The GTYPE (genotype, also called the genetic code, corresponding to the chromosomes within a cell) is, by analogy with genetics, a low-level, locally regulated set of data; it is the object on which the genetic operators of evolutionary computation act, as described later. The PTYPE is the phenotype: it expresses the behavior and structures that emerge globally as the GTYPE develops within an environment. Fitness is determined by how well the PTYPE adapts to its environment, and selection relies on the fitness of the PTYPE (Fig. 1.1). For now, assume that the higher the fitness value, the better. Thus, of two individuals with fitness values of 1.0 and 0.3, the former is better adapted to the environment and more likely to survive (in other parts of this book, however, there are cases where a smaller value is better).

Fig. 1.1 GTYPE and PTYPE

We shall explain the basic framework of evolutionary computation based on the above description (Fig. 1.2). Here, we consider a population of several dogs; we shall call this generation t. Each dog has its genetic code as a GTYPE, and its fitness is determined by the resulting PTYPE. In the diagram, the fitness of each dog is shown as the value near the dog (remember that larger is better). These dogs reproduce and create the descendants of the next generation t + 1. In reproduction, the better (higher) the fitness, the more descendants an individual is likely to create, and the worse (lower) the fitness, the more easily its line becomes extinct (in biological terms, this is natural selection). In the diagram, the slight changes in the phenotypes caused by reproduction are drawn schematically. As a result, the fitness of each individual in generation t + 1 is expected to be better than that of the previous generation, and the fitness of the population as a whole also increases. In the same way, the dogs of generation t + 1 become parents and produce the descendants of generation t + 2. As this is repeated over the generations, the population as a whole improves: this is the basic mechanism of evolutionary computation.

Fig. 1.2 Image of evolutionary computation

For reproduction, the operators shown in Fig. 1.3 are applied to the GTYPEs to produce the GTYPEs of the next generation. For simplicity, the GTYPE is expressed here as a one-dimensional array. Each operator is an analogy of genetic recombination, mutation, and so on in organisms. The application frequency and the application sites of these operators are generally determined at random.

Fig. 1.3 Genetic operators

Normally, the following kinds of methods are used for selection.

  • Roulette selection: This is a method of selecting individuals with probability proportional to their fitness. A roulette wheel is created in which each individual receives an area proportional to its fitness; the wheel is spun, and the individual on whose area it lands is selected.

  • Tournament selection: A fixed number of individuals (the tournament size) are chosen at random from the population, and the individual with the highest fitness among them is selected. This process is repeated as many times as there are individuals to select.

  • Elite strategy: Several individuals with the highest fitness are copied unchanged into the next generation. This prevents the best individuals from being lost by chance during selection. This strategy is used in combination with the above two methods.

With the elite strategy, the results will not get worse in the next generation as long as the environment does not change. For this reason, it is frequently applied in engineering. Note, however, that the flip side is a loss of diversity.
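In code, these three selection schemes can be sketched as follows. This is a minimal illustration of our own (not from any standard library), assuming non-negative fitness values where larger is better:

```python
import random

def roulette_selection(population, fitnesses):
    # Spin a wheel whose slots have areas proportional to fitness.
    total = sum(fitnesses)
    r = random.uniform(0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if acc >= r:
            return individual
    return population[-1]  # guard against floating-point round-off

def tournament_selection(population, fitnesses, tournament_size=3):
    # Pick `tournament_size` individuals at random; keep the fittest of them.
    contestants = random.sample(range(len(population)), tournament_size)
    return population[max(contestants, key=lambda i: fitnesses[i])]

def next_generation(population, fitnesses, elite_rate=0.1):
    # Elite strategy: copy the top fraction unchanged, then fill the rest
    # by (here) tournament selection; reproduction operators would follow.
    n_elite = int(elite_rate * len(population))
    ranked = sorted(zip(fitnesses, population), key=lambda t: t[0], reverse=True)
    elites = [ind for _, ind in ranked[:n_elite]]
    rest = [tournament_selection(population, fitnesses)
            for _ in range(len(population) - n_elite)]
    return elites + rest
```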

To summarize, generational change in evolutionary computation proceeds as shown in Fig. 1.4. In the figure, G is the elite rate (the fraction of top-ranked individuals copied unchanged into the next generation); the reproduction rate is then 1 − G.

Fig. 1.4 Selection and reproduction of GA

Many biological riddles remain concerning the evolution of mimicry, and research that tries to solve these mysteries through computer simulation based on evolutionary computation is flourishing.

Evolutionary computation is used in a variety of areas of our everyday lives. For example, it played a major role in creating the original shape of the front car of the Japanese N700 series bullet train (Fig. 1.5a). The N700 can take curves at 270 km/h, 20 km/h faster than the previous model. However, with the traditional front-car shape, this speedup would have increased the micro-pressure waves in tunnels, which are a cause of noise. To solve this difficulty, the original shape known as the “Aero double wing” was derived from approximately 5000 simulations using evolutionary computation. Furthermore, in the wing design of the MRJ (Mitsubishi Regional Jet, the first domestically produced passenger jet in Japan), a method known as multi-objective evolutionary computation was used (Fig. 1.5b). With this method, the two objectives of improving fuel efficiency and reducing noise outside the engine were optimized simultaneously, and performance was improved compared to competing models.

Fig. 1.5 EC applications. (a) N700 series bullet train. (b) MRJ (Mitsubishi Regional Jet). (c) Collaborative transportation by humanoid robots

In fields other than engineering, such as finance, the use of evolutionary computation is spreading. Investment funds use it as a practical technology for portfolio construction and market prediction (see [10] for details). It also has practical applications in scheduling, such as optimizing the work shifts of nurses and allocating crews to aircraft. Another field that uses evolutionary computation is evolutionary robotics. For example, Fig. 1.5c shows an example of cooperative work (collaborative transportation) by evolutionary humanoid robots. Here, a learning model is used that applies co-evolution to evolutionary computation. Furthermore, modular robots, which reshape themselves by recombining blocks in accordance with the terrain, environment, and task, are gaining attention. This technology is even being used by NASA (National Aeronautics and Space Administration) to research robot forms optimized for surveying the constrained environment of Mars. The forms of the organisms we know are only those of the species remaining on earth. These are forms that match the earth’s environment, and it is not known whether they are optimal. If we can reproduce the process of evolution on a computer through evolutionary computation, new forms may emerge that we do not yet know. The result may be the evolution of robots suited to Mars and to unknown planets (see [17]).

3 Multi-Objective Optimization

An evolutionary algorithm can take competing goals into consideration and can comply with user-supplied policies regarding limits on and preferences among these goals. Evolutionary algorithms that deal with multiple objectives are usually called MOEAs (Multi-Objective Evolutionary Algorithms).

Assume you are engaged in transport planning for a town [6]. The means of reducing traffic accidents range from installing traffic lights, placing more traffic signs, and regulating traffic, to setting up checkpoints (Fig. 1.6). Each involves a different cost, and the number of traffic accidents will vary with the chosen approach. Let us assume that five means (A, B, C, D, and E) are available, and that the cost and the predicted accident numbers are

$$\displaystyle \begin{aligned} \begin{array}{rcl} A &\displaystyle = &\displaystyle (2,10)\\ B &\displaystyle = &\displaystyle (4,6)\\ C &\displaystyle = &\displaystyle (8,4)\\ D &\displaystyle = &\displaystyle (9,5)\\ E &\displaystyle = &\displaystyle (7,8), \end{array} \end{aligned} $$

where the first element is the cost and the second is the predicted accident number, as plotted in Fig. 1.6. The natural impulse is to desire attainment of both goals in full: the lowest cost and the lowest predicted accident number. Unfortunately, it is not necessarily possible to attain both objectives by the same means and thus not possible to optimize both at the same time.

Fig. 1.6 Cost vs. expected numbers of accidents

In such situations, the concept of “Pareto optimality” is useful. For a given solution to be Pareto optimal, there must be no other solution that is equally good or better with respect to all of the evaluation functions, that is, the fitness functions.

Let us look again at Fig. 1.6. Note that the points in the graph increase in desirability as we move toward the lower left. A, B, and C in particular appear to be good candidates. None of these three candidates is the best in both dimensions, that is, in both “evaluations,” but for each there is no other candidate that is better in both evaluations. Such points are called “non-dominated” points. Points D and E, in contrast, are both “dominated” by other points and therefore less desirable. E is dominated by B, as B is better than E in both evaluations:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mbox{Cost of B}(4) &\displaystyle < &\displaystyle \mbox{Cost of E}(7)\\ \mbox{Predicted accidents for B}(6) &\displaystyle < &\displaystyle \mbox{Predicted accidents for E}(8). \end{array} \end{aligned} $$

D is similarly dominated by C. In this example, therefore, the Pareto optima are A, B, and C. As this suggests, the concept of the Pareto optimum alone cannot select a single candidate from a group, and so it cannot be concluded which of A, B, and C is the best.

Pareto optimality may be defined more formally as follows. Let two points x = (x_1, …, x_n) and y = (y_1, …, y_n) exist in an n-dimensional search space, with each dimension representing an objective (evaluation) function, and with the goal being to minimize each objective as far as possible. The domination of y by x (written as x <_p y) is then defined as

$$\displaystyle \begin{aligned} x<_py \Longleftrightarrow (\forall i)(x_i \leq y_i) \land (\exists i)(x_i < y_i). \end{aligned} $$
(1.1)

In the following, we will refer to n (the number of different evaluation functions) as the “dimension number.” Any point that is not dominated by any other point is called “non-dominated” or “non-inferior,” and the curve (or curved surface) formed by the set of Pareto optimal solutions is called the “Pareto front.”
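Definition (1.1) translates directly into code. The following minimal sketch (the helper names are our own) tests dominance and extracts the non-dominated set from the five candidates A–E, minimizing both cost and predicted accidents:

```python
def dominates(x, y):
    # x <_p y: x is no worse in every objective and strictly better in at least one.
    return (all(xi <= yi for xi, yi in zip(x, y))
            and any(xi < yi for xi, yi in zip(x, y)))

def pareto_front(points):
    # Keep every point that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

candidates = {"A": (2, 10), "B": (4, 6), "C": (8, 4), "D": (9, 5), "E": (7, 8)}
front = pareto_front(list(candidates.values()))
# front == [(2, 10), (4, 6), (8, 4)]: A, B, and C survive; D and E are dominated.
```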

The main problem that MOEAs face is how to combine the multiple objectives into a metric that can be used to perform selection, in other words, how to take all objectives into account when selecting individuals from one generation for crossover. Readers should refer to [2, 5, 13] for studies on multi-objective optimization methods.

4 Genetic Programming and Its Genome Representation

4.1 Tree-based Representation of Genetic Programming

The aim of genetic programming (GP) is to extend the genetic representations of genetic algorithms (GA) to trees and graphs and to apply them to the synthesis of programs and the formation of hypotheses or concepts. Researchers are using GP in attempts to improve software for designing control systems and structures for robots.

The procedures of GA are extended in GP to handle graph structures (in particular, tree structures). Tree structures are well described by S-expressions in LISP, so it is quite common to handle LISP programs as “genes” in GP. As long as the reader understands that a program is expressed in tree format, he or she should have little trouble reading a LISP program (recall the principles of flow charts). The explanations below are presented so as to be quickly understood by a reader who does not know LISP.

A tree is a graph containing no cycles. More precisely, a tree is an acyclic connected graph with one node designated as the root. Consider, for example, a tree whose root A has two children, B and C, and in which C has a single child, D. A tree structure can be expressed as a parenthesized expression; this tree would be written as follows:

(A (B)    (C (D))).

In addition, the above can be simplified to the following expression:

(A B   (C D)).

This notation is called an “S-expression” in LISP. Hereinafter, a tree structure will be identified with its corresponding S-expression. The following terms will be used for the tree structure:

  • Node: Symbolized with A, B, C, D, etc.

  • Root: A

  • Terminal node: B, D (also called a “terminal symbol” or “leaf node”)

  • Non-terminal node: A, C (also called a “non-terminal symbol”; it serves as a function of the S-expression)

  • Child: From the viewpoint of A, nodes B and C are children (also, “arguments of function A”)

  • Parent: The parent of C is A

Other common phrases will also be used as convenient, including “number of children,” “number of arguments,” “grandchild,” “descendant,” and “ancestor.” These are not explained here, as their meanings should be clear from the context.

The following genetic operators acting on the tree structure will be incorporated:

  1. Gmutation: Alteration of a node label

  2. Ginversion: Reordering of siblings

  3. Gcrossover: Exchange of subtrees

These are natural extensions of the conventional genetic operators, which act on bit strings. The operators are shown below in examples where they are applied to LISP expression trees (S-expressions) (Fig. 1.7). The underlined portion of each statement is the expression that is acted upon:

Fig. 1.7 Genetic operators in GP

Gmutation :

Parent:\((+\;x\; \underline {y})\)

    

     Child:\( (+\;x\; \underline {z})\)

Ginversion :

Parent:\((\mbox{progn}\; \underline {(\mbox{incf}\; x)\; (\mbox{setq}\; x\; 2)}\; (\mbox{print}\; x))\)

    

     Child:\((\mbox{progn}\; \underline {(\mbox{setq}\;x\; 2)\; (\mbox{incf}\; x)}\; (\mbox{print}\; x))\)

Gcrossover :

Parent1:\((\mbox{progn}\; (\mbox{incf}\; x)\; \underline {(\mbox{setq} \;x\; 2)}\;(\mbox{setq}\; y\; x))\)

     Parent2:\((\mbox{progn}\; (\mbox{decf}\; x)\;(\mbox{setq}\; x\; (*\; \underline {(\mbox{sqrt}\; x)}\; x))\;(\mbox{print}\; x))\)

    

     Child1:\((\mbox{progn}\; (\mbox{incf}\; x)\; \underline {(\mbox{sqrt}\; x)}\;(\mbox{setq}\; y\; x))\)

     Child2:\((\mbox{progn}\; (\mbox{decf}\; x)\;(\mbox{setq}\; x\; (*\; \underline {(\mbox{setq} \;x\; 2)}\; x))\;(\mbox{print}\; x))\).

Table 1.1 summarizes how the programs were changed as a result of these operators. “progn” is a function that evaluates its arguments in the order of their presentation and returns the value of the final argument. The function “setq” sets the first argument to the evaluated value of the second argument. Examining this table, it is apparent that mutation causes a slight change in the action of a program, and that crossover replaces parts of the programs of both parents. The genetic operators thus produce child programs that inherit the characteristics of their parent programs.

Table 1.1 Program changes due to GP operators
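If S-expressions are held as nested Python lists, the operators above are straightforward to implement. The following is a minimal sketch of our own of Gcrossover, which swaps one randomly chosen subtree between two parents:

```python
import copy
import random

def subtree_paths(tree, path=()):
    # Enumerate index paths to every subtree; element 0 of a list is the node label.
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    for i in path[:-1]:
        tree = tree[i]
    tree[path[-1]] = new

def g_crossover(parent1, parent2):
    # Exchange randomly chosen subtrees between copies of the parents.
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    p1 = random.choice([p for p in subtree_paths(c1) if p])  # exclude the root
    p2 = random.choice([p for p in subtree_paths(c2) if p])
    s1, s2 = get_subtree(c1, p1), get_subtree(c2, p2)
    set_subtree(c1, p1, s2)
    set_subtree(c2, p2, s1)
    return c1, c2

children = g_crossover(["progn", ["incf", "x"], ["setq", "x", 2]],
                       ["progn", ["decf", "x"], ["print", "x"]])
```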

4.2 Cartesian Genetic Programming (CGP)

CGP [18] is a genetic programming (GP) technique proposed by Miller et al. It represents a tree structure as a feed-forward network in which all nodes are described in the genotype beforehand and the connections between them are optimized. This is intended to alleviate the problem of bloat, in which the tree structure grows too large as genetic operations are repeated. Furthermore, by reusing partial trees, the structure can be represented compactly.

The CGP network comprises three kinds of nodes: input nodes, intermediate nodes, and output nodes. Figure 1.8 shows a CGP configuration with n inputs, m outputs, and an r × c grid of intermediate nodes. Connections between nodes in the same column are not permitted, and CGP networks are restricted to being feed-forward.

Fig. 1.8 Example of a genotype and a phenotype

CGP uses a one-dimensional numeric string as the genotype. It describes the function type and connections of each intermediate node, and the connections of the output nodes. Normally, every function is given the maximum number of arguments as input and ignores any unused connections. As an example, consider the genotype for the CGP configuration shown in Fig. 1.9.

Fig. 1.9 Example of a genotype and a phenotype

The function symbol numbers 0, 1, 2, and 3 (underlined in the figure) correspond to addition, subtraction, multiplication, and division, respectively. The network corresponding to this genotype is shown in Fig. 1.9. For example, the inputs of the first intermediate node are input 0 and input 1, and its function is addition. Note that the output of the fifth node (0 4 4) is not used anywhere, making it an intron (i.e., a non-coding region).
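To make the decoding concrete, here is a minimal sketch of how a CGP genotype can be evaluated. This is our own construction, and the genotype at the bottom is a hypothetical example, not the one from Fig. 1.9:

```python
# Function set: 0 = addition, 1 = subtraction, 2 = multiplication, 3 = division.
FUNCTIONS = {0: lambda a, b: a + b,
             1: lambda a, b: a - b,
             2: lambda a, b: a * b,
             3: lambda a, b: a / b if b != 0 else 1.0}  # protected division

def evaluate_cgp(inputs, nodes, outputs):
    # Feed-forward evaluation: values[0..n-1] are the program inputs, and each
    # intermediate node (function, in1, in2) may only reference earlier indices.
    values = list(inputs)
    for func, in1, in2 in nodes:
        values.append(FUNCTIONS[func](values[in1], values[in2]))
    return [values[o] for o in outputs]  # output nodes name node indices

# Hypothetical genotype: node 2 = in0 + in1, node 3 = in0 * node2, output = node 3.
print(evaluate_cgp([2.0, 3.0], nodes=[(0, 0, 1), (2, 0, 2)], outputs=[3]))  # [10.0]
```

An intermediate node whose value is never read by any later node or output is exactly the intron situation described above.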

5 Ant Colony Optimization (ACO)

Ants march in a long line. There is food at one end, a nest at the other. This is a familiar scene in gardens and on roads, but the sophisticated distributed control by these small insects was recognized by humans only a few decades ago. Marching is a cooperative ant behavior that can be explained by the pheromone trail model (Fig. 1.10, [12]).

Fig. 1.10 Pheromone trails of ants. (a) The initial random search phase. (b) The nearer food sources at the lower right and lower left have been found, and pheromone trails have formed; the trail to the upper-left source is still forming. (c) Pheromone trails have formed to all three sources, which makes transport more efficient; the lower-right source is nearly exhausted. (d) The lower-right food source is depleted and its pheromone trail has dissipated; as a result, transport is concentrated vigorously on the two remaining sources on the left

Optimization algorithms based on the collective behavior of ants are called ant colony optimization (ACO) [3]. An ACO algorithm using the pheromone trail model optimizes a travel path for the traveling salesman problem (TSP) as follows:

Step 1: Ants are placed randomly in each city.

Step 2: Ants move to the next city. The destination is determined probabilistically, based on pheromone information and given conditions.

Step 3: Step 2 is repeated until all cities have been visited.

Step 4: Ants that complete a full tour secrete pheromone along their route, in an amount depending on the length of the route.

Step 5: Return to Step 1 if a satisfactory solution has not yet been obtained.

The ant colony optimization (ACO) algorithm can be outlined as follows. Take η_ij as the distance between cities i and j (Fig. 1.11). The probability \(p_{ij}^k(t)\) that ant k in city i will move to city j is determined by the reciprocal of the distance, 1∕η_ij, and the amount of pheromone τ_ij(t), as follows:

$$\displaystyle \begin{aligned} p_{ij}^k(t)=\frac{\tau_{ij}(t)\times \eta_{ij}^{-\alpha}}{\sum_{h\in J_i^k}\tau_{ih}(t)\times \eta_{ih}^{-\alpha}}. {} \end{aligned} $$
(1.2)

Here, \(J_i^k\) is the set of all cities that ant k in city i can still move to (i.e., has not visited). Because ants are more likely to select a route with more pheromone, the rule reflects positive feedback from past searches, while the distance term acts as a heuristic favoring shorter paths. ACO can thereby incorporate an appropriate amount of problem-specific knowledge.

Fig. 1.11 Path selection rules of ants

The pheromone table is updated by the following equations:

$$\displaystyle \begin{aligned} \begin{array}{rcl} Q(k) &\displaystyle =&\displaystyle \mbox{the reciprocal of the length of the path that ant}\;\; k\;\; \mbox{found}{} \end{array} \end{aligned} $$
(1.3)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \varDelta \tau_{ij}(t)&\displaystyle = &\displaystyle \sum_{k\in A_{ij}}Q(k){} \end{array} \end{aligned} $$
(1.4)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \tau_{ij}(t+1) &\displaystyle =&\displaystyle (1-\rho)\cdot\tau_{ij}(t)+\varDelta \tau_{ij}(t). {} \end{array} \end{aligned} $$
(1.5)

The amount of pheromone added to each path after one iteration is inversely proportional to the length of the path that each ant found (Eq. (1.3)). The contributions of all ants that moved through a path are reflected in that path (Eq. (1.4)); here, A_ij is the set of all ants that moved along the path from city i to city j. Negative feedback to avoid local solutions is provided by an evaporation coefficient (Eq. (1.5)), whereby the amount of pheromone on the paths, i.e., information from the past, is reduced by a fixed factor ρ.
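Putting Eqs. (1.2)–(1.5) together, a compact sketch of ACO for a symmetric TSP might look as follows. This is our own minimal rendition, with η_ij taken as the inter-city distance as in the text, and all parameter values merely illustrative:

```python
import random

def aco_tsp(dist, n_ants=20, n_iters=100, alpha=1.0, rho=0.5):
    # dist[i][j]: positive, symmetric distance between cities i and j.
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]              # pheromone table
    best_tour, best_len = None, float("inf")
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [random.randrange(n)]             # Step 1: random start city
            while len(tour) < n:                     # Steps 2-3: visit all cities
                i = tour[-1]
                unvisited = [j for j in range(n) if j not in tour]
                # Eq. (1.2): weight = tau_ij * eta_ij^(-alpha)
                weights = [tau[i][j] * dist[i][j] ** (-alpha) for j in unvisited]
                tour.append(random.choices(unvisited, weights=weights)[0])
            tours.append(tour)
        tau = [[(1 - rho) * t for t in row] for row in tau]  # Eq. (1.5): evaporation
        for tour in tours:                           # Step 4: pheromone deposit
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            q = 1.0 / length                         # Eq. (1.3)
            for k in range(n):                       # Eq. (1.4), both directions
                i, j = tour[k], tour[(k + 1) % n]
                tau[i][j] += q
                tau[j][i] += q
            if length < best_len:
                best_tour, best_len = tour, length
    return best_tour, best_len
```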

ACO is an effective method for solving the TSP compared to other search strategies. Like many meta-heuristics (high-level strategies that guide an underlying heuristic to increase its performance), it tends to be outperformed on static problems by specialized methods. For complicated problems, however, such as TSPs where the distances between cities are asymmetric or where the cities change dynamically, no established programs exist, and ACO is considered one of the most promising methods.

6 Particle Swarm Optimization (PSO)

Kennedy et al. designed an effective optimization algorithm using the mechanism behind the swarming of boids [16]. This is called particle swarm optimization (PSO), and numerous applications have been reported.

The classic PSO is intended for optimization problems. It simulates the motion of a large number of individuals (or “particles”) moving in a multi-dimensional space [16]. Each individual stores its location vector (x_i), its velocity vector (v_i), and the position at which it obtained its highest fitness value so far (p_i). All individuals also share information about the position with the highest fitness value found by the group (p_g).

As generations progress, the velocity of each individual is updated using the best overall location obtained up to the current time for the entire group and the best locations obtained up to the current time for that individual. This update is performed using the following formula:

$$\displaystyle \begin{aligned} \mathbf{v_i}=\chi \left(\omega \mathbf{v_i}+\phi_1\cdot (\mathbf{p_i}-\mathbf{x_i})+\phi_2\cdot (\mathbf{p_g}-\mathbf{x_i})\right). {} \end{aligned} $$
(1.6)

The coefficients employed here are the convergence coefficient χ (a random value between 0.9 and 1.0) and the attenuation coefficient ω, while ϕ_1 and ϕ_2 are random values unique to each individual and dimension, with a maximum value of 2. When the calculated velocity exceeds some limit, it is replaced by a maximum velocity \(V_{\max }\). This procedure keeps the individuals within the search region during the search.

The location of each individual is updated at each generation by the following formula:

$$\displaystyle \begin{aligned} \mathbf{x_i}=\mathbf{x_i}+\mathbf{v_i}. {} \end{aligned} $$
(1.25)

The overall flow of PSO is shown in Fig. 1.12. Let us now consider the specific movements of each individual (see Fig. 1.13). A flock consisting of a number of birds is assumed to be in flight, and we focus on one of the individuals (Step 1). In the figure, the ○ symbols and the line segments linking them indicate the positions and path of the bird. The nearby symbol (on its own path) indicates the position with the highest fitness value on that individual’s path (Step 2). The distant symbol (on the other bird’s path) marks the position with the highest fitness value for the flock (Step 2). One would expect the next state to be reached in the direction shown by the arrows in Step 3: vector ① is the direction followed in the previous steps; vector ② points toward the position with the highest fitness for the flock; and vector ③ points to the location where the individual obtained its own highest fitness value so far. These vectors ①, ②, and ③ in Step 3 are summed to obtain the actual direction of movement in the subsequent step (see Step 4).

Fig. 1.12 Flow chart of the PSO algorithm

Fig. 1.13 In which way do birds fly?
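In code, the PSO loop described above might look as follows. This is a minimal sketch of our own implementing Eqs. (1.6) and (1.25) for maximization; the objective function, search bounds, and parameter values are illustrative placeholders:

```python
import random

def pso(f, dim, n_particles=30, n_iters=200, lo=-5.0, hi=5.0,
        chi=0.95, omega=0.9, phi_max=2.0):
    # x: positions, v: velocities, p: personal bests, g: global best.
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [xi[:] for xi in x]
    g = max(p, key=f)[:]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                phi1 = random.uniform(0, phi_max)  # random per individual/dimension
                phi2 = random.uniform(0, phi_max)
                # Eq. (1.6): velocity update
                v[i][d] = chi * (omega * v[i][d]
                                 + phi1 * (p[i][d] - x[i][d])
                                 + phi2 * (g[d] - x[i][d]))
            x[i] = [xi + vi for xi, vi in zip(x[i], v[i])]  # Eq. (1.25)
            if f(x[i]) > f(p[i]):
                p[i] = x[i][:]
                if f(p[i]) > f(g):
                    g = p[i][:]
    return g

# Example: maximize -(x^2 + y^2); the optimum is at the origin.
best = pso(lambda z: -(z[0] ** 2 + z[1] ** 2), dim=2)
```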

The efficiency of this type of PSO search is certainly high, because focused searching is possible near optimal solutions in a relatively simple search space. However, the canonical PSO algorithm often gets trapped in local optima in multimodal problems. Some form of adaptation is therefore necessary to apply PSO to problems with multiple sharp peaks.

To overcome the above limitation, a GA-like mutation can be integrated with PSO [8]. In this hybrid PSO, individuals do not all simply move to their next positions as determined by the update rule; instead, selected individuals are perturbed, with a predetermined probability and independently of the other individuals, by a Gaussian mutation that leaves a certain ambiguity in the transition to the next generation. This technique employs the following equation:

$$\displaystyle \begin{aligned} mut(x)=x\times (1+gaussian(\sigma )), \end{aligned} $$
(1.7)

where σ is set to 0.1 times the length of the search space in each dimension. Individuals are selected with a predetermined probability, and their positions are perturbed according to the Gaussian distribution. Wide-ranging searches are possible in the initial stage, and search efficiency improves in the middle and final stages, by gradually reducing the rate at which Gaussian mutation is applied. Figure 1.14 shows the PSO search process with Gaussian mutation. In the figure, V_lbest represents the velocity based on the local best, i.e., p_i − x_i in Eq. (1.6), whereas V_gbest represents the velocity based on the global best, i.e., p_g − x_i.

Fig. 1.14 Concept of searching process by PSO with Gaussian mutation
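Equation (1.7) itself is a one-line operation. A minimal sketch, with σ supplied by the caller as described above:

```python
import random

def gaussian_mutation(x, sigma):
    # Eq. (1.7): perturb each coordinate by a multiplicative Gaussian factor.
    return [xi * (1.0 + random.gauss(0.0, sigma)) for xi in x]

# Applied only to individuals selected with a predetermined probability;
# sigma = 0.1 * (hi - lo) per dimension, as in the text.
```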

7 Artificial Bee Colony Optimization (ABC)

Bees, along with ants, are well-known examples of social insects. Bees are classified into three types:

  • employed bees

  • onlooker bees

  • scout bees

Employed bees fly in the vicinity of feeding sites they have identified, conveying information about the food to onlooker bees. Onlooker bees use this information to search selectively around the best food sources. When the information about a feeding site has not been updated for a given period, its employed bees abandon it and become scout bees that search for a new feeding site. The objective of the colony is to find the highest-rated feeding sites. Approximately half of the population are employed bees, scout bees make up about 10–15% of the total, and the rest are onlooker bees.

The waggle dance (a series of movements) performed by employed bees to transmit information to onlooker bees is well known (Fig. 1.15). The dance involves shaking the hindquarters while indicating the angle between the sun and the straight line to the food source, with the sun represented as straight up. For example, a waggle dance performed horizontally to the right with respect to the comb means “fly with the sun 90° to your left.” The speed of shaking the rear indicates the distance to the food: when the rear is shaken quickly, the food source is very near, and when shaken slowly, it is far away. Communication via similar dances is also performed for pollen and water collection, as well as for the selection of locations for new hives.

Fig. 1.15 Waggle dance

The artificial bee colony (ABC) algorithm [14, 15], initially proposed by Karaboga et al., is a swarm optimization algorithm that mimics the foraging behavior of honey bees. Since ABC was designed, it has been shown that, with fewer control parameters, it is very effective and competitive with other search techniques such as the genetic algorithm (GA), particle swarm optimization (PSO), and differential evolution (DE [9]).

In the ABC algorithm, the artificial swarm is divided into employed bees, onlooker bees, and scouts. N d-dimensional solution candidates are randomly initialized in the domain and referred to as food sources. Each employed bee is assigned to a specific food source x_i and searches for a new food source v_i by using the following operator:

$$\displaystyle \begin{aligned} {\mathbf{v}}_{ij}={\mathbf{x}}_{ij}+\mbox{rand}(-1,1)\times ({\mathbf{x}}_{ij}-{\mathbf{x}}_{kj}), {} \end{aligned} $$
(1.8)

where k ∈{1, 2, …, N}, k ≠ i, and j ∈{1, 2, …, d} are randomly chosen indices, and v_ij is the jth element of the vector v_i. If the trial food source falls outside the domain, it is reset to an acceptable value. The obtained v_i is then evaluated and put into competition with x_i for survival; the bee prefers the better food source. Unlike employed bees, each onlooker bee chooses a preferable source, according to the food sources’ fitness, for further searches in the food space using Eq. (1.8). This preference scheme is based on fitness feedback from the employed bees. In classic ABC [14], the probability that the food source x_i is exploited is expressed as

$$\displaystyle \begin{aligned} p_i=\frac{fit_i}{\sum_{j=1}^{N}fit_j}, \end{aligned} $$
(1.9)

where fit_i is the fitness of the ith food source x_i. For simplicity, we assume that the fitness value is non-negative and that larger is better. If the trial v_i is superior to x_i in terms of profitability, the onlooker bee informs the employed bee associated with the ith food source x_i, which renews its memory and forgets the old source. If a food source cannot be improved within a predetermined number of iterations, defined as Limit, it is abandoned. The bee that was exploiting this food site becomes a scout and associates itself with a new food site chosen by some principle. In canonical ABC [14], the scout looks for a new food site by random initialization.

The details of the ABC algorithm are described below. The pseudocode of the algorithm is shown in Algorithm 1.

Algorithm 1 The ABC algorithm

Step 0: Preparation :

The total number of search points (N), total number of trips (\(T_{\max }\)), and a scout control parameter (Limit) are initialized. The numbers of employed bees and onlooker bees are set to be the same as the total number of search points (N). The value of the objective function f is taken to be non-negative, with larger values being better.

Step 1: Initialization 1 :

The trip counter k is set to 1, and the number of search point updates s_i is set to 0. The initial position vector of each search point x_i = (x_{i1}, x_{i2}, …, x_{id})^T is assigned random values. Here, the subscript i (i = 1, …, N) is the index of the search point, and d is the number of dimensions of the search space.

Step 2: Initialization 2 :

Determine the initial best solution best.

$$\displaystyle \begin{aligned} \begin{array}{rcl} i_g&\displaystyle =&\displaystyle \mathop{\text{argmax}}_if ( {\mathbf{x}}_i ) \end{array} \end{aligned} $$
(1.10)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{best}&\displaystyle =&\displaystyle x_{i_g}. \end{array} \end{aligned} $$
(1.11)
Step 3: Employed bee search :

The following equation is used to calculate a new position v_ij from the current position x_ij:

$$\displaystyle \begin{aligned} {\mathbf{v}}_{ij}={\mathbf{x}}_{ij}+\phi \cdot({\mathbf{x}}_{ij}-{\mathbf{x}}_{kj}). \end{aligned} $$
(1.12)

Here, j is a randomly chosen dimension index, k is the index of a randomly chosen search point other than i, and ϕ is a uniform random number in the range [−1, 1]. The position vector x_i and the number of search point updates s_i are determined according to the following equations:

$$\displaystyle \begin{aligned} \begin{array}{rcl} I&=&\{i\mid f({\mathbf{x}}_i)<f({\mathbf{v}}_i)\} \end{array} \end{aligned} $$
(1.13)
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{x}}_i&=& \left\{ \begin{array}{ll} {\mathbf{v}}_i & i\in I\\ {\mathbf{x}}_i & i\notin I\\ \end{array} \right. \end{array} \end{aligned} $$
(1.14)
$$\displaystyle \begin{aligned} \begin{array}{rcl} s_i&=& \left\{ \begin{array}{ll} 0 & i\in I\\ s_i+1 & i\notin I.\\ \end{array} \right. \end{array} \end{aligned} $$
(1.15)
Step 4: Onlooker bee search :

The following two steps are performed.

  1. 1.

    Relative ranking of search points

    The relative probability P_i is calculated from the fitness fit_i, which is based on the evaluation score of each search point; note that fit_i = f(x_i). The onlooker bee search counter l is set to 1.

    $$\displaystyle \begin{aligned} \begin{array}{rcl} P_i&\displaystyle =&\displaystyle \frac{fit_i}{\sum_{j=1}^{N}fit_j}. \end{array} \end{aligned} $$
    (1.16)
  2. 2.

    Roulette selection and search point updating

    Search points are selected for updating based on the probability P_i calculated above. After search points have been selected, a procedure as in Step 3 is performed to update their position vectors. Then let l = l + 1, and repeat until l = N.

Step 5: Scout bee search :

Any search point for which s_i ≥ Limit is discarded, and a randomly generated new search point replaces it.

Step 6: Update best solution :

Update the best solution best.

$$\displaystyle \begin{aligned} \begin{array}{rcl} i_g &\displaystyle =&\displaystyle \mathop{\text{argmax}}_i f({\mathbf{x}}_i) \end{array} \end{aligned} $$
(1.17)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbf{best} &\displaystyle =&\displaystyle {\mathbf{x}}_{i_g}\;\; \mbox{when}\;\; f(x_{i_g}) >f (\mathbf{best}). \end{array} \end{aligned} $$
(1.18)
Step 7: End determination :

End if \(k = T_{\max }\). Otherwise, let k = k + 1 and return to Step 3.
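The steps above can be condensed into the following compact sketch, our own rendition of Algorithm 1. It assumes a non-negative objective f to be maximized; all parameter values are illustrative:

```python
import random

def abc(f, dim, n=20, t_max=200, limit=50, lo=-5.0, hi=5.0):
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    s = [0] * n                                    # update-failure counters
    best = max(x, key=f)[:]

    def try_neighbor(i):
        # Eq. (1.12): perturb one random dimension toward/away from a random peer.
        k = random.choice([a for a in range(n) if a != i])
        j = random.randrange(dim)
        v = x[i][:]
        v[j] += random.uniform(-1, 1) * (x[i][j] - x[k][j])
        v[j] = min(max(v[j], lo), hi)              # reset out-of-domain values
        if f(v) > f(x[i]):                         # greedy survival, Eqs. (1.13)-(1.15)
            x[i], s[i] = v, 0
        else:
            s[i] += 1

    for _ in range(t_max):
        for i in range(n):                         # Step 3: employed bees
            try_neighbor(i)
        fits = [f(xi) for xi in x]                 # Step 4: onlooker bees
        for _ in range(n):                         # roulette selection, Eq. (1.16)
            i = (random.choices(range(n), weights=fits)[0]
                 if sum(fits) > 0 else random.randrange(n))
            try_neighbor(i)
        for i in range(n):                         # Step 5: scout bees
            if s[i] >= limit:
                x[i] = [random.uniform(lo, hi) for _ in range(dim)]
                s[i] = 0
        cand = max(x, key=f)                       # Step 6: update best solution
        if f(cand) > f(best):
            best = cand[:]
    return best
```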

ABC has recently been improved in many respects. For instance, we analyzed the mechanism of ABC and showed a possible drawback of its use of parameter perturbation. To overcome this deficiency, we proposed a new non-separable operator and embedded it in the main framework of the bee-foraging cooperation mechanism (see [7] for details).

8 Firefly Algorithms

Fireflies glow by means of a luminous organ as they fly around. The glow serves to attract females. The light generated differs from individual to individual, and fireflies are considered to attract one another according to the rules described below:

  • The extent of attractiveness is in proportion to the luminosity.

  • Female fireflies are more strongly attracted by males that produce a strong glow.

  • Luminosity decreases as a function of distance.

The firefly algorithm (FA) is a search method based on the blinking of fireflies [22]. The algorithm does not distinguish gender; all fireflies are attracted to each other. Here, luminosity is determined by an objective function: for a minimization problem, fireflies with lower function values (better fitness) glow more strongly. The brightest firefly moves around at random.

Algorithm 2 Firefly algorithm

Algorithm 2 outlines the FA. The formula for moving a firefly i attracted by firefly j is as follows:

$$\displaystyle \begin{aligned} {\mathbf{x}}^{new}_i = {\mathbf{x}}^{old}_i +\beta_{i,j}\left(\mathbf{x_j}-{\mathbf{x}}^{old}_i \right) +\alpha\left(rand(0,1) -\frac{1}{2}\right), \end{aligned} $$
(1.19)

where rand(0, 1) is a uniform random number between 0 and 1, α is a parameter determining the magnitude of the random component, and β_{i,j} represents how attractive firefly j is to firefly i, i.e.,

$$\displaystyle \begin{aligned} \beta_{i,j}=\beta_0e^{-\gamma r^2_{i,j}}. \end{aligned} $$
(1.20)

The constant β_0 represents the attractiveness at r_{i,j} = 0, i.e., when the two fireflies are at the same position. Since r_{i,j} is the Euclidean distance between fireflies i and j, attractiveness varies with the distance between them.

The brightest firefly moves around at random, according to the following formula:

$$\displaystyle \begin{aligned} \mathbf{x_k}(t+1) =\mathbf{x_k}(t) +\alpha \left(rand(0,1) -\frac{1}{2}\right). \end{aligned} $$
(1.21)

The reason for this random movement is that, without it, the entire population would converge on the best solution of the initial allocation, which may be only locally optimal.

As the distance becomes greater, the attractiveness becomes weaker. Under the firefly algorithm, therefore, distant fireflies form separate groups instead of all gathering at one spot.

The firefly algorithm is suited to multimodal optimization problems and is considered to yield better results than PSO on them. One extension separates the fireflies into two groups and limits interactions to fireflies in the same group; this enables global and local solutions to be searched simultaneously.
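A minimal sketch of the FA loop of Algorithm 2, in our own rendition implementing Eqs. (1.19)–(1.21) for minimization (parameter values are illustrative), follows:

```python
import math
import random

def firefly(f, dim, n=25, n_iters=100, alpha=0.2, beta0=1.0, gamma=1.0,
            lo=-5.0, hi=5.0):
    # Minimizes f; fireflies with lower objective values glow more strongly.
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    for _ in range(n_iters):
        glow = [f(xi) for xi in x]
        for i in range(n):
            for j in range(n):
                if glow[j] < glow[i]:              # j glows more strongly than i
                    r2 = sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
                    beta = beta0 * math.exp(-gamma * r2)       # Eq. (1.20)
                    # Eq. (1.19): move i toward j plus a small random step
                    x[i] = [a + beta * (b - a) + alpha * (random.random() - 0.5)
                            for a, b in zip(x[i], x[j])]
            glow[i] = f(x[i])
        k = min(range(n), key=lambda i: glow[i])   # the brightest firefly
        # Eq. (1.21): the brightest firefly performs a pure random walk
        x[k] = [a + alpha * (random.random() - 0.5) for a in x[k]]
    return min(x, key=f)
```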

9 Cuckoo Search

Cuckoo search (CS) [23] is a meta-heuristic based on brood parasitism, an animal behavior in which an animal depends on members of another species to incubate its eggs (or induces them to do so). Some species of cuckoos are well known to exhibit this behavior, leaving their eggs in the nests of birds of other species, such as great reed warblers, Siberian meadow buntings, bull-headed shrikes, and azure-winged magpies. Before leaving, they demonstrate an interesting behavior referred to as egg mimicry: they remove one egg already in the host bird’s (foster parent’s) nest and lay an egg that mimics the remaining eggs, thus keeping the numbers balanced. This is because a host bird discards an egg when it determines that the egg is not its own.

A cuckoo chick has a remarkably large, bright bill and is therefore fed excessively by its foster parent; this is referred to as a “supernormal stimulus.” Furthermore, there is an exposed patch of skin at the back of its wings with the same color as its bill. When the foster parent carries food, the chick spreads its wings to show this patch, and the parent mistakes it for its own chicks. The parent thus believes it has more chicks to feed than it actually has and carries more food to the nest. This is considered an evolutionary strategy that lets cuckoos be fed in proportion to their size, because a grown cuckoo is many times larger than its host.

CS models the cuckoos’ brood parasitic behavior based on the three rules described below:

Algorithm 3 Cuckoo search

  • A cuckoo lays one egg at a time and leaves it in a randomly selected nest.

  • The highest-quality eggs (those least likely to be noticed by the host bird) are carried over to the next generation.

  • The number of nests is fixed, and a parasitized egg is noticed by a host bird with a certain probability. In this case, the host bird either discards the egg or rebuilds the nest.

Algorithm 3 shows the CS algorithm. In this algorithm, a cuckoo lays a new egg in a randomly selected nest, moving according to a Lévy flight. A Lévy flight is mostly a short-distance, irregular random walk, but it occasionally exhibits long-distance movements. This movement pattern has been identified in several animals and insects and is considered able to represent the stochastic fluctuations observed in various natural and physical phenomena, such as flight patterns and feeding behaviors.

Specifically, the Lévy distribution is represented by the following probability density function, illustrated in Fig. 1.16:

$$\displaystyle \begin{aligned} f(x;\mu,\sigma)=\begin{cases} \sqrt{\frac{\sigma}{2\pi}} \exp\Bigl[-\frac{\sigma}{2(x-\mu)}\Bigr](x-\mu)^{-3/2} & (\mu< x),\\ 0 & (\mbox{otherwise}), \end{cases} \end{aligned} $$
(1.22)

where μ is a location parameter and σ is a scale parameter. Under this distribution, a Lévy flight mostly makes short-distance movements, while with a certain probability it also makes long-distance movements. For optimization, this facilitates a more effective search than a random walk (Gaussian flight) following a normal distribution [23].

Fig. 1.16 Lévy distribution

Let us consider an objective function f(x), x = (x_1, …, x_d)^T. A cuckoo then creates a new solution candidate for nest i by the following equation:

$$\displaystyle \begin{aligned} {\mathbf{x}}_i^{(t+1)}={\mathbf{x}}_i^{(t)}+\alpha \otimes \mbox{Lévy}(\lambda), {} \end{aligned} $$
(1.23)

where α(> 0) is related to the scale of the problem. In most cases, α = 1. The operation ⊗ represents multiplication of each element by α. Lévy(λ) represents a random number vector whereby each element follows a Lévy distribution, and this is accomplished as follows:

$$\displaystyle \begin{aligned} \mbox{Lévy}(\lambda)\sim u=\mbox{rand}(0,1)^{-1/\lambda}, {} \end{aligned} $$
(1.24)

where rand(0, 1) is a uniform random number between 0 and 1. This formula is essentially a random walk whose step lengths follow a heavy-tailed power-law distribution; as a result, the distribution has an infinite mean and infinite standard deviation. A power-law exponent between −1 and −3 is normally used for the long-distance movements of a Lévy flight.

The parameter p_a is referred to as the switching probability: this fraction of the worst nests is abandoned by the host birds, and new nests are built at new locations by performing Lévy flights. This probability strikes a balance between exploration and exploitation.

CS is considered robust compared with PSO and ACO [1].
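A minimal sketch of Algorithm 3 follows. The Lévy step uses a simple power of a uniform random number, in the spirit of Eq. (1.24); Mantegna's algorithm is a common, more careful alternative. All parameter values are illustrative:

```python
import random

def levy_step(lam=1.5):
    # Heavy-tailed step length from a power of a uniform random number.
    u = 1.0 - random.random()                      # in (0, 1], avoids division by zero
    return random.uniform(-1, 1) * u ** (-1.0 / lam)

def cuckoo_search(f, dim, n_nests=15, n_iters=200, pa=0.25, alpha=1.0,
                  lo=-5.0, hi=5.0):
    # Minimizes f; each nest holds one solution (egg).
    clip = lambda v: min(max(v, lo), hi)
    nests = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_nests)]
    for _ in range(n_iters):
        i = random.randrange(n_nests)
        # Eq. (1.23): a cuckoo generates a new egg by a Levy flight from nest i
        new = [clip(v + alpha * levy_step()) for v in nests[i]]
        j = random.randrange(n_nests)              # compare with a random nest
        if f(new) < f(nests[j]):
            nests[j] = new
        nests.sort(key=f)                          # best nests first
        # A fraction pa of the worst nests is abandoned and rebuilt at random
        for k in range(int((1 - pa) * n_nests), n_nests):
            nests[k] = [random.uniform(lo, hi) for _ in range(dim)]
    return min(nests, key=f)
```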

10 Harmony Search (HS)

Harmony search (HS) [4] is a meta-heuristic inspired by the jazz session, i.e., the process by which humans generate improvisation. Musicians are considered to improvise mainly using one of the methods outlined below:

  • Use already-known scales (stored in their memory).

  • Partially change or modify already-known scales: play scales adjacent to one stored in their memory.

  • Create new scales: play random scales within their playable range.

The process whereby musicians combine various scales in their memory when composing can be regarded as a kind of optimization. While many meta-heuristics are based on the swarm intelligence of living creatures such as fish and insects, HS differs significantly in that it exploits ideas from the musical process of searching for harmony according to an aesthetic standard.

The harmony search algorithm (referred to as “HS” hereinafter) searches for the optimum solution by imitating the musician’s process, according to the following three rules:

  • Select an arbitrary value from HS memory.

  • Select a value next to the arbitrary one from HS memory.

  • Select a random value within a selectable range.

In HS, a solution candidate vector is referred to as a harmony, and the set of solution candidates is referred to as the harmony memory (HM). Solution candidates in HM are replaced in a specific order. This process is repeated a certain number of times (or until the termination conditions are met), and finally the best harmony among those surviving in HM is selected as the final solution.

Algorithm 4 shows the harmony search algorithm. HMCR (Harmony Memory Considering Rate) is the probability of selecting a harmony from HM, while PAR (Pitch Adjust Rate) is the probability of amending a harmony selected from HM. HMS is the number of harmonies in HM, normally set between 50 and 100.

A new solution candidate (harmony) is generated from HM based on HMCR, which is the probability of selecting component elements from the present HM; thus, new elements are generated at random with probability 1 − HMCR. Subsequently, mutation is applied with probability PAR, where the bw parameter (bandwidth) gives the largest size of the mutation. If the newly generated solution candidate (harmony) is better than the poorest solution in HM, it replaces that solution.

This method is similar to a genetic algorithm (GA); however, it differs in that all members of HM are parent candidates in HS, whereas in GA only one or two chromosomes (parent individuals) are used to generate a child chromosome.

Algorithm 4 Harmony search
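A minimal sketch of Algorithm 4, in our own rendition for minimization (parameter values are illustrative), follows:

```python
import random

def harmony_search(f, dim, hms=50, hmcr=0.9, par=0.3, bw=0.1,
                   n_iters=2000, lo=-5.0, hi=5.0):
    # HM holds `hms` harmonies (candidate solution vectors); f is minimized.
    hm = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(hms)]
    for _ in range(n_iters):
        new = []
        for j in range(dim):
            if random.random() < hmcr:             # take an element from memory
                v = hm[random.randrange(hms)][j]
                if random.random() < par:          # pitch adjustment (mutation)
                    v += random.uniform(-bw, bw)
            else:                                  # random element, prob. 1 - HMCR
                v = random.uniform(lo, hi)
            new.append(min(max(v, lo), hi))
        worst = max(range(hms), key=lambda i: f(hm[i]))
        if f(new) < f(hm[worst]):                  # replace the poorest harmony
            hm[worst] = new
    return min(hm, key=f)
```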


11 Conclusion

This chapter introduced several methods of evolutionary computation and meta-heuristics used in deep neural evolution.

To conclude, we describe some criticisms of meta-heuristics and discuss them further.

Meta-heuristics frequently use unusual names, nature-related terms, and metaphors. For example, harmony search uses the following terms:

  • harmony,

  • pitch, note,

  • sounds better.

However, these are merely other ways of saying the following:

  • solution,

  • decision variable,

  • has a better objective function value.

Although critics insist that these replaced words cause confusion [19, 20], the use of metaphorical expressions does not in itself impair understanding. For example, David Hilbert is quoted as saying that geometry would still work even if point, line, and plane were called table, chair, and beer mug; in precise discussion, the names do not matter at all once the objects are axiomatically defined. Nevertheless, we should be careful about insisting on the novelty of meta-heuristics, and it is important to recognize clearly how they differ from existing methods. Weyland, for example, has criticized harmony search as simply a special case of the evolution strategy (μ + 1)-ES [21].

It is a genuine problem when researchers who propose meta-heuristics do not recognize other similar methods [19]. When developing a new method, we should never fail to ground the discussion in past investigations (see [11] for further discussion).