1 Introduction

Interest in bilevel optimization has been growing due to a number of new applications arising in different fields of science and engineering. Bilevel programming is quite common in the area of defense, where these problems are studied as attacker-defender problems. The problem was introduced by Bracken and McGill (1973) in the area of mathematical programming, where an inner optimization problem acts as a constraint to an outer optimization problem. One of the follow-up papers by Bracken and McGill (1974) highlighted the applications of bilevel programming in defense. Since then, a number of studies on homeland security (Brown et al. 2005; Wein 2009; An et al. 2013) have been performed, where it is common to have bilevel, trilevel and even multilevel optimization models. In the area of operations research, bilevel optimization is gaining importance in the context of interdiction and protection of hub-and-spoke networks (Lei 2013), as most critical infrastructures, like transportation and communications, are predominantly hub-and-spoke. In other game theoretic settings, bilevel optimization has been used for transportation (Migdalas 1995; Constantin and Florian 1995; Brotcorne et al. 2001), optimal tax policies (Labbé et al. 1998; Sinha et al. 2013, 2015), investigation of strategic behavior in deregulated markets (Hu and Ralph 2007), modeling of production processes (Nicholls 1995) and optimization of retail channel structures (Williams et al. 2011). The applications extend to a variety of other domains, such as facility location (Jin and Feng 2007; Uno et al. 2008; Sun et al. 2008), chemical engineering (Smith and Missen 1982; Clark and Westerberg 1990), structural optimization (Bendsoe 1995; Christiansen et al. 2001), and optimal control problems (Mombaur et al. 2010; Albrecht et al. 2011).
While new applications that are inherently bilevel in nature are arising at a fast pace, the development of computationally efficient algorithms for such problems has not kept pace.

A significant body of work on bilevel optimization and its optimality conditions (Lignola and Morgan 2001; Dempe 2002; Dempe et al. 2007, 2014; Wiesemann et al. 2013) exists in the classical optimization literature. However, on the algorithmic front, most attention has been given to simple instances of bilevel optimization where the objective functions and constraints are linear (Wen and Hsu 1991; Ben-Ayed 1993), quadratic (Bard and Moore 1990; Edmunds and Bard 1991; Al-Khayyal et al. 1992) or convex (Liu et al. 1998). This is not surprising, given that bilevel optimization is so difficult that merely evaluating the bilevel optimality of a given solution is an NP-hard task (Vicente et al. 1994). Researchers have also attempted to solve these problems using computational techniques like evolutionary algorithms. Most of the bilevel algorithms relying on an evolutionary framework have been nested in nature (Mathieu et al. 1994; Yin 2000; Li and Wang 2007; Zhu et al. 2006; Sinha et al. 2014; Islam et al. 2017b, a). One of the drawbacks of such an approach is that it might be able to solve small instances of bilevel problems, but as soon as the problem scales up beyond a few variables, the computational requirements increase tremendously. However, evolutionary algorithms still have a niche in solving these problems, as they maintain a population at each iteration of the algorithm. A population of points may allow modeling various mappings in bilevel optimization to reduce the computational expense (Sinha et al. 2016a). Some studies in this direction are (Sinha et al. 2016b, 2017, 2013, 2014). We believe that exploiting some of the mathematical properties of bilevel problems through modeling of these mappings is the way forward in solving such problems. For a detailed review on bilevel optimization, the readers may refer to Sinha et al. (2018), Dempe (2002), and Bard (1998).

In this paper, we focus on two important mappings in bilevel optimization borrowed from the mathematical optimization literature. The first mapping is the lower level reaction set mapping (known as the \(\Psi \)-mapping), which provides the lower level optimal solution(s) corresponding to any given upper level vector. Considering the upper level problem as the leader’s problem and the lower level problem as the follower’s problem, the reaction set mapping represents the rational decisions of the follower corresponding to any decision taken by the leader. The second mapping is the lower level value function mapping (known as the \(\varphi \)-mapping), which provides the optimal objective function value of the follower’s problem for any given leader’s decision. While the first mapping can be set-valued, the second mapping is always single-valued. We work with meta-modeling techniques that approximate these two mappings, and develop a computationally efficient evolutionary algorithm for solving bilevel problems. The algorithm has been tested on a number of test problems, and the computational gain compared with other techniques is found to be significant. In this paper, we also extend an existing test-suite of bilevel test problems (Sinha et al. 2014) with a couple of additional problems to better evaluate our proposed solution procedure.

The paper is organized as follows. To begin with, we provide a brief literature survey of bilevel optimization using evolutionary algorithms. This is followed by various formulations of the bilevel optimization problem and discussion of the two mappings that we approximate in this paper. Thereafter, we provide the bilevel evolutionary optimization algorithm which is an extension of the algorithm proposed in the previous studies (Sinha et al. 2017, 2013, 2014). Following this, we provide the empirical results on a number of test problems. A comparative study with other approaches is also included. Finally, we end the paper with the conclusions section.

2 A survey on evolutionary bilevel optimization

Most of the evolutionary algorithms for bilevel optimization are nested in nature, where one optimization algorithm is used within another. The outer algorithm handles the upper level problem and the inner algorithm handles the lower level problem. Such a structure necessitates that the inner algorithm is called for every upper level point generated by the outer algorithm. Therefore, nested approaches can be quite computationally demanding and can only be applied to small scale problems. One can find studies with an evolutionary algorithm used for the upper level problem and a classical approach used for the lower level problem. If the lower level problem is complex, researchers have used evolutionary algorithms at both levels. Below we provide a review of past evolutionary bilevel optimization algorithms.

Mathieu et al. (1994) were among the first to propose a bilevel algorithm using evolutionary computation. They used a genetic algorithm to handle the upper level problem and linear programming to solve the lower level problem for every upper level member generated using genetic operations. This study was followed by nesting the Frank-Wolfe algorithm (reduced gradient method) within a genetic algorithm in Yin (2000). Other authors utilized similar nested schemes in Li et al. (2006), Li and Wang (2007), and Zhu et al. (2006). Studies involving evolutionary algorithms at both levels include (Angelo et al. 2013; Angelo and Barbosa 2015), where the authors used differential evolution at both levels in the first study, and differential evolution within ant colony optimization in the second study.

Replacing the lower level problem in bilevel optimization with its KKT conditions is a common approach for solving the problem in both the classical and the evolutionary computation literature. However, a KKT based reduction can only be applied to problems where the lower level is convex and adheres to certain regularity conditions (Mirrlees 1999). Some of the past evolutionary studies that utilize this idea include (Hejazi et al. 2002; Wang et al. 2005). The approach has remained popular: even recently, researchers have relied on reducing the bilevel problem into a single level problem using the KKT conditions and solving the reduced problem with an evolutionary algorithm; for example, see Wang et al. (2011), Jiang et al. (2013), Li (2015), and Wan et al. (2013).

While KKT conditions can only be applied to problems where the lower level adheres to certain mathematically simplifying assumptions, researchers are exploring techniques that can solve more general instances of bilevel optimization problems. Some of the approaches are based on meta-modeling the mappings within bilevel optimization, while others may be based on meta-modeling the entire bilevel problem itself. Studies in this direction include (Sinha et al. 2017, 2013, 2014). In this paper, we aim to develop an algorithm that captures two important mappings in bilevel optimization, namely the lower level reaction set mapping and the lower level value function mapping, in order to reduce the computational expense of solving the problem.

3 Different bilevel formulations

We will start this section by providing a general formulation for bilevel optimization. This is followed by various proposals that researchers have made for reducing a bilevel problem into a single-level problem. The two levels in a bilevel problem are also known as the leader’s (upper) and follower’s (lower) problems in the domain of game theory. In general, the variables, objectives and constraints are different for the two levels. The upper level variables are treated as parameters while optimizing the lower level problem. A general bilevel formulation has been provided below (for brevity, we ignore equality constraints):

Definition 1

For the upper-level objective function \(F:\mathbb {R}^n\times \mathbb {R}^m \rightarrow \mathbb {R}\), lower-level objective function \(f:\mathbb {R}^n\times \mathbb {R}^m \rightarrow \mathbb {R}\), upper level variable \(x_u\in \mathbb {R}^n\) and lower level variable \(x_l\in \mathbb {R}^m\), the bilevel optimization problem is given by

$$\begin{aligned} \mathop {\mathrm{min}}\limits _{x_u,x_l} \quad&F(x_u,x_l)\\ \text {s.t.} \quad&x_l\in \mathop {\mathrm{argmin}}\limits _{x_l}\{f(x_u,x_l) : g_j(x_u,x_l)\le 0,\ j=1,\ldots ,J\},\\&G_k(x_u,x_l)\le 0,\ k=1,\ldots ,K, \end{aligned}$$

where \(G_k:\mathbb {R}^n\times \mathbb {R}^m \rightarrow \mathbb {R}\), \(k=1,\ldots ,K\), denote the upper level constraints, and \(g_j:\mathbb {R}^n\times \mathbb {R}^m \rightarrow \mathbb {R}\), \(j=1,\ldots ,J\), denote the lower level constraints.

There are two common positions that a user assumes while solving a bilevel optimization problem, namely, the optimistic and the pessimistic position. The bilevel formulation in Definition 1 is straightforward whenever there is a single optimal solution to the lower level problem for any given upper level vector. However, when some upper level vectors admit more than one lower level optimal solution, one has to specify which of these optimal solutions is considered the response of the follower. Optimizing bilevel problems from either the optimistic or the pessimistic position resolves the ambiguity arising from multiple lower level optimal solutions. In the optimistic position, it is assumed that the lower level chooses the optimal solution that is most favorable at the upper level. In the pessimistic position, the upper level optimizes its problem according to the worst case scenario, i.e. the lower level may choose the solution from the optimal set that is least favorable at the upper level. In this paper, we assume the optimistic position while solving bilevel optimization problems.

When certain mathematically simplifying assumptions like continuity and convexity are satisfied, the lower level optimization task in Definition 1 is often replaced with its KKT conditions. However, the reduced formulation is not simple to handle, as it induces non-convexities and discreteness into the problem through the complementary slackness conditions. We do not utilize the KKT based reduction in this paper; rather, we focus on two different formulations in the development of our evolutionary algorithm.

3.1 Lower level reaction set mapping

The formulation provided in Definition 1 can also be stated as follows:

Definition 2

Let \(\Psi :\mathbb {R}^n\rightrightarrows \mathbb {R}^m\) be the reaction set mapping,

$$\begin{aligned} \Psi (x_u)=&\mathop {\mathrm{argmin}}\limits _{x_l}\{f(x_u,x_l) : g_j(x_u,x_l)\le 0, j=1,\ldots ,J\}, \end{aligned}$$

which acts as the constraint defined by the lower-level optimization problem, i.e. it gives the set of lower level optimal solutions \(\Psi (x_u)\) for every \(x_u\). The following then gives an alternative formulation for the bilevel optimization problem:

$$\begin{aligned} \mathop {\mathrm{min}}\limits _{x_u,x_l} \quad&F(x_u,x_l)\\ \text {s.t.} \quad&x_l\in \Psi (x_u),\\&G_k(x_u,x_l)\le 0,\ k=1,\ldots ,K. \end{aligned}$$

Using the above definition, a bilevel problem can be reduced to a single level constrained problem, provided that the \(\Psi \)-mapping can somehow be determined. Unfortunately, this is rarely the case. Studies in the evolutionary computation literature that rely on iterative approximation of this mapping to reduce the number of lower level optimization calls can be found in Sinha et al. (2017), Sinha et al. (2013), and Sinha et al. (2014). To illustrate the idea, let us consider Fig. 1. To acquire sufficient data for constructing the \(\Psi \)-mapping approximation, a few lower level problems need to be optimized completely for their corresponding upper level decision vectors in the beginning. For instance, the lower level decisions for the upper level decisions a, b, c, d, e and f are determined by optimizing the lower level problem, and are then used to locally approximate the \(\Psi \)-mapping, as shown in Fig. 1. Even though the actual \(\Psi \)-mapping is still unknown, the local approximation can be substituted to identify the lower level optimal decision for every new upper level member, thereby avoiding the lower level optimization task. This procedure of approximating the mapping and utilizing it to predict the lower level optimum is repeated iteratively until convergence to the bilevel optimum. The idea works well when the \(\Psi \)-mapping is single-valued. If the lower level has multiple optimal solutions for some upper level members, as shown in Fig. 2, then identifying as well as approximating the mapping is not a straightforward task.
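The workflow just described can be sketched in a few lines. The snippet below is illustrative only: we assume a toy lower level problem \(f(x_u,x_l)=(x_l-x_u^2)^2\), whose reaction set mapping \(\Psi (x_u)=x_u^2\) is single-valued, and use an off-the-shelf bounded solver in place of SQP.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy lower level problem (assumed for illustration): f(x_u, x_l) = (x_l - x_u**2)**2,
# whose reaction set mapping is the single-valued Psi(x_u) = x_u**2.
def solve_lower_level(xu):
    res = minimize_scalar(lambda xl: (xl - xu**2)**2, bounds=(-10, 10), method='bounded')
    return res.x

# Optimize the lower level completely for a handful of upper level decisions
# (the points a, b, c, ... of Fig. 1) ...
xu_samples = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
xl_samples = np.array([solve_lower_level(xu) for xu in xu_samples])

# ... and fit a local quadratic model of the Psi-mapping to those samples.
psi_hat = np.poly1d(np.polyfit(xu_samples, xl_samples, deg=2))

# The approximation can now predict the lower level optimum for a new
# upper level member without a fresh lower level optimization call.
print(abs(psi_hat(1.5) - 1.5**2) < 1e-3)  # prints True
```

In the actual algorithm the fitted model is only trusted locally and is refitted as new completely optimized lower level solutions become available.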

Fig. 1

Solving the lower level optimization problem completely for a, b, c, d, e and f provides the corresponding lower level optimal members \(\Psi (a), \Psi (b), \Psi (c), \Psi (d), \Psi (e)\) and \(\Psi (f)\), where the \(\Psi \)-mapping is assumed to be single-valued. Such a mapping can be approximated

Fig. 2

A scenario where the \(\Psi \)-mapping is set-valued in some regions and single-valued in others. If the \(\Psi \)-mapping is set-valued, then identifying as well as approximating the mapping is not a straightforward task

3.2 Lower level optimal value function mapping

Another formulation for the bilevel optimization problem in Definition 1 can be written using the optimal lower level value function (Ye and Zhu 2010; Outrata 1988, 1990):

Definition 3

Let \(\varphi : \mathbb {R}^n \rightarrow \mathbb {R}\) be the lower level optimal value function mapping,

$$\begin{aligned} \varphi (x_u)=\mathop {\mathrm{min}}\limits _{x_l} \{f(x_u,x_l): g_j(x_u,x_l)\le 0, j=1,\ldots ,J \}, \end{aligned}$$

which represents the optimal function value at the lower level for any given upper level decision vector. Using this lower level optimal value function, the bilevel optimization problem can be expressed as:

$$\begin{aligned} \mathop {\mathrm{min}}\limits _{x_u,x_l} \quad&F(x_u,x_l)\\ \text {s.t.} \quad&f(x_u,x_l)\le \varphi (x_u),\\&g_j(x_u,x_l)\le 0,\ j=1,\ldots ,J,\\&G_k(x_u,x_l)\le 0,\ k=1,\ldots ,K. \end{aligned}$$

Note that the constraint \(f(x_u,x_l) \le \varphi (x_u)\) in the above definition requires the value of the lower level function \(f(x_u,x_l)\) to be less than or equal to the optimal lower level function value \(\varphi (x_u)\) corresponding to any \(x_u\). This, along with the lower level constraints, ensures that the above definition incorporates the lower level optimality requirements.

As in the case of the \(\Psi \)-mapping, if the \(\varphi \)-mapping can somehow be determined, a bilevel problem can be reduced to the single level problem described in Definition 3. During the run of an algorithm, the \(\varphi \)-mapping can be approximated and used to solve the reduced single level problem in an iterative manner. Such an evolutionary algorithm has recently been discussed in Sinha et al. (2016b). The approximation of the optimal value function (\(\varphi \)) mapping is, in general, less complicated than that of the reaction set (\(\Psi \)) mapping, in the sense that the \(\varphi \)-mapping is always scalar-valued regardless of the lower level variable dimension and regardless of whether there exist multiple lower level optimal solutions (Fig. 3). However, the \(\varphi \)-mapping based reduction is not necessarily better than the \(\Psi \)-mapping based reduction. Definition 3 requires the problem to be solved with respect to both the upper and the lower level variables, while in Definition 2 the lower level variables are readily available from the \(\Psi \)-mapping. The \(\Psi \)-mapping based reduction also contains fewer constraints. Therefore, there is clearly a trade-off.
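To make the reduction in Definition 3 concrete, the sketch below solves the \(\varphi \)-based single level problem for an assumed toy bilevel instance in which the value function is known exactly; in the actual algorithm \(\varphi \) would be replaced by its quadratic approximation.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative bilevel problem (assumed): F = (xu-1)^2 + (xl+1)^2, with
# lower level min f = xl subject to xl >= xu, so phi(xu) = xu and the
# follower effectively plays xl = xu.
F = lambda z: (z[0] - 1)**2 + (z[1] + 1)**2
f = lambda z: z[1]
g = lambda z: z[1] - z[0]        # lower level constraint g >= 0  <=>  xl >= xu
phi = lambda xu: xu              # exact value function; an algorithm would fit a model

# Definition 3 reduction: minimize F over (xu, xl)
# subject to f(xu, xl) <= phi(xu) and the lower level constraints.
res = minimize(F, x0=[2.0, 3.0], method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda z: phi(z[0]) - f(z)},
                            {'type': 'ineq', 'fun': g}])
print(np.round(res.x, 3))  # approximately [0, 0], the optimum along xl = xu
```

Note that both upper and lower level variables are optimization variables here, in contrast to the \(\Psi \)-based reduction.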

Fig. 3

An example showing a \(\varphi \)-mapping

It is noteworthy that the lower level optimization problem is a parametric optimization problem that is solved with respect to the lower level variables, while the upper level variables act as parameters. Therefore, for bilevel problems with mathematically well behaved objective functions and constraints, it is possible to utilize ideas from studies on sensitivity analysis and parametric optimization to identify the mappings in bilevel optimization. Whenever such a mapping can be directly obtained using the parametric optimization tools, the bilevel problem can be readily reduced to a single level problem and standard mathematical programming algorithms can be applied. For related work, the readers may refer to Jittorntrum (1984), Fiacco and McCormick (1990), and Ralph and Dempe (1995).

Table 1 Standard test problems TP1–TP5. (Note that \(x = x_u\) and \(y = x_l\))
Table 2 Standard test problems TP6–TP8. (Note that \(x = x_u\) and \(y = x_l\))

4 Evaluating the performance of \(\Psi \) and \(\varphi \) mappings on test problems

In this section, we implement the \(\Psi \) and \(\varphi \) mappings separately in two different nested algorithms to evaluate the advantages and disadvantages of using each mapping as a local search. For evaluating the two mappings, we choose the set of simple test problems provided in Tables 1 and 2. First, we create a nested algorithm that utilizes an evolutionary approach for solving the upper level problem and sequential quadratic programming (SQP) for solving the lower level problem. SQP is a natural choice at the lower level because most of the lower level problems in the considered test cases are convex. We then enhance the nested approach by allowing it to approximate the \(\Psi \) and \(\varphi \) mappings, and measure the performance gain provided by each mapping separately. The implementation of the approaches is outlined in Fig. 4. The flowchart without the overlapping box provides the steps involved in the nested approach. When the idea involving the \(\Psi \) and \(\varphi \) mappings is used, the local search (shown in the overlapping box) is conducted every k generations of the nested algorithm after the update step. A detailed description of the nested algorithm is provided below.

1. Create a random population of size N comprising upper level variables.

2. Solve the lower level optimization problem using SQP for each upper level variable.

3. Evaluate the fitness of each population member using the upper level function and constraints (refer to Sect. 6.2).

4. Choose \(2\mu \) population members using tournament selection and apply genetic operators (refer to Sect. 6.3) to produce \(\lambda \) offspring.

5. Solve the lower level optimization problem using SQP for each offspring.

6. Evaluate the fitness of each offspring using the upper level function and constraints.

7. Form a pool consisting of \(\rho +\lambda \) members, where \(\rho \) members are chosen randomly from the population and \(\lambda \) members are the offspring. Use the best \(\rho \) members from this pool to replace the \(\rho \) members chosen from the population.

8. Perform a termination check (refer to Sect. 6.5) and proceed to Step 4 if the termination check is false; otherwise stop.

The parameters used in the implementation of the above procedure are \(N=50, \mu =2, \lambda =3\) and \(\rho =2\).
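The eight steps above can be sketched as a minimal nested loop. The toy problem, the smaller population parameters and the crude recombination operator below are illustrative stand-ins for the genetic operators of Sect. 6.3.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Toy bilevel problem (assumed for illustration): the upper level minimizes
# F = xu^2 + xl^2 while the follower plays Psi(xu) = -xu, obtained from
# the lower level objective f = (xl + xu)^2. Bilevel optimum: xu = 0.
def lower_level(xu):
    return minimize_scalar(lambda xl: (xl + xu)**2,
                           bounds=(-5, 5), method='bounded').x

def upper_F(xu, xl):
    return xu**2 + xl**2

# Steps 1-3: random upper level population, lower level solved per member.
N, mu, lam = 20, 2, 3
pop = rng.uniform(-5, 5, N)
fit = np.array([upper_F(xu, lower_level(xu)) for xu in pop])

for gen in range(30):
    # Step 4: tournament selection of 2*mu parents, then a crude
    # crossover/mutation stand-in for the genetic operators.
    idx = rng.integers(0, N, (2 * mu, 2))
    parents = pop[np.where(fit[idx[:, 0]] < fit[idx[:, 1]], idx[:, 0], idx[:, 1])]
    kids = parents.mean() + 0.5 * rng.standard_normal(lam)
    # Steps 5-6: one lower level optimization call and fitness per offspring.
    kid_fit = np.array([upper_F(xu, lower_level(xu)) for xu in kids])
    # Step 7 (simplified): an offspring replaces a random member if better.
    for xu, fx in zip(kids, kid_fit):
        j = rng.integers(N)
        if fx < fit[j]:
            pop[j], fit[j] = xu, fx

print(round(pop[fit.argmin()], 2))  # near the bilevel optimum xu = 0
```

The lower level call inside the loop is exactly what the \(\Psi \) or \(\varphi \) based local search is designed to avoid.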

Fig. 4

Nested approach with evolutionary algorithm at upper level (UL) and SQP at lower level (LL). Local search based on \(\Psi \) or \(\varphi \) mapping may be performed to make the nested approach faster

4.1 Approximating the \(\Psi \)-mapping

Let \(\mathcal {H}\) be the hypothesis space. The hypothesis space consists of all functions that can be used to generate a mapping between the upper level decision vectors and optimal lower level decision vectors. Given a sample consisting of upper level points and corresponding optimal lower level points, we would like to identify a model \(\hat{\Psi }\in \mathcal {H}\) that minimizes the empirical error on the sample, i.e.

$$\begin{aligned} \hat{\Psi }=\mathop {\mathrm{argmin}}\limits _{h \in \mathcal {H}}\sum _{i\in \mathcal {I}} L(h(x_{u}^{(i)}),\bar{x}_{l}^{(i)}), \end{aligned}$$
(1)

where \(L:\mathbb {R}^m\times \mathbb {R}^m\rightarrow \mathbb {R}\) denotes the prediction error, \(x_{u}^{(i)}\) is any given upper level vector and \(\bar{x}_{l}^{(i)}\) is its corresponding optimal solution. The prediction error may be calculated as follows:

$$\begin{aligned} L(h(x_{u}^{(i)}),\bar{x}_{l}^{(i)})=|\bar{x}_{l}^{(i)}-h(x_{u}^{(i)})|^2. \end{aligned}$$

We restrict the hypothesis space \(\mathcal {H}\) to second-order polynomials, which reduces the error minimization problem to an ordinary quadratic regression problem. Since we are approximating a vector-valued mapping, multiple scalar-valued quadratic functions may be used to create the approximate mapping. The sample can be created from the population members or an archive. It should be noted that this approach can approximate only a single-valued mapping and will fail if the mapping becomes set-valued.
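A minimal sketch of the regression in Eq. (1), under the assumption of a noiseless sample and one scalar quadratic model per lower level variable:

```python
import numpy as np

def quadratic_features(X):
    """Second-order polynomial features [1, x_i, x_i*x_j] of upper level vectors."""
    n = X.shape[1]
    cols = [np.ones(len(X))] + [X[:, i] for i in range(n)]
    cols += [X[:, i] * X[:, j] for i in range(n) for j in range(i, n)]
    return np.column_stack(cols)

def fit_psi(Xu, Xl):
    """Least-squares fit of Eq. (1): one scalar quadratic model per
    lower level variable, stacked into a vector-valued approximation."""
    W, *_ = np.linalg.lstsq(quadratic_features(Xu), Xl, rcond=None)
    return lambda xu: quadratic_features(np.atleast_2d(xu)) @ W

# Hypothetical sample: Psi(xu) = (xu1^2, xu1*xu2) is recovered exactly
# because it lies inside the second-order hypothesis space.
rng = np.random.default_rng(1)
Xu = rng.uniform(-2, 2, (30, 2))
Xl = np.column_stack([Xu[:, 0]**2, Xu[:, 0] * Xu[:, 1]])
psi_hat = fit_psi(Xu, Xl)
print(np.round(psi_hat([1.0, 2.0]), 3))  # close to [1.0, 2.0]
```

When the true mapping lies outside the hypothesis space, the fit is only a local surrogate and its residual error can be used to judge its quality.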

4.2 Approximating the \(\varphi \)-mapping

Once again, let \(\mathcal {H}\) be the hypothesis space of functions. Given a sample of upper level points and the corresponding optimal lower level function values, our aim is to identify a model \(\hat{\varphi }\in \mathcal {H}\) that minimizes the empirical error on the sample, i.e.

$$\begin{aligned} \hat{\varphi }=\mathop {\mathrm{argmin}}\limits _{u \in \mathcal {H}}\sum _{i\in \mathcal {I}} L(u(x_{u}^{(i)}),\bar{f}^{(i)}), \end{aligned}$$
(2)

where \(L:\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) denotes the prediction error, \(x_{u}^{(i)}\) is any given upper level vector and \(\bar{f}^{(i)}\) is its corresponding optimal function value. The prediction error can once again be computed as follows:

$$\begin{aligned} L(u(x_{u}^{(i)}),\bar{f}^{(i)})=|\bar{f}^{(i)}-u(x_{u}^{(i)})|^2. \end{aligned}$$

We have once again restricted the hypothesis space \(\mathcal {H}\) to second-order polynomials. Since the \(\varphi \)-mapping is always single-valued, approximating it does not involve the issues encountered with the \(\Psi \)-mapping.

5 Comparison results for \(\Psi \)- versus \(\varphi \)-approximations

To compare the \(\Psi \)-approximation approach against the \(\varphi \)-approximation approach, we use the set of 8 test problems from the literature given in Tables 1 and 2. Table 3 compares the median function evaluations at both levels for three algorithms: \(\Psi \)-approximation, \(\varphi \)-approximation and the nested algorithm. The results have been produced from 31 runs of each algorithm, and further details about the runs can be found in Figs. 5 and 6. The \(\Psi \)-approximation and \(\varphi \)-approximation approaches perform comparably, and both outperform the nested approach in this study. The differences in the performance of the \(\Psi \)-approximation and \(\varphi \)-approximation can be attributed to differences in the quality of the approximations produced during the intermediate steps of the algorithm. In Table 4, we compare the meta-modeling results with other evolutionary approaches (Wang et al. 2005, 2011) to provide an idea of the extent of savings that meta-modeling techniques can produce. The advantage is quite clear, as the savings amount to multiple orders of magnitude on the set of test problems considered in this study.

Fig. 5

Error plot from 31 runs for the upper level function evaluations on test problems 1 to 8

Fig. 6

Error plot from 31 runs for the lower level function evaluations on test problems 1 to 8

Table 3 Median function evaluations required at upper level (UL) and the lower level (LL) from 31 runs of \(\Psi \)-approximation algorithm, \(\varphi \)-approximation algorithm and nested algorithm
Table 4 Mean of total function evaluations (UL evaluations +LL evaluations) required by different approaches
Table 5 Statistics for upper level function evaluations for \(\varphi \)-approximation algorithm on the modified test problems (m-TP)

It should be noted that the \(\Psi \)-approximation idea would fail if the \(\Psi \)-mapping in bilevel optimization is set-valued. Next, we test this hypothesis by modifying the 8 test problems such that each test problem necessarily has a set-valued \(\Psi \)-mapping. To achieve this, we add two additional lower level variables (\(y_p\) and \(y_q\)) to each test problem. Both the upper and lower level functions are modified as shown below:

$$\begin{aligned} F^{new}(x_{u},x_{l})&= F(x_{u},x_{l})+y_{p}^{2}+y_{q}^{2}\\ f^{new}(x_{u},x_{l})&= f(x_{u},x_{l})+(y_{p} - y_{q})^{2}\\ y_{p}, y_{q}&\in [-1,1] \end{aligned}$$

The modification gives the lower level problem infinitely many optimal solutions (all points with \(y_{p} = y_{q}\)) for any given upper level vector. Out of these many optimal solutions, the upper level prefers the solution where \(y_{p} = y_{q} = 0\). After this simple modification, we once again solve the test problems using the \(\varphi \)-approximation and \(\Psi \)-approximation approaches. As shown in Tables 5 and 6, the \(\varphi \)-approximation algorithm still works, but the \(\Psi \)-approximation algorithm completely fails. The function evaluations for the \(\varphi \)-approximation algorithm increase slightly compared to before because of the additional variables in the problem.
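A quick numerical check of the construction, using a few hypothetical values of \(y_p, y_q\):

```python
# The modification of this section: the added lower level term (yp - yq)^2
# is minimized by any yp = yq, so the lower level optimum is set-valued,
# while the added upper level term yp^2 + yq^2 singles out yp = yq = 0.
f_extra = lambda yp, yq: (yp - yq)**2      # added to the lower level objective
F_extra = lambda yp, yq: yp**2 + yq**2     # added to the upper level objective

ties = [(-1.0, -1.0), (-0.3, -0.3), (0.0, 0.0), (0.7, 0.7)]
print([f_extra(yp, yq) for yp, yq in ties])  # all tied at the lower level optimum 0.0
best = min(ties, key=lambda t: F_extra(*t))
print(best)  # (0.0, 0.0): the optimistic choice preferred at the upper level
```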

Therefore, the \(\varphi \)-approximation idea clearly has an advantage over the \(\Psi \)-approximation idea. Moreover, the \(\varphi \)-mapping is always a scalar-valued mapping, whereas the \(\Psi \)-mapping is usually vector-valued and can also be set-valued. However, there is a trade-off. The reduced single level problem formed using the \(\Psi \)-mapping is usually a little easier to handle than the single level problem formed using the \(\varphi \)-mapping. The reason is that in the case of the \(\Psi \)-mapping the lower level variables are readily available, and the reduced problem does not involve lower level constraints. For the \(\varphi \)-mapping, the reduced problem has to be solved with respect to both the upper and lower level variables, and the formulation involves both upper and lower level constraints. Given the pros and cons of the two mappings, we would next like to develop an evolutionary algorithm that is capable of utilizing the better of the two mappings while solving a bilevel optimization problem.

Table 6 Statistics for lower level function evaluations from 31 runs of the \(\varphi \)-approximation algorithm on the modified test problems (m-TP)

6 Bilevel evolutionary algorithm based on \(\Psi \) and \(\varphi \)-mapping approximations

In this section, we present the bilevel evolutionary algorithm that approximates both the \(\Psi \) and the \(\varphi \) mapping during the intermediate steps of the algorithm. From the previous experiments and the properties of the two mappings, we infer that there can be situations where the approximation of the \(\Psi \)-mapping fails, while whenever the \(\Psi \)-mapping can be approximated, it offers the advantage of completely ignoring the lower level functions and constraints. Acknowledging this fact, we utilize both approximations in our algorithm. The algorithm adaptively decides to use one of the mappings based on the quality of fit obtained when approximating the two mappings. Local quadratic approximations are created for the two mappings from a sample of points in the vicinity of the point around which we want to approximate. Introducing local approximations is expected to improve the quality of the approximations significantly. Given a sample dataset, the steps for creating an approximation are the same as discussed in Sects. 4.1 and 4.2. The algorithm also maintains an archive so as to have a large dataset for creating and validating the approximations. Deviating from the nested algorithm, we employ the approximated \(\Psi \) and \(\varphi \) mappings to avoid frequent lower level optimization calls. An earlier version of the algorithm (Sinha et al. 2017, 2013, 2014) that relied on the \(\Psi \)-mapping approximation alone was referred to as the Bilevel Evolutionary Algorithm based on Quadratic Approximations (BLEAQ). We keep the same terminology and refer to the newer version of the algorithm as BLEAQ-II. The pseudocode for the algorithm is provided in Table 7.
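The adaptive choice between the two mappings can be illustrated as follows. The acceptance rule below (hold out a few sample points and keep the model with the smaller validation error) is a simplified stand-in for the quality-of-fit test, and the toy data, which mimic a set-valued \(\Psi \)-mapping alongside a smooth \(\varphi \)-mapping, are assumptions for illustration.

```python
import numpy as np

def fit_quadratic(x, y):
    """1-D second-order least-squares model, as in Sects. 4.1 and 4.2."""
    return np.poly1d(np.polyfit(x, y, deg=2))

def pick_mapping(xu, xl_opt, f_opt, holdout=5):
    """Fit both local quadratic models on a training split and keep the one
    with the smaller validation error (the exact BLEAQ-II rule may differ)."""
    tr, va = slice(None, -holdout), slice(-holdout, None)
    psi_hat = fit_quadratic(xu[tr], xl_opt[tr])
    phi_hat = fit_quadratic(xu[tr], f_opt[tr])
    psi_err = np.mean((psi_hat(xu[va]) - xl_opt[va])**2)
    phi_err = np.mean((phi_hat(xu[va]) - f_opt[va])**2)
    return ('psi', psi_hat) if psi_err <= phi_err else ('phi', phi_hat)

# Hypothetical sample: the recorded "optimal" xl jumps around (erratic
# selections from a set-valued Psi), while the optimal value stays smooth.
xu = np.linspace(-2, 2, 25)
xl_opt = np.sign(np.sin(9 * xu))   # poorly approximable by a quadratic
f_opt = xu**2                      # smooth, single-valued phi
tag, model = pick_mapping(xu, xl_opt, f_opt)
print(tag)  # 'phi'
```

Here the \(\varphi \)-model wins because the pseudo-optimal lower level decisions are not a function of the upper level vector, whereas the optimal value is.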

Table 7 Step-by-step procedure for BLEAQ-II

The genetic algorithm used in this study derives its ideas from Deb et al. (2002) and Sinha et al. (2006). In Deb et al. (2002), the authors developed a steady state genetic algorithm with elite preservation, which was shown to solve non-linear unconstrained optimization problems with few function evaluations and a high level of accuracy. Later on, the idea was extended to constrained optimization in Sinha et al. (2006). The genetic algorithm in the current paper does not give a high preference to the top ranked members in the population for recombination, thus allowing exploration. It chooses random members from the population and performs tournament selection to identify parents. The offspring produced from the genetic operations are compared against a small pool of random members from the population and enter the population only if they beat one or more members from the pool. Therefore, the chosen genetic algorithm is elitist, which is necessary to ensure infinite time convergence (Rudolph 1994), and at the same time is not too exploitative, as the worst members of the population may not get eliminated immediately.

The user is free to replace the genetic algorithm used in this paper with any other evolutionary algorithm and can still solve the bilevel optimization problem by relying on the approximation of the mappings.

6.1 Initialization

The initialization in the algorithm is done by creating random upper level members \(x_u^{(1)},\ldots ,x_u^{(N)}\), and then solving the lower level optimization problem for each member to get the optimal \(x_l^{(1)},\ldots ,x_l^{(N)}\). There can be situations where finding a random feasible upper and lower level pair that satisfies both the lower and upper level constraints is difficult. In such situations, one can solve the following problem to create \((x_u^{(i)},x_l^{(i)})\) pairs that satisfy all the constraints to begin with.

The above problem can be solved using any standard procedure, like a greedy GA or SQP with a random starting point, to arrive at a feasible solution. As soon as a feasible member is found, the method stops. Solving the above problem repeatedly with a random population (in the case of a GA) or a random starting point (in the case of SQP) provides the starting population of upper level members \(x_u^{(1)},\ldots ,x_u^{(N)}\) for the BLEAQ-II algorithm. For this given set of upper level members, we know that at least one feasible lower level member exists, but we still need to solve the lower level problem to find the optimal lower level solutions \(x_l^{(1)},\ldots ,x_l^{(N)}\).

6.2 Constraint handling and fitness assignment

The proposed approach always assigns higher fitness to a feasible member over an infeasible member. For two given members, \((x_u^{(i)},x_l^{(i)})\) and \((x_u^{(j)},x_l^{(j)})\), if both are feasible with respect to the constraints, it compares their function values. If both are infeasible, it compares their overall constraint violations. This fitness assignment scheme is similar to the one proposed in Deb (2000). At the lower level, the idea can be implemented directly using the lower level constraints and the lower level function value. At the upper level, for any given upper and lower level pair, we only consider the upper level function and constraints, without considering whether the corresponding lower level vector is optimal. Whether the lower level vector corresponding to an upper level vector is optimal is recorded using a tag (0 or 1).
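
The pairwise comparison rules can be written down directly. The sketch below assumes minimization and inequality constraints expressed as \(g(x) \le 0\); the function names are illustrative, not from the paper.

```python
def constraint_violation(g_values):
    """Overall constraint violation for constraints of the form g(x) <= 0:
    the sum of the positive parts of the constraint values."""
    return sum(max(0.0, g) for g in g_values)

def better(f1, g1, f2, g2):
    """Pairwise comparison in the spirit of Deb (2000), minimization assumed:
    feasible beats infeasible; two feasible members compare on objective
    value; two infeasible members compare on overall constraint violation.
    Returns True if member 1 is preferred."""
    cv1, cv2 = constraint_violation(g1), constraint_violation(g2)
    if cv1 == 0.0 and cv2 == 0.0:      # both feasible: lower objective wins
        return f1 <= f2
    if cv1 == 0.0 or cv2 == 0.0:       # exactly one feasible: it wins
        return cv1 == 0.0
    return cv1 <= cv2                  # both infeasible: smaller violation wins
```

At the upper level, the same comparison would be applied to the upper level objective and constraints of a pair, with optimality of the lower level vector tracked separately via the 0/1 tag.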

6.3 Genetic operations

Offspring are produced in the BLEAQ-II approach using standard crossover and mutation operators. Genetic operations at the upper level involve only upper level variables, and operations at the lower level involve only lower level variables. We utilize parent-centric crossover (PCX) and polynomial mutation to generate offspring. The crossover operator is similar to the PCX operator proposed in Sinha et al. (2006); it uses three parents and produces an offspring around the index parent, as described below.

$$\begin{aligned} c = z^{(p)} + \omega _{\xi }d + \omega _{\eta }\frac{p^{(2)}-p^{(1)}}{2} \end{aligned}$$
(3)

where,

  • \(z^{(p)}\) is the index parent (the best parent among three parents)

  • \(d=z^{(p)}-g\), where g is the mean of \(\mu \) parents

  • \(p^{(1)}\) and \(p^{(2)}\) are the other two parents

  • \(\omega _{\xi }=0.1\) and \(\omega _{\eta }=0.1\) are the two parameters.
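
A direct transcription of Eq. (3) is given below. This is a sketch only: variants of PCX multiply \(\omega _{\xi }\) and \(\omega _{\eta }\) by random perturbations and add orthogonal components, which are not reproduced here.

```python
import numpy as np

def pcx_offspring(parents, best_idx, w_xi=0.1, w_eta=0.1):
    """Offspring around the index parent, following Eq. (3):
        c = z_p + w_xi * d + w_eta * (p2 - p1) / 2,
    where d = z_p - g and g is the centroid of the mu parents.
    The deterministic form of Eq. (3) is kept; stochastic variants
    draw random multipliers for the two terms."""
    parents = np.asarray(parents, dtype=float)
    z_p = parents[best_idx]                        # index (best) parent
    g = parents.mean(axis=0)                       # centroid of the parents
    d = z_p - g
    p1, p2 = np.delete(parents, best_idx, axis=0)  # the other two parents
    return z_p + w_xi * d + w_eta * (p2 - p1) / 2.0
```

For example, with parents \((0,0)\), \((2,0)\), \((0,2)\) and the first as index parent, the offspring is displaced slightly toward the centroid direction and along the difference of the other two parents.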

6.4 Approximation of mappings

For the quadratic approximation of the mappings around a point \(x^{(j)}=(x^{(j)}_u,x^{(j)}_l) \in \mathcal {P}_{\text {off}}\), we use its neighboring members in the archive \(\mathcal {A}\). Since we want a local approximation, we choose the members closest to \(x^{(j)}\) in terms of Euclidean distance to create the mappings \(q_{\Psi }\) and \(q_{\varphi }\). A full quadratic model in n dimensions has \(\frac{(n+1)(n+2)}{2}\) coefficients and hence requires at least that many points; therefore, we use \(\frac{(n+1)(n+2)}{2}+n\) points to create the approximation.
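
A local quadratic model of this kind can be fit by ordinary least squares over the \(\frac{(n+1)(n+2)}{2}\) monomial features, as sketched below. The function names are illustrative; the paper's actual fitting routine may differ.

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_features(x):
    """Feature vector [1, x_i, x_i*x_j (i <= j)] of a full quadratic model:
    (n+1)(n+2)/2 terms in n dimensions."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    quad = [x[i] * x[j] for i, j in combinations_with_replacement(range(n), 2)]
    return np.concatenate(([1.0], x, quad))

def fit_quadratic(points, values):
    """Least-squares quadratic model through the chosen neighbouring points
    (e.g. the archive members closest to x^(j) in Euclidean distance).
    Returns a callable approximation q(x)."""
    A = np.vstack([quadratic_features(p) for p in points])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(values, dtype=float), rcond=None)
    return lambda x: float(quadratic_features(x) @ coeffs)
```

Using a few more than the minimum number of points, as the paper does, makes the least-squares system overdetermined and the fit more robust to noise in the archived lower level optima.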

6.5 Termination criteria

A variance-based termination criterion is used at both levels; other termination criteria, such as termination based on lack of improvement, may also be used. Variance-based termination allows the algorithm to terminate automatically when the variance of the population becomes small. At the upper level, the variance of the population at generation T is computed as follows:

$$\begin{aligned} \alpha _u^{T} = \frac{\sum _{i=1}^{n} \sigma ^2(x_{i})|_{T}}{\sum _{i=1}^{n} \sigma ^2(x_{i})|_{0}}, \end{aligned}$$
(4)

When the value of \(\alpha _{u}^{T}\) at any generation T falls below the parameter \(\alpha _{u}^{stop}\), the algorithm terminates. In the above equation, n is the number of upper level variables, \(\sigma ^2(x_{i})|_{T}\) is the variance along dimension i at generation T, and \(\sigma ^2(x_{i})|_{0}\) is the variance along dimension i in the initial population. A similar termination scheme with parameter \(\alpha _{l}^{stop}\) is used when the evolutionary algorithm is executed at the lower level.
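
Equation (4) amounts to the following check, shown here as a short sketch (function names are ours):

```python
import numpy as np

def variance_ratio(population, initial_population):
    """Eq. (4): per-dimension population variance at the current generation,
    normalised by the corresponding variance of the initial population,
    summed over all dimensions."""
    v_T = np.var(np.asarray(population, dtype=float), axis=0)
    v_0 = np.var(np.asarray(initial_population, dtype=float), axis=0)
    return float(v_T.sum() / v_0.sum())

def should_terminate(population, initial_population, alpha_stop=1e-5):
    """Terminate once the normalised variance drops below alpha_stop."""
    return variance_ratio(population, initial_population) < alpha_stop
```

Normalising by the initial variance makes the threshold \(\alpha ^{stop}\) scale-free, so the same value (\(10^{-5}\) in this paper) can be used across problems with very different variable ranges.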

6.6 Lower level optimization

At the lower level, we utilize SQP if the problem is convex; otherwise, we use the lower level evolutionary algorithm described in Table 8, which uses the same genetic operations as the upper level.

Table 8 The lower level evolutionary algorithm, which takes an upper level member as input and solves the corresponding lower level problem

6.7 Offspring update

For an offspring \(x^{(j)}=(x^{(j)}_u,x^{(j)}_l)\), the lower level vector \(x^{(j)}_l\) is updated using either the \(\Psi \)-approximation or the \(\varphi \)-approximation. An update using the \(\Psi \)-approximation is straightforward. However, an update using the \(\varphi \)-approximation requires solving the following auxiliary optimization problem, in which \(x_u\) is fixed at \(x^{(j)}_u\) and the problem is solved only with respect to \(x_l\). The optimal \(x_l\) replaces the lower level vector \(x^{(j)}_l\) of the offspring.

In the above formulation, we use hats on all the functions and constraints, as the auxiliary problem is solved on approximated functions and constraints. We use linear approximations for all the constraints, while quadratic approximations are used for the other functions. The auxiliary problem may have to be solved frequently if the lower level problem contains multiple optimal solutions; solving it with approximated functions saves actual function evaluations. Note that, in the ideal case, the auxiliary problem leads to an optimistic lower level solution corresponding to the fixed \(x^{(j)}_u\).
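
One way such an auxiliary problem could be set up is sketched below, using a \(\varphi \)-based single level reduction with \(x_u\) fixed: minimize the approximated upper level objective over \(x_l\), subject to the approximated lower level objective being no worse than the approximated optimal value \(\hat{\varphi }(x_u)\) and the approximated constraints holding. All function names here (`F_hat`, `f_hat`, `phi_hat`, `g_hat_list`) stand for the approximations and are our own notation; the paper's exact formulation may differ.

```python
from scipy.optimize import minimize

def solve_auxiliary(xu, F_hat, f_hat, g_hat_list, phi_hat, xl0):
    """Sketch of a varphi-based auxiliary problem with x_u fixed at xu:
        min_{x_l}  F_hat(xu, x_l)
        s.t.       f_hat(xu, x_l) >= phi_hat(xu)   (value-function condition)
                   g_hat_j(xu, x_l) <= 0           (approximated constraints)
    Solved on the cheap approximated functions, saving true evaluations."""
    cons = [{'type': 'ineq', 'fun': lambda xl: f_hat(xu, xl) - phi_hat(xu)}]
    cons += [{'type': 'ineq', 'fun': lambda xl, g=g: -g(xu, xl)}
             for g in g_hat_list]
    res = minimize(lambda xl: F_hat(xu, xl), xl0,
                   constraints=cons, method='SLSQP')
    return res.x
```

Because every callable here is an approximation, the solve costs no true function evaluations; only the accepted \(x_l\) is eventually evaluated exactly.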

6.8 Local search

The algorithm performs a local search every k generations. The local search is performed by meta-modeling the upper and lower level functions and constraints, along with the \(\Psi \)- and \(\varphi \)-mappings, in the vicinity of the best member in the population. Once the \(\Psi \)- and \(\varphi \)-mappings are available, the quality of the two mappings is assessed by the mean square error of the approximations (i.e., \(e_{mse}^{\Psi }\) and \(e_{mse}^{\varphi }\)). The better mapping and the corresponding single level reduction (described in Sects. 3.1 and 3.2), with approximated functions, is solved using SQP to arrive at \(x_{u}^{(LS)}\). A lower level optimization corresponding to \(x_{u}^{(LS)}\) is then solved, and if the resulting member is better than \(x_{best}^{(j)}\), then \(x_{best}^{(j)}\) is updated. If the member is not better than the best member found so far, the next local search is performed using the exact upper/lower level objective functions and constraints.
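
The adaptive choice between the two mappings reduces to comparing their prediction errors on archived members, for instance as follows (a sketch with our own function names):

```python
import numpy as np

def mse(predicted, actual):
    """Mean-square error of a mapping's predicted lower level optima
    against the archived (actual) lower level optima."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.sum((predicted - actual) ** 2, axis=-1)))

def choose_mapping(psi_pred, phi_pred, actual):
    """Select whichever mapping reproduces the archived lower level optima
    more accurately (lower e_mse), before running the local search on the
    corresponding single level reduction."""
    return 'psi' if mse(psi_pred, actual) <= mse(phi_pred, actual) else 'phi'
```

This is precisely the mechanism that lets BLEAQ-II lean on the \(\varphi \)-approximation for problems like SMD1 and on the \(\Psi \)-approximation for problems like SMD13, as discussed in Sect. 7.2.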

6.9 Parameters and platform

The algorithm has been implemented in MATLAB. At the upper and lower level, the parameters used in the algorithm are:

  1. \(\mu =3\)

  2. \(\lambda =2\)

  3. \(\rho =2\)

  4. Probability of crossover = 0.9

  5. Probability of mutation = 0.1

  6. \(N = 50\) (Population size at upper level)

  7. \(n = 50\) (Population size at lower level)

  8. \(\alpha _{u}^{stop} = \alpha _{l}^{stop} = 10^{-5}\) (Termination parameter)

  9. \(Q = N/2\) (Minimum number of tag 1 members in the population)

7 Results

In this study, we consider three algorithms: the nested approach described in Fig. 4, BLEAQ (Sinha et al. 2013, 2014, 2017), and our proposed BLEAQ-II. To assess the performance of each algorithm, 31 runs were performed for each test instance. In every run, the algorithms are terminated when an objective function accuracy of \(10^{-2}\) from the bilevel optimum is achieved at both levels. For each run, the upper and lower level function evaluations required until termination are recorded separately. It is noteworthy that in bilevel optimization the algorithms can diverge and move away from the optimum even after finding it; this usually happens when the lower level is not solved accurately for a given upper level vector. Therefore, we have to ensure that every lower level solution is very close to the true optimum. A strict variance-based termination criterion (\(\alpha _{l}^{stop} = 10^{-5}\)) is used for all lower level runs, which ensures that the lower level is always close to the optimum.

To allay the concerns about the divergence of the algorithms after finding the bilevel optimum we have also done some additional runs, where the algorithms terminate only based on the variance-based termination criterion at both levels (\(\alpha _{u}^{stop} = \alpha _{l}^{stop} = 10^{-5}\)) without any knowledge of the true optimum. Appendix D provides these additional results for all the approaches on all the test problems with the variance-based termination criterion.

7.1 Standard test problems

We first present empirical results on 8 standard test problems selected from the literature (referred to as TP1-TP8). Descriptions of these test problems are provided in Appendix A. Table 9 contains the median upper level (UL) function evaluations, lower level (LL) function evaluations, and BLEAQ-II's overall function evaluation savings compared to the other approaches over 31 runs of the algorithms. The overall function evaluations for any algorithm are simply the sum of the upper and lower level function evaluations. For instance, for the median run on TP1, BLEAQ-II requires \(63\%\) fewer overall function evaluations than BLEAQ, and \(98\%\) fewer than the nested approach.

Table 9 Median function evaluations on TP test suite. While computing savings, we compare the total function evaluations (sum of upper and lower level function evaluations) of one algorithm against the other

All these test problems are bilevel problems with a small number of variables, and all three algorithms were able to solve the 8 test instances successfully. A significant computational saving can be observed for both BLEAQ-II and BLEAQ compared to the nested approach, as shown in the Savings column of Table 9. The performance gain from BLEAQ to BLEAQ-II is quite significant for these simple test problems, even though none of them leads to multiple lower level optimal solutions. A detailed comparison between BLEAQ and BLEAQ-II in terms of upper and lower level function evaluations is provided in Figs. 7 and 8.

7.2 Scalable test problems

Next, we compare the results of the three algorithms on the scalable SMD test suite, which contains 12 test problems in the original paper (Sinha et al. 2014). We extend this test suite to a set of 14 test problems by adding two additional scalable test problems, described in Appendix B. First, we analyze the performance of the algorithms on a smaller version of the test problems consisting of 5 variables, and then we provide comparison results on 10-variable instances of the SMD test problems. For the 5-variable version, we used the settings \(p=1\), \(q=2\) and \(r=1\) for all SMD problems except SMD6 and SMD14, for which we used \(p=1\), \(q=0\), \(r=1\) and \(s=2\). For the 10-variable version, we used the settings \(p=3\), \(q=3\) and \(r=2\) for all SMD problems except SMD6 and SMD14, for which we used \(p=3\), \(q=1\), \(r=2\) and \(s=2\).

Fig. 7
figure 7

Bar chart (31 runs/samples) for the upper level function evaluations required for TP 1 to 8

Fig. 8
figure 8

Bar chart (31 runs/samples) for the lower level function evaluations required for TP 1 to 8

Table 10 provides the median function evaluations and overall savings for the three algorithms on the set of 14 SMD problems. These test problems contain 2 variables at the upper level and 3 variables at the lower level and offer a variety of tunable complexities. For instance, the test set contains problems that are multimodal at the upper and lower levels, contain multiple optimal solutions at the lower level, or contain constraints at the upper and/or lower levels. BLEAQ-II is able to solve the entire set of 14 SMD test problems, while BLEAQ fails on 2 of them. The overall savings with BLEAQ-II are higher than with BLEAQ on all the test problems. The test problems containing multiple lower level solutions are SMD6 and SMD14, which BLEAQ is unable to handle. Further details about the overall function evaluations required over the 31 runs can be found in Fig. 9.

Table 10 Median function evaluations on low dimension SMD test suite
Fig. 9
figure 9

Bar chart for overall function evaluations for SMD 1–14

Table 11 Median function evaluations on high dimension SMD test suite

Results for the high dimensional SMD test problems are provided in Table 11. BLEAQ-II leads to much higher savings than BLEAQ, and at higher dimensions BLEAQ once again fails on SMD6, and also on SMD7 and SMD8. Both methods outperform the nested approach on most of the test problems. We do not provide results for SMD9 to SMD14, as none of the algorithms was able to handle these problems. It is noteworthy that SMD9 to SMD14 offer difficulties such as multi-modalities and highly constrained regions, which none of the algorithms could handle with the parameter settings used in this paper. Details for the 31 runs on each of these test problems can be found in Fig. 10.

Figures 11 and 12 show the quality of the lower level optimal solution predictions made by the \(\Psi \)-mapping and the \(\varphi \)-mapping over the course of the algorithm. It is interesting to note that the \(\varphi \)-approximation is of better quality for the SMD1 test problem in Fig. 11; therefore, the prediction decisions are mostly made using the \(\varphi \)-approximation approach. However, for SMD13 in Fig. 12, which involves a difficult \(\varphi \)-mapping, the prediction decisions are made using the \(\Psi \)-approximation approach. Both mappings are found to improve as the generations progress. The two figures illustrate the adaptive nature of the BLEAQ-II algorithm in choosing the right approximation strategy based on the difficulties involved in a bilevel optimization problem.

Fig. 10
figure 10

Bar chart for overall function evaluations for 10-dimension SMD 1–8

Fig. 11
figure 11

Approximation error (in terms of Euclidean distance) of a predicted lower level optimal solution when using localized \(\Psi \) and \(\varphi \)-mapping during the intermediate generations of the BLEAQ-II algorithm on the 5-variable SMD1 test problem

Fig. 12
figure 12

Approximation error (in terms of Euclidean distance) of a predicted lower level optimal solution when using localized \(\Psi \) and \(\varphi \)-mapping during the intermediate generations of the BLEAQ-II algorithm on the 5-variable SMD13 test problem

8 An application problem

In this section, we discuss an application problem involving two companies, one a leader and the other a follower. The two companies produce multiple products with the objective of maximizing their individual profits. There is a hierarchy, with the leader company enjoying the first-mover advantage and the follower company observing the actions of the leader and then responding rationally. Both companies produce 5 products with limited resources. The leader company has complete knowledge about the follower company and wants to determine its optimal production given the response of the follower. The problem to be solved by the leader is given as follows:

$$\begin{aligned} \max _{x,y}&\Pi _u (x,y)\\ \text{ s.t. }&y \in \mathop {\mathrm{argmax}}\limits _{y} \lbrace {\Pi _l (x,y): g_j(x,y) \le 0, j=1,\ldots ,J} \rbrace , \\&G_k(x,y) \le 0, k=1,\ldots ,K,\\&x, y \ge 0 , \end{aligned}$$

where \(\Pi _u\) and \(\Pi _l\) denote the upper and lower level profit functions. The variables x and y are vectors representing the quantity produced by the upper and lower level firms respectively. The constraints represent the respective resources. A detailed formulation for the above problem and its solution can be found in Appendix C. The problem was solved 31 times and the BLEAQ-II algorithm found the optimum with the upper and lower level function accuracy of at least \(10^{-2}\) in each run and required 473 upper level function evaluations and 9278 lower level function evaluations on average.

9 Conclusions

In this paper, we have presented a computationally efficient evolutionary algorithm for solving bilevel optimization problems. The algorithm is based on iterative approximations of two theoretically motivated mappings: the lower level rational reaction mapping and the lower level optimal value function mapping. The paper discusses the pros and cons of utilizing these mappings in an evolutionary bilevel optimization algorithm by embedding them in a nested approach. Thereafter, an algorithm is developed that adaptively decides which mapping to use during execution, based on the characteristics of the bilevel optimization problem being solved. The proposed algorithm has been tested on a wide variety of bilevel test problems and performs significantly better than the other approaches in terms of computational requirements.