1 Introduction

Risk management and asset allocation form the core of portfolio management and are of great interest to investors, risk managers and regulators. In most cases, however, no closed-form solutions are available for risk and asset allocation problems unless further stringent hypotheses are made. More generally, several statistical problems rely on constrained optimization, such as inferential statistics (Isaacson and Madsen (1976), Dong and Wets (2000), Shapiro et al. (2014)), statistical process control (Feng et al. (2018), Glizer and Turetsky (2019)) and density estimation (Cramer (1989)), among others. In such cases, classical methods for solving constrained optimization problems often employ the Lagrangian method, which requires selecting the Lagrange multiplier so that the constraint is satisfied (see, e.g., Bertsekas (2014) for an illustration).

However, on the one hand, the literature reports that standard optimization methods often break down when problems involve non-differentiable functions, large and multi-modal search spaces, non-convex or non-linear programs, multiple objectives, high dimensions, or integral constraints. Charalambous (1978), for example, identifies a number of situations where classical numerical optimization techniques fail and standard methods produce undesirable results.

On the other hand, Simulated Annealing (SA) algorithms are classically used to solve optimization problems. From a strictly practical point of view, the cooling of metals towards an optimal temperature shares similarities with many financial optimization problems solved using SA, as shown by, e.g., Cramer (1989), Woodside-Oriakhi et al. (2011) and Castellano et al. (2021). As stated by Robert and Casella (2013), the SA algorithm borrowed its name from metallurgy and physics. In computational physics especially, the relationship with portfolio optimization is clear, since the problem solved by SA is an optimization problem in which the function to be minimized is called “energy” and the variance factor controlling the convergence is called “temperature”. In more technical terms, the underlying problems are basically the same, and SA algorithms are classically used to solve optimization problems in different areas of operations research such as, for instance: commodity queuing inventories (Shajin et al. (2021)), planning and release (Etgar et al. (1997), Etgar et al. (2017)), payment scheduling (He et al. (2009)), graph partitioning (D’Amico et al. (2002)), energy district design (Bergey et al. (2003)), political district design (Ricca and Simeone (2008)) and, more recently, forecasting the relationship between oil and gas (Ftiti et al. (2020)). SA also proves particularly useful in risk management, with applications in many fields, e.g., tunnel construction (Ryu et al. (2015)), disaster risk management (Edrisi et al. (2020)) and facility layouts under uncertainty (Tayal and Singh (2018)). In our context of financial applications, a large number of optimization problems in finance can be solved by applying optimization heuristics such as SA or Genetic Algorithms (GA), as for instance in Gilli and Schumann (2012).
These methods can handle non-convex optimization problems, overcoming difficulties such as multiple local optima and discontinuities in the objective function. For example, Cramer (1989) applies SA to Markowitz (1952) portfolio selection problems with realistic constraints, whilst Woodside-Oriakhi et al. (2011) use SA to impose a cardinality constraint on the classical portfolio optimization program. Bick et al. (2013) also employ SA for solving constrained consumption-investment optimization problems, whilst Porth et al. (2016) propose to improve risk diversification by combining portfolios of crop insurance policies with an SA approach to find the optimal allocation. More recently, Castellano et al. (2021) provide an approach to systemic risk minimization based on SA.

In this article, attention is devoted to optimization problems for which: 1) the target function is not necessarily differentiable; and 2) the constraints are characterized by non-analytically reducible integrals. The analytical intractability of the integral constraint is a major obstacle to the application of classical optimization techniques. We address this issue by exploring two alternative numerical optimization strategies. The first consists of solving an auxiliary (approximated) problem, while the second entails constructing an approximation algorithm. The model approximation approach further decomposes into two strategies, depending on the type of integral approximation technique employed, i.e. either simulation-based or deterministic. In the deterministic case, the approach generally involves replacing the non-analytically reducible integral with a numerical quadrature (see Shapiro et al. (2014) and Smith et al. (2015)). In the simulation-based approach, the integral constraint is reformulated as a stochastic constraint, and a Monte Carlo simulation, combined with another technique such as the Sample Average Approximation (SAA), is implemented to approximate the problem. On the basis of these auxiliary problems, simulation-based optimization procedures can then be applied. However, the need to concurrently draw samples from both the decision and auxiliary variable spaces can make the algorithms computationally expensive. To alleviate this problem, methods such as the Group Independent Metropolis Hastings (GIMH) of Beaumont (2003) and the Inhomogeneous Markov Chain (IMC) simulation of de Mello and Bayraksan (2014) may be employed. The approximation algorithm approach, in contrast, entails replacing the intractable target function by a (consistent or unbiased) estimator within the Markov Chain Monte Carlo (MCMC) algorithm.
This procedure is referred to as the Monte Carlo Within Metropolis (MCWM) algorithm, cf. Andrieu and Roberts (2009).

The primary goal of this article is to design efficient and reliable simulation-based algorithms for solving integral constrained optimization problems by finding Extended Saddle Points (ESP) of a relevant penalty function reformulation of the integral constrained problem, before presenting empirical results in several classical financial applications. Two SA algorithms are proposed, both relying on importance sampling strategies. Given a penalty function reformulation of the integral constrained problem, the first algorithm, called the Group Independence Metropolis Hastings Simulated Annealing (GIMHSA), requires the construction of an auxiliary problem upon which the GIMH method is implemented. The second, denominated the Monte Carlo Within Metropolis Simulated Annealing (MCWMSA), instead focuses on an approximation of the target (penalty) function used for the MCMC updating step of the MCWM method. We first compare standard SA combined with quadrature and the two new algorithms on a low-dimensional optimization problem given by the maximization of an integral constrained likelihood for a stress-strength reliability problem. Our empirical trials confirm the effectiveness of the two new algorithms, with convergence properties similar to those of standard SA. In higher dimensions, however, standard SA becomes less effective, while our proposed algorithms do not suffer from convergence issues. Our results indicate no clear differences in the convergence of the GIMHSA and MCWMSA (see Fig. 1), but we favor the MCWMSA, since it is easier to code and considerably less expensive in terms of computing time when the optimization problems are complex.

The SA algorithm we propose extends the Constrained SA (CSA) strategy based on the Extended Saddle Point (ESP) theory developed by Smith and Wong (2017) to an integral constrained setup. In contrast with the exogenously given penalty factor in the SA strategy proposed by Wah et al. (2006) and adopted in Smith et al. (2015), we obtain the penalty factors endogenously by augmenting the decision variables. More precisely, we optimize the penalty function by minimizing over the decision variable space and maximizing over the penalty factor space. In addition to the probabilistic descents in the decision variable space as in conventional SA (as in Juditsky and Nemirovski (2020)), this procedure therefore performs probabilistic ascents in the penalty factor subspace, using a unified method based on a piecewise version of the Metropolis acceptance probability to control descents and ascents; this technique offers the benefit of reducing the risk of the chain producing poor results. Embedded within the proposed SA-type algorithms is the simultaneous evaluation of the integral in the constraint, which is done using a Monte Carlo integration procedure. The effectiveness of the algorithms is first assessed by applying them to an inferential problem on stress-strength reliability with independent exponentiated exponential distributions. Our simulation study shows that our algorithms are robust procedures for solving integral constrained statistical optimization problems.

The article is organized as follows. Section 2 introduces the integral constrained optimization problem and discusses the adopted penalty function method. In Sect. 3 we describe the constrained simulated annealing and discuss its implementation. Section 4 is dedicated to empirical studies of the proposed algorithms, providing three applications using six datasets from previous publications. Section 5 concludes, whilst proofs and some robustness tests are left to the Appendix.

2 On integral constrained optimizations

Let \(\mathcal {X}=\{x: x\in \mathbb {R}^{n}\}\) be the solution space (i.e. the set of all feasible solutions), where x are bounded continuous variables. Let \(f: \mathcal {X}\rightarrow \mathbb {R}\) be a lower bounded objective function defined on the solution space, \(\mathcal {X}\). Furthermore, let \(g(x) = (g_{1}(x),\dots ,g_{I}(x))^{T}\) represent a vector of I equality constraint functions. The functions \(f(\cdot )\) and \(g(\cdot )\) are general functions that can be discontinuous and need not be available in closed-form. In our case, one of the equality constraints, say \(g_{1}\), is assumed to be characterized by an integral whose solution is not available in closed-form. The goal is to find a global minimum, \(x^{*}\), subject to the equality constraints \(g(x)=0\) (i.e., \(x^{*}\in \mathcal {X}\) such that \(g(x^{*})=0\) and \(f(x)\ge f(x^{*})\) for all \(x\in \mathcal {X}\) satisfying \(g(x)=0\)). We have the following mathematical representation of the problem:

$$\begin{aligned} \begin{aligned}&\min _{x\in \mathcal {X}} f(x)\\ {\text {subject to}}~&g_{i}(x)=0, ~~\forall ~ i=1,\dots , I, \end{aligned} \end{aligned}$$
(1)

where \(\displaystyle {g_{1}(x)=\int _{\mathcal {Y}} h(x,y)dy - \psi _{0}}\) with \(\psi _{0}\) a real constant, \(\mathcal {Y}\subset \mathbb {R}^{m}\) and \(h: \mathcal {X}\times \mathcal {Y}\rightarrow \mathbb {R}\) a real-valued measurable function.

Several constraint handling methods have been developed in the literature for dealing with constrained optimization problems (Zhang et al. (2015)). In this article, we adopt the penalty function method because of its relative ease of use and amenability to complex optimization problems. More precisely, we reformulate the integral constrained optimization problem in Eq. 1 into an unconstrained one using the non-differentiable penalty described in Charalambous (1978). Furthermore, by relying on Extended Saddle Points (ESP) theory (Wah et al. (2006)), an ESP solution of the following minmax problem:

$$\begin{aligned} \begin{aligned} \underset{{x\in \mathcal {X}}}{{\text {min}}}~\underset{{\lambda \in \mathbb {R}_{+}^{r}}}{{\text {max}}}~ L(x,\lambda )&=f(x) + \left( \sum _{i=1}^{r}(\lambda _{i}|g_{i}(x)|)^{p}\right) ^{1/p}, \end{aligned} \end{aligned}$$
(2)

corresponds to a solution of the problem in Eq. 1, where \(p\ge 1\), and \(\lambda _{i}> 0\), \(i=1,2,\dots ,r\) (with \(r=I\)), are penalty factors. Accordingly, stochastic optimization algorithms, such as the CSA proposed by Wah et al. (2006), may be applied. Unfortunately, the penalty function, \(L(x,\lambda )\), in Eq. 2 cannot be evaluated analytically due to the integral constraint. In practice, researchers approximate the constraint, \(g_{1}(x)\), by a numerical approximant, \(g_{1}^{M}(x)\), such that \(g_{1}^{M}(x){\longrightarrow } g_{1}(x)~ \forall x\) as M goes to infinity. We propose here to apply the Monte Carlo method for numerical integration, with a focus on importance sampling. Following this argument, the integral can be written in the form:

$$\begin{aligned} g_{1}(x)=\int _{\mathcal {Y}} h(x,y)dy - \psi _{0} = \int _{\mathcal {Y}} \dfrac{h(x,y)}{q(y;x)}q(y;x)dy - \psi _{0} = E_{q}\left[ \dfrac{h(x,Y)}{q(Y;x)} \right] -\psi _{0}, \end{aligned}$$
(3)

where \(q(y;x)\) is a probability density function on \(\mathcal {Y}\), \(Y{\sim } q(\cdot ;x)\) and \(E_{q}[\cdot ]\) denotes the expectation with respect to the importance density q. The expectation can be approximated by the importance sampling estimator, which reads:

$$\begin{aligned} g_{1}^{M}(x) = \dfrac{1}{M}\sum _{i=1}^{M}\dfrac{h(x,Y_{i})}{q(Y_{i};x)}, \end{aligned}$$
(4)

where \(Y_{i}{\mathop {\sim }\limits ^{iid}} q,\) \(i=1,\dots ,M\), is a set of auxiliary variables. Since \(g_{1}^{M}(x)\) converges to \(g_{1}(x)\) almost surely, uniformly in x, provided that \(E_{q^{M}}[|g_{1}^{M}(x)|]<\infty \) (see Homem-de-Mello (2008), Feng et al. (2018) and Shapiro et al. (2014)), the approximation can always be made more precise by increasing M.
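As a concrete illustration, the estimator in Eq. 4 can be sketched in a few lines of Python. The integrand \(h(x,y)=\exp (-(y-x)^{2}/2)\), the importance density (a normal centred at x with standard deviation 2) and \(\psi _{0}=\sqrt{2\pi }\) are all hypothetical choices of ours, picked so that the true value of \(g_{1}(x)\) is 0 for every x:

```python
import numpy as np

def g1_M(x, M=100_000, psi0=np.sqrt(2 * np.pi), rng=None):
    """Importance-sampling estimate (Eq. 4) of g_1(x) = integral of h(x,y) dy - psi_0.

    Illustrative choices (ours): h(x, y) = exp(-(y - x)^2 / 2), whose integral
    over the real line equals sqrt(2*pi) for every x, so the true g_1(x) is 0;
    the importance density q(.; x) is a normal centred at x with std 2.
    """
    rng = np.random.default_rng(rng)
    y = rng.normal(loc=x, scale=2.0, size=M)        # Y_i ~ q(.; x), iid
    h = np.exp(-0.5 * (y - x) ** 2)                 # h(x, Y_i)
    q = np.exp(-0.5 * ((y - x) / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
    return np.mean(h / q) - psi0                    # (1/M) * sum of h/q, minus psi_0

print(abs(g1_M(1.5, M=200_000, rng=0)))             # close to 0, shrinking as M grows
```

Increasing M tightens the estimate around 0, in line with the almost sure convergence noted above.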

In light of the above, two alternative implementations of the importance sampling estimator (4) are explored in the construction of stochastic algorithms for estimating the ESP value and solution of the minmax problem in Eq. 2. The first consists of approximating the analytically intractable penalty function required for computing the acceptance probability of a simulated annealing type algorithm update, while the second entails optimizing an approximation of the analytically intractable penalty function in the minmax problem of Eq. 2.

3 Simulated annealing for integral constrained optimization

In this section, we propose two SA type algorithms, based on the CSA developed by Wah et al. (2006), for solving the integral constrained problem in Eq. 1 by finding an ESP solution to the minmax problem in Eq. 2. We proceed with the construction of these algorithms in the following section.

3.1 Two simulated annealing type algorithms

Let us present the first algorithm proposal, i.e. the MCWMSA.

3.1.1 Monte Carlo Within Metropolis Simulated Annealing (MCWMSA)

The MCWMSA algorithm is based on the combination of the Monte Carlo Within Metropolis (MCWM) of Andrieu and Roberts (2009) and the CSA of Wah et al. (2006). The MCWM is an iterative sampling algorithm where the MCMC uses an approximation of the target function in its execution. The MCWMSA consists of replacing the analytically intractable penalty function, \(L(x,\lambda )\), required for the computation of the acceptance probability in the CSA by \(L^{M}(x,\lambda )\) such that:

$$\begin{aligned} \begin{aligned} L^{M}(x,\lambda )&=f(x) + \left( (\lambda _{1}|g_{1}^{M}(x)|)^{p} +\sum _{i=2}^{r}(\lambda _{i}|g_{i}(x)|)^{p}\right) ^{1/p}. \end{aligned} \end{aligned}$$
(5)
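For concreteness, evaluating the approximated penalty \(L^{M}(x,\lambda )\) of Eq. 5 can be sketched as follows (Python; the function names and the toy problem in the usage line are ours, not the article's):

```python
def penalty_LM(f, g_rest, g1_M, x, lam, p=1.0):
    """Approximated penalty L^M(x, lambda) of Eq. 5 (all names are ours).

    f      : objective function f(x)
    g_rest : list with the analytic constraints g_2, ..., g_r
    g1_M   : Monte Carlo estimator of the integral constraint g_1
    lam    : penalty factors (lambda_1, ..., lambda_r), all positive
    """
    terms = [(lam[0] * abs(g1_M(x))) ** p]                       # integral constraint
    terms += [(lam[i + 1] * abs(g(x))) ** p for i, g in enumerate(g_rest)]
    return f(x) + sum(terms) ** (1.0 / p)

# Toy usage: f(x) = x^2, a single "integral" constraint whose estimator is
# pretended to return x - 1 exactly, lambda_1 = 2 and p = 1:
val = penalty_LM(lambda x: x ** 2, [], lambda x: x - 1.0, 3.0, [2.0], p=1.0)
print(val)   # 3^2 + 2 * |3 - 1| = 13.0
```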

We now present the details of the first algorithm.

3.1.2 Algorithm 1

Let \((x,\lambda )\) be a point in the \(\mathcal {X}\times \mathbb {R}_{+}^{r}\) space and let \(G(x,\lambda ;x',\lambda ')\) be an arbitrary proposal transition function satisfying the condition \(G(x,\lambda ;x',\lambda ')>0\) if and only if \(G(x',\lambda ';x,\lambda )>0\). Denote the target function corresponding to the minmax problem 2 by:

$$\begin{aligned} \pi (x,\lambda )\propto \exp {(-L(x,\lambda )/T_{k})}, \end{aligned}$$
(6)

where \(T_{k}=T_{0}a^{k}\) (as in Wah et al. (2006)) is the temperature at the k-th iteration of the algorithm, \(T_{0}\) the initial temperature, and \(a\in (0,1)\) the cooling rate. After setting the length N of the Markov Chain and the initial state \((x,\lambda )\) with \(\lambda =0\), at the k-th iteration the MCWMSA consists of the following steps:

  1.

    Let \((x_{k,0},\lambda _{k,0})\) be the current state and \(T_{k}\) the current temperature, then use the MCWM algorithm to update the Markov Chain \((x_{k,\ell },\lambda _{k,\ell })\), \(\ell = 1,\ldots , N\), as follows:

    1.a

      Draw \(u_{k,\ell }\sim \mathcal {B}er(q)\) where \(\mathcal {B}er(\cdot )\) denotes a Bernoulli distribution with q the probability to make a proposal in the solution space and \(1-q\) the probability to make a proposal in the penalty space.

    1.b

      Draw a trial solution \((x_{k,\ell }^{*},\lambda _{k,\ell }^{*})\) from the proposal distribution \(G(x_{k,\ell -1}\), \(\lambda _{k,\ell -1};\) \(\cdot \), \(\cdot )\) given by:

      $$\begin{aligned} G(x_{k,\ell -1},\lambda _{k,\ell -1};x_{k,\ell }^{*},\lambda _{k,\ell }^{*}) = {\left\{ \begin{array}{ll} H(x_{k,\ell -1},x_{k,\ell }^{*})\delta (\lambda _{k,\ell -1} - \lambda _{k,\ell }^{*}) &{}{\text {if }u_{k,\ell }=1}\\ \tilde{H}(\lambda _{k,\ell -1},\lambda _{k,\ell }^{*})\delta (x_{k,\ell -1}-x_{k,\ell }^{*}) &{} {\text {if }u_{k,\ell }=0 }, \end{array}\right. } \end{aligned}$$
      (7)

      where \(\delta (x-y)\) denotes a Dirac function which takes the value 1 if \(x=y\) and 0 otherwise. \(H(x,\cdot )\) and \(\tilde{H}(\lambda ,\cdot )\) are, respectively, the proposal distributions for generating samples of the decision variables and the penalty variables. A detailed description of the proposal distributions \(H(\cdot ,\cdot )\) and \(\tilde{H}(\cdot ,\cdot )\) is presented in Subsection 3.1.5 below.

    1.c

      Draw importance samples \(y_{k,\ell ,1},\dots ,y_{k,\ell ,M}\) from \(q(\cdot \,;x_{k,\ell -1})\) and \(y_{k,\ell ,1}^{*}\), \(\dots \), \(y_{k,\ell ,M}^{*}\) from \(q(\cdot \,;x_{k,\ell }^{*})\).

      Then compute:

      $$\begin{aligned} w^{M}(x_{k,\ell }^{*},\lambda _{k,\ell }^{*};x_{k,\ell -1},\lambda _{k,\ell -1}) = {\left\{ \begin{array}{ll} \pi ^{M}(x_{k,\ell }^{*},\lambda _{k,\ell }^{*}) G_{l,l-1}&{}{\text {if }u_{k,\ell }=1}\\ \pi ^{M}(x_{k,\ell }^{*},\lambda _{k,\ell }^{*})^{-1} G_{l,l-1}&{}{\text {if }u_{k,\ell }=0},\\ \end{array}\right. } \end{aligned}$$
      (8)

      where: \(G_{l,l-1} = G(x_{k,\ell }^{*},\lambda _{k,\ell }^{*};x_{k,\ell -1},\lambda _{k,\ell -1})\), and:

      $$\begin{aligned} w^{M}(x_{k,\ell -1},\lambda _{k,\ell -1};x_{k,\ell }^{*},\lambda _{k,\ell }^{*}) = {\left\{ \begin{array}{ll} \pi ^{M}(x_{k,\ell -1},\lambda _{k,\ell -1}) G_{l-1,l} &{}{\text { if }u_{k,\ell }=1}\\ \pi ^{M}(x_{k,\ell -1},\lambda _{k,\ell -1})^{-1}G_{l-1,l}&{}{\text { if }u_{k,\ell }=0},\\ \end{array}\right. } \end{aligned}$$
      (9)

      where: \( G_{l-1,l} = G(x_{k,\ell -1},\lambda _{k,\ell -1};x_{k,\ell }^{*},\lambda _{k,\ell }^{*}). \)

      Then compute:

      $$\begin{aligned} \pi ^{M}(x_{k,\ell -1},\lambda _{k,\ell -1}) = \exp (-L^{M}(x_{k,\ell -1},\lambda _{k,\ell -1})/T_{k}), \end{aligned}$$

      and:

      $$\begin{aligned} \pi ^{M}(x_{k,\ell }^{*},\lambda _{k,\ell }^{*}) = \exp (-L^{M}(x_{k,\ell }^{*},\lambda _{k,\ell }^{*})/T_{k}), \end{aligned}$$

      where:

      $$\begin{aligned} g_{1}^{M}(x_{k,\ell -1}) = \dfrac{1}{M}\sum _{i=1}^{M}\dfrac{h(x_{k,\ell -1},y_{k,\ell ,i})}{q(y_{k,\ell ,i};x_{k,\ell -1})}\,\, \hbox {and}\,\, g_{1}^{M}(x_{k,\ell }^{*}) = \dfrac{1}{M}\sum _{i=1}^{M}\dfrac{h(x_{k,\ell }^{*},y_{k,\ell ,i}^{*})}{q(y_{k,\ell ,i}^{*};x_{k,\ell }^{*} )}. \end{aligned}$$

      Accept the trial solution \((x_{k,\ell }^{*},\lambda _{k,\ell }^{*})\) with probability:

      $$\begin{aligned} A^{M}_{T_{k}}(x_{k,\ell -1},\lambda _{k,\ell -1};x_{k,\ell }^{*},\lambda _{k,\ell }^{*}) = \min \left\{ \dfrac{w^{M}(x_{k,\ell }^{*},\lambda _{k,\ell }^{*};x_{k,\ell -1},\lambda _{k,\ell -1}) }{w^{M}(x_{k,\ell -1},\lambda _{k,\ell -1};x_{k,\ell }^{*},\lambda _{k,\ell }^{*})},1\right\} , \end{aligned}$$
      (10)

      or reject it with probability \((1-A_{T_{k}}^{M}(x_{k,\ell -1},\lambda _{k,\ell -1};x_{k,\ell }^{*},\lambda _{k,\ell }^{*}))\).

  2.

    Stop the MCWMSA and keep both the last solution \((x_{k,N},\lambda _{k,N})\) and the corresponding estimated value of the minmax problem in Eq. 2 if the current temperature \(T_{k}\) is below a threshold value \(T^{*}\); otherwise, update k to \(k+1\), and proceed to step (1) with \((x_{k+1,0},\lambda _{k+1,0}) = (x_{k,N},\lambda _{k,N})\).

Algorithm#1 in Appendix C provides a pseudo-code for the MCWMSA.
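The steps above can be sketched end-to-end on a toy problem: minimize \(f(x)=(x-2)^{2}\) subject to the integral constraint \(g_{1}(x)=\int _{0}^{1}2xy\,dy-1=x-1=0\), estimated by Monte Carlo with a uniform importance density. All tuning values (proposal scales, N, cooling rate) are illustrative choices of ours, not the article's settings:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):                       # objective: minimize (x - 2)^2
    return (x - 2.0) ** 2

def g1_M(x, M=500):             # Eq. 4 estimate of g_1(x) = int_0^1 2*x*y dy - 1 = x - 1
    y = rng.uniform(size=M)     # Y_i ~ q = Uniform(0, 1)
    return np.mean(2.0 * x * y) - 1.0

def LM(x, lam):                 # approximated penalty L^M of Eq. 5, with p = 1
    return f(x) + lam * abs(g1_M(x))

def mcwmsa(T0=1.0, a=0.9, T_star=1e-3, N=200, q_prob=0.95):
    x, lam, T = 0.0, 0.0, T0
    while T > T_star:
        for _ in range(N):
            if rng.random() < q_prob:            # step 1.a: move in the solution space ...
                x_new, lam_new, sign = x + 0.5 * rng.standard_cauchy(), lam, +1.0
            else:                                # ... or in the penalty space
                x_new, lam_new, sign = x, max(lam + rng.uniform(-1.0, 1.0), 0.0), -1.0
            # step 1.c (MCWM): L^M is re-estimated at BOTH states at every iteration
            delta = LM(x_new, lam_new) - LM(x, lam)
            log_r = -sign * delta / T            # descent in x (sign=+1), ascent in lam (sign=-1)
            if log_r >= 0 or rng.random() < np.exp(log_r):
                x, lam = x_new, lam_new
        T *= a                                   # step 2: geometric cooling, T_k = T0 * a^k
    return x, lam

x_opt, lam_opt = mcwmsa()
print(round(x_opt, 2))          # close to the constrained minimizer x* = 1
```

Note how the penalty factor is never set by hand: the ascent moves ratchet it upward as long as the (estimated) constraint is violated, which is the endogenous-penalty mechanism described in the introduction.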

3.1.3 Group Independence Metropolis Hastings Simulated Annealing (GIMHSA)

Let us now turn to the presentation of the second proposed algorithm, i.e. the GIMHSA.

The GIMHSA is similar in spirit to the MCWMSA but differs in that the algorithm operates directly on an approximation of the minmax problem in Eq. 2 given by the following:

$$\begin{aligned} \begin{aligned} \underset{{x\in \mathcal {X}}}{{\text {min}}}~\underset{{\lambda \in \mathbb {R}_{+}^{r}}}{{\text {max}}}~ L^{M}(x,\lambda )&=f(x) + \left( (\lambda _{1}|g_{1}^{M}(x)|)^{p} +\sum _{i=2}^{r}(\lambda _{i}|g_{i}(x)|)^{p}\right) ^{1/p}. \end{aligned} \end{aligned}$$
(11)

Since the minmax problem in Eq. 11 depends on the set of random vectors \(Y_{1:M}=\{Y_{1},\dots ,Y_{M}\}\), its ESP value, \(L^{*M}\), is an estimator of the ESP value, \(L^{*}\), of the problem in Eq. 2, and an ESP solution pair \((x^{*M},\lambda ^{*M})\) from the set of ESP solutions to problem 11 is an estimator of an ESP solution pair, \((x^{*},\lambda ^{*})\), of the problem 2.

We now present the second proposed algorithm in detail.

3.1.4 Algorithm 2

Let \((x,\lambda )\) be a point in the \(\mathcal {X}\times \mathbb {R}_{+}^{r}\) space and let \(y_{1:M}=\{y_{1},\dots ,y_{M}\}\) be a random sample of size M with \(y_{i}\in \mathcal {Y}\) and \(Y_{i}\sim q_{x},~\forall i\). Also, let \(G(x,\lambda ;x',\lambda ')\) be an arbitrary proposal transition function satisfying the condition \(G(x,\lambda ;x',\lambda ')>0\) if and only if \(G(x',\lambda ';x,\lambda )>0\). Denote the target function corresponding to the minmax problem 11 by:

$$\begin{aligned} \pi ^{M}(x,\lambda )\propto \exp {(-L^{M}(x,\lambda )/T_{k})}, \end{aligned}$$
(12)

where \(T_{k}=T_{0}a^{k}\) is the temperature at the k-th iteration of the algorithm, \(T_{0}\) the initial temperature, and \(a\in (0,1)\) the cooling rate. After setting the length of the Markov Chain N and the initial state \((x,\lambda )\) with \(\lambda =0\), the GIMHSA iterates the following steps:

  1.

    Let \((x,\lambda )\) be the current state, \(y_{1:M}=\{y_{1},\dots ,y_{M}\}\) the current collection of importance samples drawn independently from \(q_{x}\), and \(T_{k}\) the current temperature. Compute:

    $$\begin{aligned} \pi ^{M}(x,\lambda ) \propto \exp (-L^{M}(x,\lambda )/T_{k}),~~{\text {with}}~~ g_{1}^{M}(x) = \dfrac{1}{M}\sum _{i=1}^{M}\dfrac{h(x,y_{i})}{q_{x}(y_{i})}, \end{aligned}$$

    then use the Metropolis Hastings algorithm to update the Markov Chain N times as follows:

    1.a

      Draw \(u\sim \mathcal {B}er(q)\) where q is the probability to make a proposal in the solution space and \(1-q\) the probability to make a proposal in the penalty space.

    1.b

      Draw a trial solution \((x^{*},\lambda ^{*})\) from the proposal distribution \(G(x,\lambda ;\cdot ,\cdot )\) given by:

      $$\begin{aligned} G(x,\lambda ;x^{*},\lambda ^{*}) = {\left\{ \begin{array}{ll} H(x,x^{*})\delta _{\lambda }(\lambda ^{*}) &{}{\text {if }u=1}\\ \tilde{H}(\lambda ,\lambda ^{*})\delta _{x}(x^{*}) &{} {\text {if }u=0 }, \end{array}\right. } \end{aligned}$$
      (13)

      where \(H(x,\cdot )\) and \(\tilde{H}(\lambda ,\cdot )\) are, respectively, the proposal distributions for generating samples of the decision variables and the penalty variables. A detailed description of the proposal distributions \(H(\cdot ,\cdot )\) and \(\tilde{H}(\cdot ,\cdot )\) is presented in Sect. 3.1.5.

    1.c

      Draw importance samples \(y_{1}^{*},\dots ,y_{M}^{*}\) from \(q_{x^{*}}\). Then compute:

      $$\begin{aligned} w^{M}(x^{*},\lambda ^{*};x,\lambda ) = {\left\{ \begin{array}{ll} \pi ^{M}(x^{*},\lambda ^{*}) G(x^{*},\lambda ^{*};x,\lambda ) &{}{\text { if }u=1}\\ \pi ^{M}(x^{*},\lambda ^{*})^{-1} G(x^{*},\lambda ^{*};x,\lambda ) &{}{\text { if }u=0}, \end{array}\right. } \end{aligned}$$
      (14)

      where:

      $$\begin{aligned} \pi ^{M}(x^{*},\lambda ^{*}) \propto \exp (-L^{M}(x^{*},\lambda ^{*})/T_{k}),~~{\text {with}}~~ g_{1}^{M}(x^{*}) = \dfrac{1}{M}\sum _{i=1}^{M}\dfrac{h(x^{*},y_{i}^{*})}{q_{x^{*}}(y_{i}^{*})}. \end{aligned}$$

      Accept the trial solution \((x^{*},\lambda ^{*})\) and \(y_{1:M}^{*}\) with probability:

      $$\begin{aligned} A_{T_{k}}^{M}(x,\lambda ;x^{*},\lambda ^{*}) = \min \left\{ \dfrac{w^{M}(x^{*},\lambda ^{*};x,\lambda ) }{w^{M}(x,\lambda ;x^{*},\lambda ^{*})},1\right\} , \end{aligned}$$
      (15)

      or reject it with probability \((1-A_{T_{k}}^{M}(x,\lambda ;x^{*},\lambda ^{*}))\).

  2.

    Stop the GIMHSA and keep the last solution \((x,\lambda )\), the samples \(y_{1:M}\) and the corresponding estimated value of the minmax problem in Eq. 11 if the current temperature \(T_{k}\) is below a threshold value (e.g. \(10^{-6}\)); otherwise, update k to \(k+1\), and proceed to step (1).

It is worth noting that the GIMHSA procedure may also be viewed as an approximation of a Metropolis Hastings algorithm with target density \(\pi (x,\lambda )\), making it a special case of the MCWMSA algorithm. This dual interpretation of the GIMHSA presents an opportunity for designing algorithms that inherit the potential efficiency of a Markov Chain with a one-step transition kernel (denoted later on \(P_{GIMHSA}\), see sub-Section 3.2 below) while still being able to produce samples from the target function rather than from an approximation. A pseudo-code for the GIMHSA is provided in Algorithm#2 in Appendix C.
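To isolate the GIMHSA's distinguishing feature (the importance samples \(y_{1:M}\) belong to the chain state and are refreshed only upon acceptance), here is a minimal Python sketch of one solution-space update on a toy problem: minimize \((x-2)^{2}\) subject to \(g_{1}(x)=\int _{0}^{1}2xy\,dy-1=x-1=0\), with the penalty factor held fixed for brevity; all tuning values are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(7)

def g1_M_from(x, y):
    """Estimate of g_1(x) = int_0^1 2*x*y dy - 1 = x - 1 from GIVEN samples y."""
    return np.mean(2.0 * x * y) - 1.0

def gimhsa_step(x, y, lam, T, M=500):
    """One solution-space update: y_{1:M} is part of the state (GIMH feature)."""
    x_new = x + 0.5 * rng.standard_cauchy()
    y_new = rng.uniform(size=M)                  # fresh samples for the proposal only
    L_cur = (x - 2.0) ** 2 + lam * abs(g1_M_from(x, y))       # current estimate reused
    L_new = (x_new - 2.0) ** 2 + lam * abs(g1_M_from(x_new, y_new))
    log_r = -(L_new - L_cur) / T                 # probabilistic descent in x
    if log_r >= 0 or rng.random() < np.exp(log_r):
        return x_new, y_new                      # accept (x*, y*_{1:M}) jointly
    return x, y                                  # keep the state AND its samples

# Cooling loop: the penalty factor is fixed at 10 here for simplicity
x, y, T = 0.0, rng.uniform(size=500), 1.0
for _ in range(4000):
    x, y = gimhsa_step(x, y, lam=10.0, T=T)
    T = max(0.998 * T, 1e-3)
print(round(x, 2))   # close to the constrained minimizer x* = 1
```

Contrast with the MCWM-style update, which would re-estimate the penalty at the current state with fresh draws at every iteration.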

3.1.5 Proposal distributions \(H(x,x')\) and \(\tilde{H}(\lambda ,\lambda ')\)

In generating random points, \((x',\lambda ')\), from the neighbourhood, \(\mathcal {N}(x,\lambda )\), of the current point \((x,\lambda )\) using the proposal distribution \(G(x,\lambda ;x',\lambda ')\), we start, at each iteration step, by deciding either to update the decision variables, x, keeping the penalty factors fixed, or alternatively to update the penalty factors, \(\lambda \), keeping the decision variables constant. In our Metropolis Hastings SA version, the decision variables, x, need to be updated more frequently than the penalty factors, \(\lambda \), to enforce the convergence of the algorithm. Thus, we choose large values of the generation ratio, q, for example \(q=20n/(20n+r)\) with n and r being, respectively, the number of decision variables and the number of constraints. We then draw trial solutions for the decision variables from \(H(x,\cdot )\) and for the penalty factors from \(\tilde{H}(\lambda ,\cdot )\). Each of these proposal distributions is described below.

As regards the proposal distribution \(H(x,\cdot )\) for the decision variables, given the challenging task of finding a single good proposal distribution in n dimensions for generating \(x'\) from the neighbourhood of x, we instead propose n proposal distributions, one for each component of x. To improve the efficiency of the algorithm, we only let \(x'\) differ from x in the j-th component at each iteration step. Let j be a uniform random number in \(\{1,2,\dots ,n\}\) and \(\{\sigma _{j}: 1\le j\le n\}\) a set of parameters, where \(\sigma _{j}\) represents the scale parameter of the proposal distribution for the j-th decision variable. We generate a trial solution for the decision variables as follows:

$$\begin{aligned} \begin{aligned} x'&= x + (\theta _{1}e_{j,1}^{(1)} , \theta _{2}e_{j,2}^{(1)} ,\dots , \theta _{n}e_{j,n}^{(1)})^{T}, \end{aligned} \end{aligned}$$
(16)

where \(\mathbf{{e}}_{j}^{(1)}=(e_{j,1}^{(1)},\dots ,e_{j,n}^{(1)})\) is the j-th column of an \(n\times n\) identity matrix and \(\theta _{j}\) is Cauchy distributed with density:

$$\begin{aligned} f_{\theta }(\theta _{j}) = \dfrac{1}{\pi } \dfrac{\sigma _{j}}{\sigma _{j}^{2} +\theta _{j}^{2}}. \end{aligned}$$

In addition, we dynamically update the scale parameter, \(\sigma _{j}\), of the Cauchy proposal corresponding to the j-th component of the decision variable according to the following rule (see Andrieu and Thoms (2008)):

$$\begin{aligned} \log {(\sigma _{j}^{(k+1)})} = \log {(\sigma _{j}^{(k)})} + \dfrac{1}{(k+1)}({\overline{A}}_{j,T_{k}} - A), \end{aligned}$$

where \({\overline{A}}_{j,T_{k}}\) denotes the simple average of the acceptance probabilities associated with the j-th decision variable at the k-th temperature step, and A is the target acceptance rate, set to 0.44 as in Andrieu and Thoms (2008).
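A minimal sketch of this component-wise Cauchy proposal (Eq. 16) and the adaptive-scale rule, with hypothetical variable names of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_x(x, j, sigma):
    """Eq. 16: perturb only the j-th component of x with theta_j ~ Cauchy(0, sigma_j)."""
    x_new = x.copy()
    x_new[j] += sigma[j] * rng.standard_cauchy()
    return x_new

def adapt_sigma(sigma_j, k, mean_accept, target=0.44):
    """log sigma_j^(k+1) = log sigma_j^(k) + (A_bar - A) / (k + 1) (Andrieu-Thoms rule)."""
    return sigma_j * np.exp((mean_accept - target) / (k + 1))

x_new = propose_x(np.zeros(3), j=1, sigma=np.ones(3))
print(x_new[0], x_new[2])                        # 0.0 0.0 -- only component j moved
print(adapt_sigma(1.0, k=0, mean_accept=0.20))   # exp(-0.24), about 0.787: the scale shrinks
```

The \(1/(k+1)\) step size makes the adaptation diminish over temperature steps, so the scale settles once the acceptance rate hovers around the 0.44 target.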

In generating the trial point \(\lambda '\) from \(\tilde{H}(\lambda ,\cdot )\), we follow a strategy similar to that used for the decision variables. More precisely, we generate \(\lambda '\) so that it differs from \(\lambda \) only in the j-th component, where j is a uniform random number on the set \(\{1,2,\dots ,I\}\). A trial point for \(\lambda _{j}'\) is drawn from a uniform distribution over the interval \((a_{j}^{(\lambda )}, b_{j}^{(\lambda )})\), with \(a_{j}^{(\lambda )}= \max \{\lambda _{j}-w_{j}^{(\lambda )}h_{j}(z), 0\}\), \(b_{j}^{(\lambda )}= \min \{\lambda _{j}+w_{j}^{(\lambda )}h_{j}(z), 10{,}000\}\), where \(w_{j}^{(\lambda )}\) represents a scale parameter that controls the width of the search space. Note that the neighbourhood of \(\lambda _{j}\) from which \(\lambda _{j}'\) is generated is adjusted according to the degree of constraint violation. Thus, whenever \(g_{j}(x)=0\) is satisfied, \(\lambda _{j}\) does not need to be updated. Furthermore, we adjust the scale parameter, \(w_{j}^{(\lambda )}\), according to the following rule:

$$\begin{aligned} w_{j}^{(\lambda )} = {\left\{ \begin{array}{ll} \eta _{0}w_{j}^{(\lambda )} &{}~~{\text {if }g_{j}(x)>\tau _{0}T}\\ \eta _{1}w_{j}^{(\lambda )} &{}~~{\text {if }g_{j}(x)<\tau _{1}T}\\ {\text {unchanged}} &{}~~{\text {otherwise}}, \end{array}\right. } \end{aligned}$$
(17)

where we set, experimentally, \(\eta _{0}=1.25\), \(\eta _{1}=0.95\), \(\tau _{0}=1.0\) and \(\tau _{1}=0.01\) so that the MH chain does not move too fast in the penalty subspace. That is, when \(g_{j}(x)\) is reduced too quickly (i.e. \(g_{j}(x)< \tau _{1} T \)), \(g_{j}(x)\) is over-weighted, leading to a possibly poor objective value or to difficulties in satisfying other under-weighted constraints; hence, we reduce \(\lambda _{j}\)’s neighborhood. In contrast, if \(g_{j}(x)\) is reduced too slowly (i.e. \(g_{j}(x)> \tau _{0} T \)), we enlarge \(\lambda _{j}\)’s neighborhood in order to improve the constraint's chance of satisfaction. Note that \(w_{j}^{(\lambda )}\) is adjusted using T as a reference because constraint violations are expected to decrease as T decreases.
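The update rule in Eq. 17 translates directly into code; a minimal sketch, with the article's experimental values as defaults:

```python
def adapt_width(w, g_val, T, eta0=1.25, eta1=0.95, tau0=1.0, tau1=0.01):
    """Eq. 17: adjust the penalty-space proposal width w_j, using T as reference."""
    if g_val > tau0 * T:      # constraint reduced too slowly: enlarge the neighbourhood
        return eta0 * w
    if g_val < tau1 * T:      # constraint reduced too quickly: shrink the neighbourhood
        return eta1 * w
    return w                  # otherwise leave the width unchanged

print(adapt_width(1.0, g_val=2.0, T=1.0))     # 1.25
print(adapt_width(1.0, g_val=0.005, T=1.0))   # 0.95
print(adapt_width(1.0, g_val=0.5, T=1.0))     # 1.0
```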

3.2 Asymptotic convergence of MCWMSA and GIMHSA

In this section, we show the asymptotic convergence of both the MCWMSA and GIMHSA to a constrained global minimum of a discrete constrained optimization problem.

3.2.1 Asymptotic convergence of MCWMSA

The MCWMSA algorithm (Algorithm 1) describes a Markov Chain \((x_{i},\lambda _{i})\), \(i=1,2,\ldots \), in the state vector \((x,\lambda )\), since the auxiliary variables y are drawn independently at each iteration. We model this chain as an Inhomogeneous Markov Chain consisting of a sequence of Homogeneous Markov Chains of finite length, each at a specific temperature in a cooling schedule. The one-step transition kernel, \(P_{MCWMSA}(x,\lambda ;x',\lambda ')\), of the MCWMSA algorithm from \((x,\lambda )\) to \((x',\lambda ')\) is:

$$\begin{aligned}&P_{MCWMSA}(x,\lambda ;x',\lambda ') = \int _{\mathcal {Y}^{M}\times \mathcal {Y}^{M}}G(x,\lambda ;x',\lambda ')A(x,\lambda ,y;x',\lambda ',y')dydy' \nonumber \\&\qquad + \delta _{(x',\lambda ')}(x,\lambda )\left( 1 -\int _{\mathcal {N}(x,\lambda )}\int _{\mathcal {Y}^{M}\times \mathcal {Y}^{M}} A(x,\lambda ,y;x',\lambda ',y')dydy'G(x,\lambda ;dx',d\lambda ')\right) ,\nonumber \\ \end{aligned}$$
(18)

where:

$$\begin{aligned} A(x,\lambda ,y;x',\lambda ',y')=A_{T}^{M}(x,\lambda ,y;x',\lambda ',y') q^{M}_{x}(y)q^{M}_{x'}(y'), \end{aligned}$$

and \(\mathcal {N}(x,\lambda )\) denotes the neighborhood of \((x,\lambda )\), \(q_{x}^{M}\) the joint probability distribution of the importance densities and \(G(x,\lambda ;x',\lambda ')\), the proposal distribution of \((x',\lambda ')\in \mathcal {N}(x,\lambda )\) conditional on \((x,\lambda )\) which satisfies:

$$\begin{aligned} G(x,\lambda ;x',\lambda ') = G(x',\lambda ';x,\lambda )\,\,\hbox {and}\,\, 0\le G(x,\lambda ;x',\lambda ')\le 1, \end{aligned}$$
(19)

with:

$$\begin{aligned} \int _{\mathcal {N}(x,\lambda )} G(x,\lambda ;dx',d\lambda ')=1. \end{aligned}$$
(20)

Since the acceptance probability in the MCWMSA depends upon an approximation of the target function, the invariant distribution of \(P_{MCWMSA}\) is not \(\pi {(x,\lambda )}\) and therefore the MCWMSA will not produce samples from \(\pi {(x,\lambda )}\), even in a steady state. However, by relying on the ergodicity of the one-step transition kernel, \(P_{MCWM}\), of an MCWM sampling procedure, Andrieu and Roberts (2009) show that these samples are asymptotically distributed according to an approximation of \(\pi {(x,\lambda )}\), which becomes all the more precise as the sample size grows large. In view of this, it suffices to show that the Markov Chain is ergodic.

Theorem 1

Let \((x,\lambda )\) denote the state of the inhomogeneous Algorithm 1. The couple \((x,\lambda )\) is strongly ergodic if the annealing schedule \(\{T_{k}: k=0,1,\dots \}\) satisfies:

$$\begin{aligned} T_{k}\ge \dfrac{N\varDelta _{L^{M}}}{\log {(k+1)}}, \end{aligned}$$
(21)

where:

$$\begin{aligned} \left\{ \begin{array}{l} T_{k}>T_{k+1}\,\hbox {and}\,\displaystyle {\lim \limits _{k\rightarrow \infty }T_{k}=0}\\ {\displaystyle { \varDelta _{L^{M}}=2\max _{(x',\lambda ')\in \mathcal {N}(x,\lambda )} \{|L^{M}(x',\lambda ') - L^{M}(x,\lambda )|\}}}. \end{array} \right. \end{aligned}$$

Proof

See Appendix A. \(\square \)

3.2.2 Asymptotic convergence of GIMHSA

In this case, the sequence \((x_{i},\lambda _{i})\), \(i=1,2,\ldots \), is no longer a Markov Chain, but rather the triplet \((x_{i},\lambda _{i},y_{i})\), \(i=1,2,\ldots \), is. The GIMHSA defines a Markov Chain with a one-step transition kernel, \(P_{GIMHSA}(x,\lambda ,y;x',\lambda ',y')\), given by:

$$\begin{aligned}&P_{GIMHSA}(x,\lambda ,y;x',\lambda ',y') = G(x,\lambda ;x',\lambda ')A^{M}_{T}(x,\lambda ;x',\lambda ')q^{M}_{x'}(y') \nonumber \\&\qquad + \delta _{(x',\lambda ',y')}(x,\lambda ,y) \left( 1-\int _{\mathcal {N}(x,\lambda )}\int _{\mathcal {Y}^{M}} A(x,\lambda ,y;x',\lambda ',y')dy'G(x,\lambda ;dx',d\lambda ')\right) ,\nonumber \\ \end{aligned}$$
(22)

where:

$$\begin{aligned} A(x,\lambda ,y;x',\lambda ',y')=A_{T}^{M}(x,\lambda ;x',\lambda ')q^{M}_{x'}(y'). \end{aligned}$$

Since there exists an integer M so that \(P_{GIMHSA}^{M}(x,\lambda ,y;x'\),\(\lambda ',y')>0\) for all \((x,\lambda ,y),(x',\lambda ',y')\), the chain is irreducible. The chain is also aperiodic; thus the samples \((x_{i},\lambda _{i})\) produced are distributed, in the limit as \(i\rightarrow \infty \), according to the target function \(\pi ^{M}(x,\lambda )\). The GIMH can be viewed as a special case of the MCWM sampling strategy, so the irreducibility and aperiodicity of the GIMHSA sampling scheme can easily be deduced from the weak ergodicity property of the MCWMSA algorithm.

4 Numerical illustrations

In this section, we illustrate the efficiency of our algorithms by first conducting an inferential exercise on reliability in stress-strength problems and, second, optimization exercises for financial portfolio allocation problems.

To begin with, we make explicit all the parameters required for a successful implementation of the SA algorithms. These include the parameters constituting the cooling schedule, i.e. the initial temperature (\(T_{0}\)), the cooling function, the Metropolis loop length (\(N_{T}\)), and the stopping condition. The logarithmic cooling schedule is well known to guarantee convergence; it requires, however, infinitely slow cooling. Instead, most users prefer the geometric cooling proposed by Kirkpatrick et al. (1983), given by:

$$\begin{aligned} T_{k} = a T_{k-1}, ~~ k=1,2,\dots , M, \end{aligned}$$

where \(T_{0}\) and a, respectively, denote the initial temperature and the cooling rate. Consistent with the literature, the cooling rate is set to 0.95 in this article; values of \(0.8< a < 0.99\) have been shown to be quite effective across many applications (e.g., Stander and Silverman (1994), Wah et al. (2006)). Results of robustness checks are made available in Appendix B. The closer the value of the cooling factor a is to 1, the slower the rate of decay. As for the initial temperature, \(T_{0}\), we choose a value ensuring that around 60% of initial uphill moves are accepted (as in Rayward-Smith et al. (1996)), or similarly an acceptance probability of about 0.90 regardless of the value of the cost function (e.g., Rutenbar (1989), Bohachevsky et al. (1986)). To ensure that the simulated annealing spends sufficient time thoroughly exploring local optima at each temperature, we fixed a priori the number of iterations, \(N_T\), to 1,000 (as in Fisk et al. (2022)). The process is terminated at a temperature, \(T^{*}\), low enough that no useful improvements can be expected anymore. Arguably, higher values of \(T^{*}\) would downgrade the precision of the output; on the contrary, smaller values would increase the accuracy of the numerical solution, but at the cost of a larger computing time. On this basis, we fixed the terminating temperature to \(10^{-6}\) (see van Laarhoven et al. (1992) and Fisk et al. (2022)).
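The geometric schedule is simple enough to sketch directly. The snippet below uses the settings adopted in this article (\(a=0.95\), \(T^{*}=10^{-6}\)) and, as one plausible choice, the initial temperature \(T_{0}=10\) reported in Table 2; it is an illustration, not the authors' implementation.

```python
# Sketch of the geometric cooling schedule T_k = a * T_{k-1}, run until the
# terminating temperature T* is reached.
def cooling_schedule(T0=10.0, a=0.95, T_stop=1e-6):
    """Yield the successive temperatures of the geometric schedule."""
    T = T0
    while T > T_stop:
        yield T
        T *= a

temps = list(cooling_schedule())
# With T0 = 10, a = 0.95 and T* = 1e-6, the schedule visits 315 temperature
# stages; at each stage the Metropolis loop runs N_T = 1,000 iterations.
```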

4.1 First application: reproduction of inference and reliability in a stress-strength problem

For a better demonstration of the two new algorithms, we first employ a low-dimensional optimization problem given by the maximization of an integral-constrained likelihood for a stress-strength reliability problem. We consider the inferential exercise on reliability in stress-strength problems with independent Exponentiated Exponential (EE) distributions (as in Gupta and Kundu (2001)) studied by Smith et al. (2015). This problem fits the form of Eq. 1 and reads:

$$\begin{aligned} \begin{aligned}&\max _{\varvec{\phi }\in \mathcal {X}}~ l(\varvec{\phi };x,y)\\ {\text {subject to}}~&\mathcal {R} - \psi _{0}=0, \end{aligned} \end{aligned}$$
(23)

where \(l(\cdot )\) is the Log-likelihood, \(\varvec{\phi }\) the parameter vector, x and y are, respectively, outcomes of the random variables X and Y, and \(\psi _0\) the target reliability value.

Given two random samples \(x=\{x_{1},\dots ,x_{n}\}\) and \(y=\{y_{1},\dots , y_{m}\}\) drawn, respectively, from two independent EE distributionsFootnote 2, \(F_{X}(\cdot )\) and \(F_{Y}(\cdot )\), \(l({\varvec{\phi }};x,y)\) in Eq. 23 is given by:

$$\begin{aligned} \begin{aligned} l({\varvec{\phi }}; x, y)&= l\left( \alpha _{1}, \beta _{1}, \alpha _{2}, \beta _{2}; x, y\right) \\&= n \log \alpha _{1}+n \log \beta _{1}+\left( \alpha _{1}-1\right) \sum _{i=1}^{n} \log \left( 1-e^{-\beta _{1} x_{i}}\right) -\beta _{1} \sum _{i=1}^{n} x_{i} \\&\quad +m \log \alpha _{2}+m \log \beta _{2}+\left( \alpha _{2}-1\right) \sum _{j=1}^{m} \log \left( 1-e^{-\beta _{2} y_{j}}\right) -\beta _{2} \sum _{j=1}^{m} y_{j}, \end{aligned} \end{aligned}$$
(24)

and is the Log-likelihood function of the parameter vector (decision variables) \({\varvec{\phi }}=(\alpha _{1},\beta _{1},\alpha _{2},\beta _{2})\) of two independent exponentiated exponential (EE) distributions.
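Equation 24 can be transcribed directly. The sketch below is illustrative (names are ours); it evaluates the Log-likelihood for given samples and a candidate \(\varvec{\phi }=(\alpha_1,\beta_1,\alpha_2,\beta_2)\).

```python
import math

# Direct transcription of the Log-likelihood in Eq. (24) for two independent
# EE samples x and y; phi = (alpha1, beta1, alpha2, beta2).
def ee_loglik(phi, x, y):
    a1, b1, a2, b2 = phi
    ll = len(x) * (math.log(a1) + math.log(b1))
    ll += sum((a1 - 1.0) * math.log(1.0 - math.exp(-b1 * xi)) - b1 * xi for xi in x)
    ll += len(y) * (math.log(a2) + math.log(b2))
    ll += sum((a2 - 1.0) * math.log(1.0 - math.exp(-b2 * yj)) - b2 * yj for yj in y)
    return ll
```

A quick sanity check: with all four parameters equal to 1, each EE reduces to a unit exponential and the Log-likelihood collapses to \(-\sum_i x_i - \sum_j y_j\).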

It is important to note that, even though this problem is only four-dimensional, it remains challenging. The Log-likelihood, \(l(\cdot )\), lacks a number of properties commonly assumed of Log-likelihood functions, properties which support and guarantee the parameter estimates obtained when classical optimization techniques are applied. For instance, the Log-likelihood of the exponentiated exponential distribution is neither a concave, nor even a quasi- or pseudo-concave function; there exists a small region where the gradient of \(l(\cdot )\) vanishes with a negative definite Hessian matrix; elsewhere, frequent changes in the sign of the determinant of the Hessian matrix are observed (see Smith et al. (2015) for a detailed discussion). These observations therefore suggest that extreme care has to be taken should a derivative-based technique be opted for.

In problem 23, \(\mathcal {R}\) denotes the reliability between two EE distributions. Within the reliability framework, the stress-strength model describes the life of a component which has a random strength X and is subjected to a random stress Y. The component is adjudged to fail as soon as the stress surpasses the strength; otherwise (i.e. when \(X>Y\)), the component is considered to function properly. Therefore, \(\mathcal {R}= Pr(Y <X)\) is a suitable measure for quantifying reliability. Applications of this concept can be found in many areas, especially in engineering contexts such as structures, deterioration of rocket motors, static fatigue of ceramic components, fatigue failure of aircraft structures, and the aging of concrete pressure vessels. In our case, given that both the stress, Y, and the strength, X, are described by independent EE distributions, the reliability \(\mathcal {R}\) of the random variables can be written as (see Kundu and Gupta (2005)):

$$\begin{aligned} \mathcal {R} = \beta _{1}\alpha _{1}\int _{0}^{\infty } \exp (-\beta _{1}x) \left( 1- \exp {(-\beta _{1}x)} \right) ^{\alpha _{1}-1}\left( 1- \exp {(-\beta _{2}x)} \right) ^{\alpha _{2}}dx. \end{aligned}$$
(25)

Except in a few exceptional cases, the integral in Eq. 25 is known not to be analytically reducible (see Nadarajah (2011)). In addition, \(\mathcal {R}\) has been shown to be homogeneous of degree 0 in \((\beta _{1},\beta _{2})\), upon the introduction of a change of variable.Footnote 3 This observation suggests that the contours of \(\mathcal {R}\) are constant along the line in the parameter space satisfying \(\beta _{2}=c\beta _{1}\), with c being a real constant.

While noting that the probability density function, \(f_{X}(\cdot )\), of the EE distribution, \(EE(\alpha _{1},\beta _{1})\), can be expressed as:

$$\begin{aligned} f_{X}(x) = \beta _{1}\alpha _{1} \exp (-\beta _{1}x) \left( 1- \exp {(-\beta _{1}x)} \right) ^{\alpha _{1}-1},~~\alpha _{1}>0, \beta _{1}>0, x>0, \end{aligned}$$
(26)

then the integral in Eq. 25, \(\mathcal {R}\), can be written as:

$$\begin{aligned} \mathcal {R} = E\left[ (1-e^{-\beta _{2} X})^{\alpha _{2}}\right] . \end{aligned}$$

We sample directly from the distribution:

$$\begin{aligned} q_{\phi }(v) = \alpha _{1}\beta _{1} \exp (-\beta _{1}v) \left( 1- \exp {(-\beta _{1}v)} \right) ^{\alpha _{1}-1}, \end{aligned}$$

with samples (\(M=10,000\) in Eq. 4) drawn by using:

$$\begin{aligned} V = -\dfrac{1}{\beta _{1}}\log {\left( 1-F^{1/\alpha _{1}}\right) }, ~~F\sim U(0,1), \end{aligned}$$

during the execution of our algorithms.
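Combining the representation \(\mathcal {R} = E[(1-e^{-\beta_2 X})^{\alpha_2}]\) with the inverse-CDF draws above gives a simple Monte Carlo estimator of the reliability. The sketch below is illustrative (function name and defaults are ours; \(M=10{,}000\) follows the text).

```python
import math
import random

# Monte Carlo evaluation of R = E[(1 - e^{-beta2 X})^{alpha2}] in Eq. (25),
# with X ~ EE(alpha1, beta1) drawn by the inverse-CDF transformation
# V = -(1/beta1) log(1 - F^{1/alpha1}), F ~ U(0,1).
def reliability_mc(a1, b1, a2, b2, M=10_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(M):
        F = rng.random()
        v = -math.log(1.0 - F ** (1.0 / a1)) / b1   # V ~ EE(a1, b1)
        total += (1.0 - math.exp(-b2 * v)) ** a2
    return total / M
```

Two properties from the text serve as sanity checks: when the two distributions coincide, \((1-e^{-\beta_1 V})^{\alpha_1} = F_X(V)\) is uniform on (0,1), so \(\mathcal {R} = 1/2\); and \(\mathcal {R}\) is homogeneous of degree 0 in \((\beta_1,\beta_2)\), so rescaling both rates leaves the estimate unchanged.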

Smith and Wong (2017) present the difficulties (i.e. numerical accuracy issues, poor performances when the value of \(\psi _{0}\) is close to the boundary) involved in using the Lagrange multiplier approach to solve the integral-constrained problem 1. They also describe an application of simulated annealing algorithms to integral-constrained problems. Ours differs from theirs in a number of ways. First, while Smith and Wong (2017) optimized the penalty function sequentially, gradually increasing the penalty value until convergence is reached, the decision variables as well as the penalty values are simultaneously optimized within our Monte Carlo styled strategy. Compared to the sequential optimization procedure adopted by Smith and Wong (2017), our strategy has the potential of considerably reducing the computational time required for the optimization exercise. Second, the Smith and Wong (2017) approach is based on optimizing an approximation of the penalty function, while we offer two ways of solving the problem, i.e. either by an approximated problem or by an approximation algorithm. We shall see in the following that, even with a small sample, our simulated annealing algorithms perform efficiently and effectively in estimating the parameters of the integral-constrained likelihood problem.

Finally, we illustrate the application of our algorithms by using the simulated data generated in Smith et al. (2015) reported in Table 1 below.

Table 1 Dataset#1: Two Simulated Datasets of Size 10, each Drawn from Two Exponentiated Exponential (EE) Distributions with \(\psi _{0}=0.1\), such as EE(2, 3) and EE(5, 1.6015)

The data is generated by drawing random samples of size 10 each from the two EE distributions, EE(2, 3) and EE(5, 1.6015), with \(\psi _{0}=0.1\) as the corresponding value on reliabilityFootnote 4, \(\mathcal {R}\).

Given the above set-up, we executed our stochastic optimization algorithms 10 times each on a penalty function representation of the integral-constrained problem 23. Figure 1 displays the convergence towards the optimal Log-likelihood obtained by Smith et al. (2015) for one run of each of our two algorithms. From this figure, there seems to be no difference in the rate of convergence of the algorithms. In contrast with the GIMHSA algorithm, the MCWMSA algorithm requires more execution time, largely because of the need to compute the penalty function twice at each iteration step (see Eq. 5). In Table 2, we report the parameter estimates obtained by Smith et al. (2015) alongside the statistical results (i.e. mean and standard deviation) obtained from 10 independent runs of our algorithms.

Fig. 1
figure 1

A Trajectory of the Constrained Log-likelihood Functions obtained from the use of MCWMSA (left figure) and GIMHSA (right figure) Algorithms for solving the Integral Constrained Problem

The results in Table 2 indicate that on average the GIMHSA algorithm outperforms other methods as it is the algorithm with the lowest constraint violation (i.e. \(\mathcal {R}-0.1\)) estimate.Footnote 5

Table 2 Parameter Estimates for Dataset#1, comparing CSA of Smith et al. (2015), MCWMSA and GIMHSA Algorithms when solving the Integral Constrained Log-likelihood in Eq. 23 using Dataset#1 as reported in Table 1, with \(a=0.95, N=1,000\), \(T_{0}=10\) and \(T^{*}= 10^{-6}\)
Table 3 Reported Characteristics of the Methods in terms of Performances of the Algorithms on the Dataset#1 of Smith et al. (2015)

4.2 Second application: portfolio optimizations in a simple Gaussian world

As a second illustration, we apply our algorithms to a set of portfolio optimization problems with a positivity constraint on the weights, namely the Markowitz (1952) mean-variance portfolio optimization with no short selling.

The Markowitz mean-variance problem with short-selling constraints is taken up in this section to demonstrate the effectiveness of our algorithms. Using this problem as a benchmark, we are able to assess the quality of the solutions of our algorithms, since an optimal portfolio for the Markowitz mean-variance problem can easily be obtained using, for example, Quadratic Programming (QP).

Suppose we have N assets to choose from. Let \(\mathbf{W}^{T}=(w_{1},\dots ,w_{N})\) be the transposed vector of weights, with \(w_{i}\in \mathbb {R}\) being the weight of the i-th asset in the portfolio \(\mathbf{X}^{T}=(x_{1},\dots ,x_{N})\). Also, let \(\mathbf{r}^{T}=(r_{1},\dots ,r_{N})\) be a transposed real random vector with \(r_{i}\), \(i=1,\dots , N\), the return on asset i. Furthermore, we define \(\mathbf{r}\) by the following affine transformation:

$$\begin{aligned} \mathbf{r} = {\varvec{\varSigma }}^{1/2}{\varvec{\epsilon }} + {\varvec{\mu }} , \end{aligned}$$
(27)

where \({\varvec{\varSigma }}\in \mathrm {R}^{N\times N}\) is a non-singular, positive definite scale matrix, \(\varvec{\mu }\in \mathrm {R}^{N}\) a real vector and \(\varvec{\epsilon }\) a real random vector. Under these assumptions, the random variable \(R\in \mathbb {R}\) representing the return on the whole portfolio can be expressed as:

$$\begin{aligned} R = \mathbf{W}^{T}{} \mathbf{{r}}, \end{aligned}$$
(28)

with the corresponding expected value (E[R]) and variance (var(R)) given as:

$$\begin{aligned} \begin{aligned} E[R]&= \mathbf{W}^{T}(\varvec{\mu } + {\varvec{\varSigma }}^{1/2}E[\varvec{\epsilon }] )\\&= \mathbf{W}^{T}\left( \varvec{\mu } + {\varvec{\varSigma }}^{1/2} \int \varvec{\epsilon } f(\varvec{\epsilon }) d\varvec{\epsilon }\right) , \end{aligned} \end{aligned}$$
(29)

and:

$$\begin{aligned} \begin{aligned} var(R)&=\mathbf{W}^{T}({\varvec{\varSigma }}^{1/2})^{T}var(\varvec{\epsilon }) {\varvec{\varSigma }}^{1/2}{} \mathbf{W}\\&= \mathbf{W}^{T}({\varvec{\varSigma }}^{1/2})^{T} \left( \int \varvec{\epsilon }\varvec{\epsilon }^{T} f(\varvec{\epsilon }) d\varvec{\epsilon } - E[\varvec{\epsilon }](E[\varvec{\epsilon }])^T\right) {\varvec{\varSigma }}^{1/2}\mathbf{W}, \end{aligned} \end{aligned}$$
(30)

where \(f(\cdot )\) denotes the probability density function of \(\varvec{\epsilon }\). Under certain distributional assumptions on \(\varvec{\epsilon }\), solutions to the integrals in Eq. (29) and (30) can be obtained in closed-forms.

Subject to the above assumptions and given a desired level of expected portfolio return, \(R_{exp}\), the Markowitz mean-variance portfolio selection problem can be summarized as follows:

$$\begin{aligned} \begin{aligned}&\min _{\mathbf{W}\in \mathbb {R}^{N}}~ \mathbf{W}^{T}var[\mathbf{{r}}]\mathbf{W}\\ {\text {subject to}}~&{\left\{ \begin{array}{ll} \mathbf{W}^{T}E[\mathbf{{r}}] = R_{exp}\\ \sum _{i=1}^{N}w_{i}=1\\ ~ w_{i}\ge 0, i=1,\dots , N, \end{array}\right. } \end{aligned} \end{aligned}$$
(31)

where \(\mathbf{W}^{T} = (w_{1},\dots , w_{N})\).

Given the multivariate normal assumption on \(\varvec{\epsilon }\) in the simple Gaussian world of this second application, we numerically evaluate the portfolio expected return and variance following a Monte Carlo procedure. Let \(\varvec{\epsilon }_{i}\), \(i=1,2,\dots , M\) be a sample of size M drawn from the multivariate normal distribution, \(f(\cdot )\), then the integrals in the definition of portfolio expected returns and variance in Eqs. (29) and (30) are evaluated using the empirical counterparts as follows:

$$\begin{aligned} \int \varvec{\epsilon } f(\varvec{\epsilon }) d\varvec{\epsilon } ={E[\varvec{\epsilon }]} \approxeq \dfrac{1}{M}\sum _{i=1}^{M}\varvec{\epsilon }_{i}, \end{aligned}$$

and:

$$\begin{aligned} \int {\varvec{\epsilon }\varvec{\epsilon }}^{T} f(\varvec{\epsilon }) d\varvec{\epsilon } ={E[\varvec{\epsilon } \varvec{\epsilon }^{T}]} \approxeq \dfrac{1}{M}\sum _{i=1}^{M}\varvec{\epsilon }_{i}\varvec{\epsilon }_{i}^{T}. \end{aligned}$$

The consistency of the Monte Carlo approximation, for \(M\longrightarrow \infty \), is guaranteed by the strong law of large numbers (see Robert and Casella (2013), Sect. 3.3, for further details).
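The Monte Carlo evaluation of Eqs. (29)-(30) can be sketched as follows. To keep the snippet self-contained we assume a diagonal \({\varvec{\varSigma }}^{1/2}\) and i.i.d. standard normal components for \(\varvec{\epsilon }\); both are simplifying assumptions of this sketch, and the function name is ours.

```python
import random

# Illustrative Monte Carlo evaluation of E[R] and var(R): draw epsilon,
# map to returns r = mu + Sigma^{1/2} eps, form R = W^T r, then take
# empirical mean and variance.
def portfolio_moments(w, mu, sigma_sqrt_diag, M=50_000, seed=1):
    rng = random.Random(seed)
    N = len(w)
    samples = []
    for _ in range(M):
        eps = [rng.gauss(0.0, 1.0) for _ in range(N)]
        R = sum(w[i] * (mu[i] + sigma_sqrt_diag[i] * eps[i]) for i in range(N))
        samples.append(R)
    mean = sum(samples) / M
    var = sum((s - mean) ** 2 for s in samples) / (M - 1)
    return mean, var
```

The strong law of large numbers invoked above guarantees that both estimates converge to \(\mathbf{W}^{T}\varvec{\mu }\)-type and \(\mathbf{W}^{T}\varvec{\varSigma }\mathbf{W}\)-type closed forms as \(M\rightarrow \infty \).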

We explore some of the properties of the portfolio optimization problem and propose a slight variation of the sampling procedure described in Section 3.1.3, by limiting our algorithms to the consideration of solutions that strictly satisfy the return and the budget constraints. More precisely, we follow the proposal of Crama and Schyns (2003), who state that, in problem 31, given a portfolio \(\mathbf{W}\), the neighborhood of the current \(\mathbf{W}\) contains all solutions \(\mathbf{W}^{*}\) with the following property: there exist three assets, labeled \(i_{1}\), \(i_{2}\) and \(i_{3}\), without loss of generalityFootnote 6, so that:

$$\begin{aligned} {\left\{ \begin{array}{ll} w_{i_{1}}^{*} = w_{i_{1}} - S,\\ w_{i_{2}}^{*} = w_{i_{2}} + S\alpha ,\\ w_{i_{3}}^{*} = w_{i_{3}} + S(1-\alpha ),\\ w_{i}^{*} = w_{i},~~\forall ~ i\in \{1,\dots , N\}-\{i_{1}, i_{2}, i_{3}\}, \end{array}\right. } \end{aligned}$$
(32)

where S is a Cauchy distributed random variable and \(\alpha \) a tuning parameter set equal to:

$$\begin{aligned} \alpha = \dfrac{r_{i_{1}} - r_{i_{3}}}{r_{i_{2}} - r_{i_{3}}}, \end{aligned}$$

where \(r_{i_{1}}\), \(r_{i_{2}}\) and \(r_{i_{3}}\) correspond to the returns on the three randomly selected assets whose weights are to be modified.
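The three-asset move of Eq. (32) can be sketched as below. This is a hedged illustration, not the authors' code: S is drawn from a Cauchy distribution via the inverse-CDF method (truncated here for numerical safety, an assumption of this sketch), and \(\alpha\) is set so that both the budget and the return constraints are preserved exactly by the move.

```python
import math
import random

# Crama and Schyns (2003)-style neighborhood move: perturb three assets so
# that sum(w) and W^T r are left unchanged.
def neighbor(w, r, scale=0.05, rng=random.Random(2)):
    i1, i2, i3 = rng.sample(range(len(w)), 3)
    if math.isclose(r[i2], r[i3]):          # alpha undefined: keep current point
        return list(w)
    alpha = (r[i1] - r[i3]) / (r[i2] - r[i3])
    S = scale * math.tan(math.pi * rng.uniform(-0.49, 0.49))  # truncated Cauchy
    w_new = list(w)
    w_new[i1] -= S
    w_new[i2] += S * alpha
    w_new[i3] += S * (1.0 - alpha)
    return w_new
```

A short computation confirms the invariance: the change in return is \(S(-r_{i_1} + \alpha r_{i_2} + (1-\alpha) r_{i_3}) = 0\) by the choice of \(\alpha\), and the change in the budget is \(S(-1 + \alpha + 1 - \alpha) = 0\).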

Furthermore, in the presence of a minimum (\({\underline{w}}_{i}\)) and a maximum (\({\overline{w}}_{i}\)) restriction on the proportion of total investment that can be held in asset i (\(i=1,2,\dots ,N\)), a feasible interval of variation for S can be constructed to ensure that a move from the current solution W to \(W^{*}\) satisfies the floor and ceiling restrictions. Supposing the three assets (say, \(i_{1}\), \(i_{2}\) and \(i_{3}\)) to be modified for a move from W to \(W^{*}\) are known, we arrive at the following interval of variation for S:

$$\begin{aligned} {\left\{ \begin{array}{ll} w_{i_{1}} - \overline{w}_{i_{1}}< S< w_{i_{1}} - \underline{w}_{i_{1}}\\ \underline{w}_{i_{2}} - w_{i_{2}}< S\alpha< \overline{w}_{i_{2}} - w_{i_{2}}\\ \underline{w}_{i_{3}} - w_{i_{3}}< S(1-\alpha ) < \overline{w}_{i_{3}} - w_{i_{3}}, \end{array}\right. } \end{aligned}$$
(33)

by combining the ceiling and floor restrictions with Eq. (32).

To empirically assess our algorithms on the mean-variance optimization problem, we next consider three databases (US, Chinese, and French stock markets) covering different economic sectors and used in Costola et al. (2022). The daily prices on the US, French, and Chinese stock markets cover, respectively, the periods 19/3/2008 to 31/12/2019, 6/4/2006 to 4/10/2019 and 18/8/2008 to 31/12/2019. Our algorithm finds the exact optimal risk for all values of the targeted expected returns. The algorithm requires between 58 and 70 seconds per portfolio of between 29 and 40 securities within a standard setting (see Table 4). The efficient frontiers for these instances are plotted in Fig. (2).

The steps for generating the efficient frontiers in Fig. 2 are briefly presented in Algorithm#1 in Appendix C, following Calvo et al. (2012).

Fig. 2
figure 2

Efficient Frontiers in the US, Chinese and French Equity Markets. Note. Are plotted here the Markowitz Mean-variance Efficient frontiers obtained using: Quadratic programming (Quadratic Prog.), Monte Carlo Within Metropolis Simulated Algorithm (MCWMSA), and Group Independent Metropolis Hastings Simulated Annealing (GIMHSA). See Table 4 for the performances of the various algorithms with some precise examples

Fig. 3
figure 3

Differences in Volatilities between GIMHSA/MCWMSA and Quadratic Programming for Each Level of Return and Different Markets

Table 4 Reported Characteristics of the Methods in terms of Differences in Volatility (by Level of Expected Return) versus the QP and Main Reported Performance of the Algorithms for Three Arbitrary Levels of Expected Returns (for each considered market)

The solutions obtained with the two algorithms are very close to the ones resulting from QP for the three stock markets considered, as simple eyeball analyses of Figs. 2 and 3 reveal. The difference in the optimal portfolio variance obtained with QP and SA for a given targeted expected return is of the order of \(10^{-4}\) (see Fig. 3), which is negligible for applications of SA to asset allocation problems.

4.3 Third application: mean-expected shortfall problems in a non-gaussian world

As a third illustration, we implement our algorithms in the case of the Mean-Expected Shortfall problem (as in Casarin and Billio (2007)), where the data are Non-Gaussian and the constraint is a true integral, a case for which few approaches have been developed in the literature to the best of our knowledge. As previously, the return vector \(\mathbf{r}^{T}=(r_{1},\dots ,r_{N})\), with \(r_{i}\), \(i=1,\dots , N\), being the return on asset i, is defined by the affine transformation in Eq. 27 and the corresponding portfolio return, \(R\in \mathbb {R}\), is given in Eq. 28.

Considering the Expected Shortfall of the portfolio, \(ES_{\alpha }(R)\), as the measure of risk, the goal is to maximize E[R] subject to a bound on \(ES_{\alpha }(R)\), among several other constraints. This problem fits the form of Eq. 1, so that, for \(\alpha \in [0,1]\):

$$\begin{aligned} \begin{aligned}&\max _{\mathbf{W}\in \mathbb {R}^{N}}~ \mathbf{W}^{T}E[\mathbf{{r}}]\\ {\text {subject to}}~&{\left\{ \begin{array}{ll} ES_{\alpha }(R)\le \underline{R}\\ \sum _{i=1}^{N}w_{i}=1\\ {w_{i}\ge 0, \forall i=1,\dots , N,} \end{array}\right. } \end{aligned} \end{aligned}$$
(34)

where:

$$\begin{aligned} ES_{\alpha }(R) = \dfrac{1}{\alpha }\int _{0}^{\alpha }VaR_{u}(R)du, \end{aligned}$$

and \(\underline{R}\) a fixed risk threshold, and with:

$$\begin{aligned} \begin{aligned} VaR_{u}(R)&= \min \{x\in \mathrm {R}: Pr(x+R<0)\le u\}\\&= \min \{x\in \mathrm {R}: Pr(-R>x)\le u\}\\&= \min \{x\in \mathrm {R}: Pr(L\le x)\ge 1-u\}\\&= F_{L}^{-1}(1-u), \end{aligned} \end{aligned}$$

where \(L=-R\) denotes the portfolio loss and \(F_{L}(x) = Pr(L\le x)\) its distribution function.

Except for some ad hoc densities, an exact formula for the ES is not available, which often renders the portfolio selection problem a challenging exercise to undertake.

In our case, supposing the vector of returns on the assets satisfies the affine transformation defined by Eq. 27, the Expected Shortfall on the portfolio return becomes:

$$\begin{aligned} \begin{aligned} ES_{\alpha }(R)&= - \mathbf{W}^{T}\varvec{\mu } + \dfrac{1}{\alpha }\int _{0}^{\alpha } F_{-\mathbf{W}^{T}\varSigma ^{1/2}{\varvec{\epsilon }}}^{-1}(1-u)du. \end{aligned} \end{aligned}$$

For illustrative purposes, we now assume that the vector of asset returns is characterized by a class of Multivariate Skew Student-t distributions which accounts for fat tails and skewness, in the more realistic Non-Gaussian world of this third application. More precisely, we assume that the random vector \({\varvec{\epsilon }}^{T}=(\epsilon _{1},\dots ,\epsilon _{N})\) in Eq. 27 follows a Multivariate Skew Student-t distribution, \(Sk\mathcal {T}_{N}(\mathbf{\varvec{\epsilon }};0,I_{N},\nu ,\gamma )\), with \(\nu =\{\nu _{1},\dots ,\nu _{N}\}\) the vector of degrees of freedom and \(\gamma = \{\gamma _{1},\dots , \gamma _{N}\}\) the vector of parameters that control the skewness of the distribution. The components \(\epsilon _{i}\) of \({\varvec{\epsilon }}\), \(i=1,2,\dots ,N\), are independent and Skew Student-t distributed with density:

$$\begin{aligned} g(\epsilon _{i};\nu _{i},\gamma _{i}) = \dfrac{2}{\gamma _{i} + 1/\gamma _{i}}\left[ h_{\nu _{i}}\left( \frac{\epsilon _{i}}{\gamma _{i}} \right) \mathrm {I}_{[0,+\infty )}(\epsilon _{i}) + h_{\nu _{i}}\left( \epsilon _{i}\gamma _{i} \right) \mathrm {I}_{(-\infty ,0]}(\epsilon _{i}) \right] , \end{aligned}$$
(35)

and:

$$\begin{aligned} h_{\nu _{i}}(x) = \dfrac{\varGamma {((\nu _{i} +1)/2)}}{\varGamma {(\nu _{i}/2)}\sqrt{(\pi \nu _{i})} }\left( 1+\dfrac{x^{2}}{\nu _{i}} \right) ^{-(\nu _{i} + 1)/2} \mathrm {I}_{(-\infty ,\infty )}(x)~~{\text {with}}~~\nu _{i}>2, \end{aligned}$$
(36)

the density of a univariate Symmetric Student-t distribution. The mean and variance of \(\epsilon _{i}\) are respectively given as follows:

$$\begin{aligned} E[\epsilon _{i}] = \dfrac{\gamma _{i}^{2}-1}{\gamma _{i}}\dfrac{ \varGamma ((\nu _{i}-1)/2 )}{\varGamma {(\nu _{i}/2 )}}\left( \dfrac{\nu _{i}}{\pi }\right) ^{1/2}, \end{aligned}$$
(37)

and:

$$\begin{aligned} V(\epsilon _{i}) = \dfrac{\gamma _{i}^{4}-\gamma _{i}^{2}+1}{\gamma _{i}^{2}}\left( \dfrac{\nu _{i}}{\nu _{i}-2} -\dfrac{\nu _{i}}{\pi }\left( \dfrac{\varGamma ((\nu _{i}-1)/2 )}{\varGamma {(\nu _{i}/2 )}} \right) ^{2} \right) + \dfrac{\nu _{i}}{\pi }\left( \dfrac{\varGamma ((\nu _{i}-1)/2 )}{\varGamma {(\nu _{i}/2 )}} \right) ^{2}. \end{aligned}$$
(38)
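Equations (37)-(38) translate directly into code, using `math.gamma` for the Gamma function; the snippet below is an illustrative transcription (function names are ours), valid for \(\nu > 2\).

```python
import math

# Mean (Eq. 37) and variance (Eq. 38) of a univariate Skew Student-t variate.
def skewt_mean(nu, gamma):
    c = math.gamma((nu - 1.0) / 2.0) / math.gamma(nu / 2.0)
    return (gamma ** 2 - 1.0) / gamma * c * math.sqrt(nu / math.pi)

def skewt_var(nu, gamma):
    c = math.gamma((nu - 1.0) / 2.0) / math.gamma(nu / 2.0)
    m2 = nu / math.pi * c ** 2
    g2 = gamma ** 2
    return (g2 ** 2 - g2 + 1.0) / g2 * (nu / (nu - 2.0) - m2) + m2
```

For \(\gamma = 1\) the mean vanishes and, substituting \(\gamma^4 - \gamma^2 + 1 = 1\) into Eq. (38), the variance collapses to \(\nu /(\nu -2)\), the familiar Symmetric Student-t variance.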

Subject to the above assumptions, it can easily be shown that \(\mathbf{r}\) follows a Multivariate Skew Student-t distribution, \( Sk\mathcal {T}_{N}(\mathbf{r};\varvec{\mu },\varvec{\varSigma },\nu ,\gamma ) \), with a Variance-CoVariance of returns matrix given by:

$$\begin{aligned} V(\mathbf{r}) = {\varvec{\varSigma }}^{1/2}V({\varvec{\epsilon }}){\varvec{\varSigma }}^{1/2} = {\varvec{\varSigma }}^{1/2}\varvec{\varXi }(\nu ,\gamma ){\varvec{\varSigma }}^{1/2}, \end{aligned}$$
(39)

where \(\varvec{\varXi }(\nu ,\gamma )\) is a diagonal matrix parametrized in \(\nu \) and \(\gamma \), with elements \(\xi _{i}=V(\epsilon _{i})\) on the main diagonal.

To simulate asset returns from a Skew Student-t distribution with \(\nu \) degrees of freedom and skewness \(\gamma \), we use the following stochastic representation. Let Z have a Bernoulli distribution with parameter \(\gamma ^{2}/(\gamma ^{2} + 1)\) and X a Student-t distribution with \(\nu \) degrees of freedom; then we have:

$$\begin{aligned} Y = Z|X|\gamma - (1-Z)|X|\dfrac{1}{\gamma }. \end{aligned}$$
(40)
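A sampling sketch for the representation in Eq. (40) follows. We assume the Bernoulli success probability \(\gamma^{2}/(\gamma^{2}+1)\), the standard choice which makes \(Pr(Y>0)=\gamma^{2}/(1+\gamma^{2})\); the Student-t draw is built as a normal over the square root of a scaled chi-square, and all names are ours.

```python
import math
import random

# One draw of Y = Z|X|gamma - (1-Z)|X|/gamma from Eq. (40).
def skewt_sample(nu, gamma, rng):
    # X ~ Student-t(nu): standard normal / sqrt(ChiSq(nu)/nu), where
    # ChiSq(nu)/nu ~ Gamma(shape=nu/2, scale=2/nu).
    x = rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(nu / 2.0, 2.0 / nu))
    if rng.random() < gamma ** 2 / (gamma ** 2 + 1.0):   # Z = 1: positive branch
        return abs(x) * gamma
    return -abs(x) / gamma                                # Z = 0: negative branch
```

With \(\gamma > 1\) the positive branch is selected more often and stretched, producing right skew; \(\gamma = 1\) recovers the Symmetric Student-t.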

To facilitate the implementation of our optimization strategy, we embed the following Monte Carlo evaluation of ES within our optimization scheme as briefly presented here below.

  1. 1.

    Draw \(\varvec{\epsilon }_{j}^{T}= (\epsilon _{j1},\epsilon _{j2},\dots ,\epsilon _{jN})\sim Sk\mathcal {T}_{N}(\varvec{\epsilon }_{j};0,\mathrm {I}_{N},\nu ,\gamma )\), for \(j=1,\dots ,M\).

  2. 2.

    Compute \({-\mathbf{W}^{T}{\varvec{\varSigma }}^{1/2}{\varvec{\epsilon }}_{j}}\) for \(j=1,2,\dots ,M\).

  3. 3.

    Denote \(L_{j} = -\mathbf{W}^{T}{\varvec{\varSigma }}^{1/2}{\varvec{\epsilon }}_{j}\). Order the sample \((L_{1},\dots , L_{M})\) so that \(L_{1,M}\ge L_{2,M}\ge \dots \ge L_{M,M}\), and note that:

    $$\begin{aligned} \begin{aligned} P(L\le x)&= \int \mathrm {I}_{\{L\le x\}}f_{L}(l)dl = E[\mathrm {I}_{\{L\le x\}}]\\&\approxeq \dfrac{1}{M} \sum _{j=1}^{M}\mathrm {I}_{\{L_{j}\le x\}}, \end{aligned} \end{aligned}$$

    that is:

    $$\begin{aligned} {\hat{P}}(L\le x) = \dfrac{1}{M}\sum _{i=1}^{M}\mathrm {I}_{\{L_{i,M}\le x\}}. \end{aligned}$$
  4. 4.

    Evaluate:

    $$\begin{aligned} ES_{\alpha }(R)= & {} - \mathbf {W}^{T}\varvec{\mu } + \dfrac{1}{\alpha }\int _{0}^{\alpha }\left( \sum _{i=1}^{M-1} L_{i,M} \mathrm {I}_{(\frac{i-1}{M}, \frac{i}{M}]}(u) + L_{M,M} \mathrm {I}_{(\frac{M-1}{M}, 1]}(u)\right) du\nonumber \\= & {} - \mathbf {W}^{T}\varvec{\mu } + \dfrac{1}{M\alpha }\left( \sum _{i=1}^{{\left\lfloor M\alpha \right\rfloor }}L_{i,M} + \left( M\alpha -{\left\lfloor M\alpha \right\rfloor } \right) L_{{\left\lfloor M\alpha \right\rfloor }+1 ,M} \right) , \end{aligned}$$
    (41)

    where \({\left\lfloor x \right\rfloor }\) and \(\mathrm {I}_{(a,b)}\) denote the greatest integer function and the indicator function on the interval (a, b), respectively.
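Steps 3 and 4 above can be condensed into a compact estimator. The sketch below is illustrative (names are ours): it takes the simulated losses \(L_j\) and the term \(\mathbf{W}^{T}\varvec{\mu }\) as inputs and evaluates Eq. (41).

```python
import math

# Monte Carlo Expected Shortfall estimator of Eq. (41): average of the
# floor(M*alpha) largest losses plus the fractional piece of the next
# order statistic, minus the portfolio mean contribution.
def es_alpha(losses, w_mu, alpha):
    L = sorted(losses, reverse=True)       # L_{1,M} >= L_{2,M} >= ... >= L_{M,M}
    M = len(L)
    k = math.floor(M * alpha)
    tail = sum(L[:k])
    if k < M:                              # fractional weight on L_{k+1,M}
        tail += (M * alpha - k) * L[k]
    return -w_mu + tail / (M * alpha)
```

When \(\alpha\) is a multiple of 1/M, this reduces to \(-\mathbf{W}^{T}\varvec{\mu }\) plus the plain average of the \(\lfloor M\alpha \rfloor \) largest losses.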

As a first step in solving the portfolio optimization in Eq. 34, we consider a revised problem by exploiting the convexity property of the Expected Shortfall, i.e.:

$$\begin{aligned} ES_{\alpha }(R) \le \sum _{i}w_{i}ES_{\alpha }(r_{i}), \end{aligned}$$

and solve the following problem:

$$\begin{aligned} \begin{aligned}&\max _{\mathbf {W}\in \mathbb {R}^{N}}~ \mathbf{W}^{T}E[\mathbf {r}]\\ {\text {subject to}}~&{\left\{ \begin{array}{ll} \sum _{i}w_{i}ES_{\alpha }(r_{i})\le {\underline{R}}\\ \sum _{i=1}^{N}w_{i}=1\\ {w_{i}\ge 0, \forall i=1,\dots , N.} \end{array}\right. } \end{aligned} \end{aligned}$$
(42)

A pseudo-code for the Expected Shortfall efficient frontiers in Fig. 4 is given in Algorithm#1 in Appendix C.

In view of the above set-up, we next further illustrate the effectiveness of our SA-type algorithms in two settings. The first exercise studies the effect of heavy tails and skewness of the asset distribution on the ES efficient frontier within a strategic allocation problem. The second shows the effect of model specification errors on the ES efficient frontier.

The first exercise considers a strategic asset allocation problem similar to the one in Casarin and Billio (2007). Strategic allocation decisions refer to long-term investments and usually involve a small number of assets, often represented as the benchmark of different asset classes. Long-term investors can mitigate the adverse effect arising from their exposure to the increasing volatility of the financial markets by imposing a constraint on their portfolio’s downside risk.

Table 5 presents the characteristics of the three asset classes used in Casarin and Billio (2007). Figure 4 displays the Mean-Expected Shortfall Efficient frontiers under different parameter settings of the Skew Student-t distribution.

Table 5 Data Characteristics of Dataset#5 and Application#3 (Long-term Mean-Expected Shortfall Optimizations)
Fig. 4
figure 4

Mean-Expected Shortfall Efficient Frontiers according to various Distributional Assumptions for the Returns for Dataset#5 and Application#3

The mean and scale parameters are the same across frontiers, while the skewness and kurtosis vary across the frontiers. Our numerical results show that, for a given expected return level, the portfolio risk of normally distributed asset returns is lower than that of heavy-tailed asset returns (see the Symmetric Normal and Symmetric Student-t frontiers in Fig. 4). For both normal and heavy-tailed distributions, the presence of (left) asymmetry in the distribution implies a larger risk when compared to the case of a symmetric distribution (see the Skew Normal and Skew Student-t frontiers in Fig. 4).
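The tail effect just described can be checked with a small simulation (our own illustration, not the paper's dataset): with identical mean and scale, heavy-tailed returns carry a larger Expected Shortfall than Gaussian returns.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, M, nu = 0.05, 200_000, 4

r_norm = rng.standard_normal(M)
# Student-t rescaled to unit variance so that only tail thickness differs
r_t = rng.standard_t(nu, M) / np.sqrt(nu / (nu - 2))

def es(returns, alpha):
    """ES_alpha as the average loss beyond the alpha-quantile of returns."""
    q = np.quantile(returns, alpha)
    return -returns[returns <= q].mean()

es_norm, es_t = es(r_norm, alpha), es(r_t, alpha)
# es_t exceeds es_norm: fatter tails increase downside risk at the same scale
```

With \(\nu =4\) degrees of freedom, the Student-t ES at the 5% level is roughly 10% larger than its Gaussian counterpart despite identical first two moments, consistent with the ordering of the frontiers in Fig. 4.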

In the second exercise dedicated to Mean-Expected Shortfall optimization, we consider the asset classes used in the long-term global asset allocation with model risk studied in Boucher et al. (2013). The asset classes are represented by the cash, stock and bond markets (see Footnote 7). The main characteristics in terms of means and variance-covariances are presented in Table 6.

Table 6 Data Characteristics of Dataset#6 and Application#3 (Long-term Mean-Expected Shortfall Optimization)

We first assume the investor uses a misspecified model (a Normal distribution) for the returns, plugged into the ES optimization problem. Then, we evaluate the actual (realized) expected return and risk using the best flexible model (a Skew Student-t distribution) fitted to the real market data over the entire sample. A pseudo-code for the Expected Shortfall efficient frontiers with estimation errors (Fig. 5) is provided in Algorithm#1 in Appendix C.

Finally, we compare the two ex post biased frontiers with the optimal one obtained from Problem (42) with the best flexible model. In Fig. 5, the Expected Return and ES in the Normal case differ from the Skew Student-t ones, since the Normal distribution assumption for the entire sample period is too restrictive and suffers from an estimation bias. As expected, the Symmetric Normal frontier is above the two Skew Student-t frontiers (because of the better fit of asymmetry and tails provided by the Skew Student-t). We shall also note that asymmetry and tail estimates are crucial for a correct evaluation of the frontier, since the differences between the two Skew Student-t frontiers are small, while the differences between the Symmetric Normal and the Skew Student-t ones are large.
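The biased-frontier construction can be sketched as follows (a toy illustration with hypothetical weights and parameters, not the paper's Dataset#6): the weights are chosen under the Normal model, but the realized risk is evaluated under a heavy-tailed return sample.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, M, nu = 0.05, 100_000, 4
w = np.array([0.6, 0.4])       # weights optimized under the Normal model
mu = np.array([0.05, 0.08])    # illustrative means and scales
sigma = np.array([0.10, 0.20])

def portfolio_es(sample, alpha):
    """ES_alpha as the average loss beyond the alpha-quantile."""
    q = np.quantile(sample, alpha)
    return -sample[sample <= q].mean()

# Assumed (Normal) risk vs realized (Student-t) risk of the same weights;
# the t draws are rescaled to unit variance so only the tails differ
r_norm = mu + sigma * rng.standard_normal((M, 2))
r_t = mu + sigma * (rng.standard_t(nu, (M, 2)) / np.sqrt(nu / (nu - 2)))
es_assumed = portfolio_es(r_norm @ w, alpha)
es_realized = portfolio_es(r_t @ w, alpha)
# es_realized > es_assumed: the Normal model understates the true risk
```

The gap between the assumed and realized ES of the same portfolio is exactly what separates the biased from the unbiased frontier in Fig. 5.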

In this case (Dataset#6 and Application#3), as in all our previous exercises (Dataset#1-#5 and Application#1-#3), the proposed MCWMSA and GIMHSA algorithms reach convergence and are able to efficiently provide optimal solutions within a reasonable and realistic computational time for a practitioner.

Fig. 5
figure 5

Mean-Expected Shortfall Efficient Frontiers according to various Estimation Error Assumptions for the Returns for Dataset#6 and Application#3. Note. Three Mean-ES efficient frontiers are plotted here. The first two are obtained by using the incorrect return distribution (Symmetric Normal mean and ES, in dashed black: “Sym. Normal Biased Efficient Frontier”) and the correctly estimated distribution (Skew Student-t mean and ES, in dotted-dashed blue: “Skew Student-t Unbiased Efficient Frontier”). The third one is obtained by using the optimal weights corresponding to the incorrect allocations (Symmetric Normal), together with the mean and ES under the Skew Student-t assumption (solid red line: “Skew Student-t Biased Efficient Frontier”), where the degrees of freedom and skewness parameters are estimated on the data

5 Conclusion

The problem of designing efficient algorithms for optimizing a non-differentiable function subject to an integral constraint whose solution is not available in closed form is an important one, in particular in Operational Research and, more specifically, in Finance. In this article, we propose two new stochastic optimization algorithms, referred to as Monte Carlo within Metropolis Simulated Annealing (MCWMSA) and Group Independent Metropolis Hastings Simulated Annealing (GIMHSA), which both rely on the use of importance sampling. The latter employs Monte Carlo simulations to approximate the analytically intractable integral constrained problem before applying a Simulated Annealing-type algorithm, while the former uses the Monte Carlo strategy to construct an approximate algorithm. We first prove some convergence properties of the two algorithms: we show that the sequence of samples generated by MCWMSA and GIMHSA is an ergodic Markov chain whose limiting distribution is the true target distribution.

We first study the performance of our algorithms by applying them to the problem of maximizing the integral constrained likelihood for the stress-strength reliability with independent exponentiated exponential distributions, as discussed in Smith et al. (2015). Our results confirm the effectiveness of our algorithms in solving the integral constrained problem. GIMHSA gives the best results in the sense that it produces the lowest constraint violations.

As a complement, we secondly provide examples of potential uses in finance. A first simple benchmark shows that the algorithms provide accurate approximations of the true solutions in the case of the classical portfolio optimization problem of Markowitz (1952), in which the integral constraint simply concerns the expected return. We further confirm this basic result by numerically solving the positively constrained Expected Return / Expected Shortfall strategic asset allocation problems (as in Casarin and Billio (2007)). In this case, efficient results are obtained. We also complement this latter result using the dataset of Boucher et al. (2013) for the sake of robustness, and show the impact of some potential estimation errors on the estimated frontiers. Our results finally show that the two proposed algorithms quickly converge to practical optimal solutions, within a decent and operational computational time for risk and asset managers.

We can imagine several extensions of this work. On the algorithmic side, we may think about parallel GIMHSA with and without interactions. Also, the application of the proposed algorithms to mixed-integer programming problems, such as the portfolio selection problem with or without a cardinality constraint (see Woodside-Oriakhi et al. (2011)), will be considered in further studies.

Regarding financial applications, competing with existing numerical approximations for VaR and ES when analytical expressions of the Cumulative Distribution Function are unknown will be a decent challenge (see, for instance, the Skew Normal distribution of Azzalini and Valle (1996)). To complement this, the recently proposed joint approach for estimating VaR and ES (see Wang et al. (2018); Meng and Taylor (2020)) might be a target of interest for further testing our proposed approaches. Finally, tackling the fully constrained problem of portfolio optimization in a dynamic setting would be beneficial for investors (see Costola et al. (2022)).