1 Introduction

Multistage stochastic optimization problems (MSOP) have been actively used in energy planning and finance to make decisions in a dynamic and stochastic setting. For large MSOP, one must strike a compromise between (i) solving the problem at reasonable time and computational cost and (ii) representing the uncertainty in detail. Therefore, one naturally asks how to quantify the effect of the scenarios on an MSOP, so as to decide whether or not the uncertainty representation should be improved, given that solving even one instance of the MSOP in consideration is extremely costly. This paper helps to deal with this question.

While the general MSOP is computationally intractable, the assumption that the stochastic process has stage-wise independent realizations is reasonable in some settings and greatly simplifies solution methods via the technique known as cut-sharing. Important algorithms leveraging this technique are SDDP [1] and CUPPS [2].

Cut-sharing is based on realizing that the value functions representing future costs in an MSOP are the same when the realizations are stage-wise independent. Therefore, there is no need to visit all branches of the tree to improve the current cutting planes representation of the value functions. This is the source of the enormous reduction in the computational cost per iteration made possible by SDDP. CUPPS took the cut-sharing principle further by sharing cuts among realizations in the same stage when there is only right-hand side (RHS) uncertainty. Cut-sharing for some forms of inter-stage dependence of the realizations is shown in [3].

The most used algorithms for multistage stochastic optimization [1, 2, 4, 5] take a fixed tree of scenarios as given. However, especially for high dimensional stochastic and volatile processes, the issue of the statistical significance of the solution found using one tree is important [6, 7]. To approximate and evaluate the quality of candidate solutions of problems with continuous distributions, many techniques based on solving many sampled trees separately (sequential sampling) have been introduced [8,9,10]. As far as we know, none of these techniques leverages cut-sharing across trees to make faster inferences. Instead, they develop statistical theory to make inferences about solutions found separately by traditional algorithms [1, 5]. In contrast, we extract more lower bounding information from the solution of a single tree.

In general sequential sampling, solving many trees from scratch might be unavoidable as a conservative decision to check whether the solution obtained is good enough. This is so because the assumptions of theorems on asymptotic statistics or on the convergence of optimization methods might not hold. This sometimes comes with heavy costs, because SDDP simulations can be quite time consuming and expensive when clusters are used to solve big problems. In this sense, being able to evaluate the sensitivity of general trees with respect at least to the RHS vectors might be helpful because it would concentrate the computationally hard sampling on the other components of the uncertainty. For this purpose, our technique might be helpful. Moreover, the need to solve bigger problems is ever present, and therefore efficiency in the solution of important special cases is always welcome.

Consider an SDDP problem that depends on sampled data. We want to stress that sometimes it is not practical to solve an entirely new SDDP problem for each sample drawn, as required by some sequential sampling methods. On the other hand, some estimation of the statistical significance or sensitivity of the lower bound for a given sampled tree is required. Our method can provide this sensitivity at a minor cost. This is clearly desirable information in practice, since the alternative would be either to solve more SDDP problems or to trust the result obtained with the first tree. In summary, some sensitivity information might be better than more precise information at very high computational costs, or no sensitivity information at all. For instance, in [5, 8] the authors employ a computational cluster to solve many two-stage stochastic problems, also with RHS uncertainty. Using our technique, we can share cuts between instances and perform statistical inference more cheaply.

Although not explored in full detail in this manuscript, computationally cheap upper bound estimates for different trees (other than the initial one) can also be obtained with our developments, as explained in due course. Compared to resolving the new trees from scratch with the traditional SDDP, the cost of the so-called backward pass, which is the most expensive part of the method, is removed. This is possible because the approximations that we provide for the cost-to-go functions are valid for all stages and all trees with the same number of scenarios. Therefore, a sequence of the so-called forward passes can be performed to obtain the desired statistical upper bounds.

In some sense, our computational approach is based on trading the costly exact evaluations of optimal values used in sequential sampling for fast estimates of the lower bounds of these optimal values over more trees. The lower bounds computed might be very loose if the total number of scenarios per stage is not enough to represent the continuous problem. However, if we conclude that these lower bounds do not change much across many other trees, we can also conclude that solving more trees from scratch is useless as a strategy to improve the current confidence intervals, because the first optimal value computed has essentially no sensitivity to the tree being used. In this sense, our contribution is more a tool to state that the true model is reasonably approximated. Nonetheless, the lower bounds can be used as a proxy to build confidence intervals for the true optimal value.

Recently, the topic of sensitivity and duality of multistage stochastic optimization gained some attention [11,12,13] with the development of the dual SDDP algorithm (see also [14, 15]). The motivation of these new developments was to improve the computation of the upper bound in the traditional SDDP method, as also considered in [16]. The method of [12] focuses only on RHS uncertainty, as we do, while [11] handles general stage-wise independent uncertainty. In principle, [12] should have applications in efficient sequential sampling for MSOP. Using the dual SDDP, one can apply our techniques to get fast upper bound estimates on the optimal value of a tree under perturbations of the cost vectors, as was initially pursued using other strategies in [17] for the sensitivity of energy contracts to some prices.

The approach used in [17] and also in [11] is based on using Danskin’s theorem [18] to estimate directional derivatives of the optimal value. However, they obtain only a lower estimate on the directional derivative using the primal optimal solution computed, because Danskin’s theorem requires the knowledge of all primal optimal solutions. Precisely, given a nonempty compact set Z and a directionally differentiable function \(\phi (x, z)\) that is convex in x for all z, we can define the value function

$$\begin{aligned} f(x) = \max \lbrace \phi (x, z) : z \in Z \rbrace \end{aligned}$$

and the associated set of optimal solutions

$$\begin{aligned} Z^*(x) = \lbrace z \in Z : \phi (x, z) = f(x) \rbrace . \end{aligned}$$

Danskin’s theorem states that the directional derivative \(f'(x; d)\) of f in the direction d is given by

$$\begin{aligned} f'(x; d) = \max \lbrace \phi '(x,z; d) : z \in Z^*(x)\rbrace . \end{aligned}$$
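For a minimal illustration of why a single optimal solution yields only a lower estimate, take \(Z = [-1, 1]\) and \(\phi (x, z) = xz\), so that \(f(x) = |x|\). At \(x = 0\) the whole set Z is optimal, and

$$\begin{aligned} f'(0; d) = \max \lbrace zd : z \in [-1, 1] \rbrace = |d|, \end{aligned}$$

while any single optimal solution \(\hat{z} \in Z^*(0)\) yields only the lower estimate \(\hat{z} d \le |d|\).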

This approach is related to ours as we further explain now. Instead of trying to estimate the directional derivative of the optimal value with respect to problem data, we estimate a global lower bound on the optimal value with a max-type function (the cutting planes model). The cutting planes model is convex because it is given by a maximum of finitely many affine functions as in

$$\begin{aligned} m(x) = \max _i \lbrace a_i + b_i^\intercal x : i=1,\dots ,I \rbrace . \end{aligned}$$

As is well known, the directional derivatives of max-type functions can be computed easily by identifying the so-called active set (explained in Sect. 2). Therefore, we can also get an estimate on the directional derivative of the true value function for a specific direction, similarly to [17]. Instead of fixing a direction, we are actually interested in evaluating the global lower bounds given by the cutting planes model, as we expect these lower bounds to show small variance as a sign of good approximation of the true problem.

Moreover, it seems to us that using the individual estimates of the directional derivatives at all directions as in [17] could be suboptimal when there are many possible directions because (i) analyzing more numbers is harder and (ii) the joint effect of the directional derivatives is better understood when they are plugged together into a cutting planes model that gives one single estimate for the optimal value, as opposed to Tables 6 and 7 of [17].

In summary, our method has some similarities with [17], but with more focus on cutting plane models and calculations of guaranteed lower bounds. The cutting planes are used to summarize nonsmooth derivative information, jointly with a strategy to replicate RHS parameters backwards explained opportunely. This strategy makes the lower bound calculations we advertise possible at the expense of an increase in memory usage proportional to the number of stages and scenarios of the discretized SDDP problem. The same idea of free-floating cuts that we employ here to build the algorithm is also used to solve some multilevel and multistage stochastic equilibrium problems [19]. Moreover, our work can also be seen as an extensive use of the so-called cut transportation formulas, which are employed in [20] as a means of reusing already computed cuts.

In Sect. 2 we make some preliminary considerations. In Sect. 3 we present our approach for the two-stage case. In Sect. 4 we extend it to the multistage case. In Sect. 5 we report the numerical experiments supporting our methods. In Sect. 6 we show extensions based on the dual SDDP.

2 Preliminaries

The subdifferential [21] of a finite-valued convex function P at a point z is defined as

$$\begin{aligned} \partial P (z) = \lbrace \gamma : P(z') \ge P(z) + \gamma ^\intercal (z' - z) \quad \forall z' \rbrace . \end{aligned}$$

The directional derivative \(P'(z; d)\) of P at z and a direction d always exists and is such that

$$\begin{aligned} P'(z; d) = \max \lbrace \gamma ^\intercal d : \gamma \in \partial P(z) \rbrace . \end{aligned}$$

We denote by \(\text{conv } D\) the smallest convex set containing D. The connection between the directional derivative of P and the cutting planes model can be made as follows. If we consider a max-type function

$$\begin{aligned} P(z) = \max \lbrace f_j(z) : j=1,\dots ,J\rbrace \end{aligned}$$

where \(f_j\) are smooth convex functions, it is well known that

$$\begin{aligned} \partial P(z) = \text{conv} \lbrace \nabla f_j(z) : f_j(z) = P(z) \text{ for } j=1,\dots ,J \rbrace . \end{aligned}$$
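As a concrete illustration of this formula, the following minimal sketch (with our own toy data; the affine pieces play the role of the \(f_j\)) evaluates a max-type function, identifies the active set numerically, and computes the directional derivative:

```python
import numpy as np

# Max-type function P(z) = max_j { a_j + b_j^T z } built from affine pieces,
# so grad f_j = b_j and the subdifferential at z is the convex hull of the
# rows b_j attaining the maximum (the active set).
a = np.array([0.0, -1.0, 1.0])           # intercepts a_j (toy values)
b = np.array([[1.0, 0.0],                # gradients b_j, one row per piece
              [2.0, 1.0],
              [-1.0, 3.0]])

def P(z):
    return np.max(a + b @ z)

def directional_derivative(z, d, tol=1e-9):
    vals = a + b @ z
    active = vals >= vals.max() - tol    # indices j with f_j(z) = P(z)
    # P'(z; d) = max { gamma^T d : gamma in conv(active gradients) },
    # attained at one of the active gradients themselves.
    return np.max(b[active] @ d)

z, d = np.array([1.0, 1.0]), np.array([0.0, 1.0])
print(P(z), directional_derivative(z, d))
```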

In this section we consider the widely used convex value function given by

$$\begin{aligned} Q(x,h) := \min _{y \ge 0} \lbrace c^\intercal y : Wy = Rh - Bx \rbrace . \end{aligned}$$
(1)

Since Q is finite-valued (by assumption) and the underlying optimization problem is linear, we know there are optimal primal and dual solutions. For a fixed pair \(x = \hat{x}\) and \(h = \hat{h}\) we can compute an optimal Lagrange multiplier \(\hat{\lambda }\) of the equality constraints associated with the optimization problem defining \(Q(\hat{x}, \hat{h})\). Doing so for a fixed \(h = \hat{h}\) yields the widely used formula

$$\begin{aligned} Q(x, \hat{h}) \ge Q(\hat{x}, \hat{h}) - \hat{\lambda }^\intercal B (x - \hat{x}) \quad \forall x. \end{aligned}$$
(2)

Nonetheless, with the same multiplier \(\hat{\lambda }\) already obtained and applying the same principles used to derive (2) we can obtain the more general inequality

$$\begin{aligned} Q(x, h) \ge Q(\hat{x}, \hat{h}) - \hat{\lambda }^\intercal B (x - \hat{x}) + \hat{\lambda }^\intercal R (h - \hat{h}) \quad \forall x, \quad \forall h. \end{aligned}$$
(3)

The derivation of (3) is as follows. Since (2) holds for any linear problem with the same structure, we can change the problem data in (1) so that we obtain (3) as an application of (2). Let us take \(z = (x, h)\), \(\tilde{B} = (B, -R)\) and \(\tilde{R} = 0\). Then, applying (2) to

$$\begin{aligned} \tilde{Q}(w, g) := \min _{y \ge 0} \lbrace c^\intercal y : Wy = \tilde{R} g - \tilde{B}z \rbrace \end{aligned}$$

at the point \(\hat{z} = (\hat{x}, \hat{h})\) and \(\hat{g} = 0\) gives a multiplier \(\hat{\lambda }\) such that

$$\begin{aligned} \tilde{Q}(z, \hat{g}) \ge \tilde{Q}(\hat{z}, \hat{g}) - \hat{\lambda }^\intercal \tilde{B} (z - \hat{z}) \quad \forall z. \end{aligned}$$
(4)

Substituting \(z = (x, h)\) and using \(\tilde{B} = (B, -R)\), we obtain from (4) inequality (3).

Therefore, defining \(Q_{\hat{h}}(x) = Q(x, \hat{h})\), it follows from (2) and (3), respectively, that

$$\begin{aligned} -B^\intercal \hat{\lambda } \in \partial Q_{\hat{h}}( \hat{x}) \quad \text{and} \quad (-B^\intercal \hat{\lambda }, R^\intercal \hat{\lambda }) \in \partial Q(\hat{x}, \hat{h}). \end{aligned}$$
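To make (2) and (3) concrete, here is a minimal numerical sketch with illustrative data. We assume SciPy (1.7 or later), whose HiGHS interface exposes the equality-constraint multipliers as res.eqlin.marginals; their sign convention (derivative of the optimal value with respect to the RHS \(Rh - Bx\)) matches the one used above, but should be checked for other solvers:

```python
import numpy as np
from scipy.optimize import linprog

# Q(x, h) = min { c^T y : W y = R h - B x, y >= 0 }  (illustrative data).
c = np.array([1.0, 2.0])
W = np.array([[1.0, -1.0]])      # y1 - y2 = h - x, feasible for any RHS
B = np.array([[1.0]])
R = np.array([[1.0]])

def solve_Q(x, h):
    res = linprog(c, A_eq=W, b_eq=R @ h - B @ x,
                  bounds=[(0, None)] * len(c), method="highs")
    # eqlin.marginals: multiplier of the equality constraints, i.e. the
    # subgradient of the optimal value with respect to the RHS vector.
    return res.fun, res.eqlin.marginals

x_hat, h_hat = np.array([0.3]), np.array([0.8])
q_hat, lam = solve_Q(x_hat, h_hat)

def cut(x, h):
    # Free-floating cut (3): valid for ALL (x, h), not only for h = h_hat.
    return q_hat - lam @ B @ (x - x_hat) + lam @ R @ (h - h_hat)

x_new, h_new = np.array([0.9]), np.array([0.2])
q_new, _ = solve_Q(x_new, h_new)
print(q_new >= cut(x_new, h_new) - 1e-9)   # True: the cut is a lower bound
```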

Let us assume now that Q(x, h) represents a cost-to-go function [1, 2] used in a cutting planes method (CPM). In this case, x is a decision from the master problem of the CPM and h is a scenario (problem data). The value of h is not decided inside the CPM, only the value of x. Schematically, we consider in this paper a family of nonsmooth convex problems indexed by h and given by

$$\begin{aligned} \text{CPM(}h\text{)}: \quad \min _x \quad Q(x, h) \quad \text{s.t.} \quad x \in D. \end{aligned}$$
(5)

The algorithms proposed consist of solving a base problem CPM(\(\hat{h}\)) and then inferring the optimal values of the other problems CPM(h) at minor computational cost. For any \(h = \tilde{h}\), the cuts actually used for solving or estimating the optimal value of (5) are

$$\begin{aligned} Q(x, \tilde{h}) \ge U(\hat{x}, \hat{h}, \tilde{h}) - \hat{\lambda }^\intercal B (x - \hat{x}) \quad \forall x, \quad \text{where} \quad U(\hat{x}, \hat{h}, \tilde{h}) = Q(\hat{x}, \hat{h}) + \hat{\lambda }^\intercal R (\tilde{h} - \hat{h}). \end{aligned}$$
(6)

The sample average approximation used in this work consists in replacing problem

$$\begin{aligned} \mathcal{Q}^* \quad := \quad \min _x \quad \mathbb{E}_h Q(x, h) \quad \text{s.t.} \quad x \in D \end{aligned}$$
(7)

by problem

$$\begin{aligned} \mathcal{Q}(\hat{h}_1, \dots , \hat{h}_S) \quad := \quad \min _x \quad S^{-1}\sum _s Q(x, \hat{h}_s) \quad \text{s.t.} \quad x \in D. \end{aligned}$$
(8)

Let \(\underline{\mathcal{Q}}\) be the expectation of \(\mathcal{Q}(\hat{h}_1, \dots , \hat{h}_S)\), whenever properly defined. One natural question is whether \(\mathcal{Q}^* = \underline{\mathcal{Q}}\). In general, it only holds that \(\mathcal{Q}^* \ge \underline{\mathcal{Q}}\): evaluating the objective of (8) at a minimizer \(x^*\) of (7) gives \(\mathcal{Q}(\hat{h}_1, \dots , \hat{h}_S) \le S^{-1}\sum _s Q(x^*, \hat{h}_s)\), and taking expectations yields \(\underline{\mathcal{Q}} \le \mathbb{E}_h Q(x^*, h) = \mathcal{Q}^*\). See [22, 23] for details. Moreover, a theory relating \(\mathcal{Q}^*\) and \(\underline{\mathcal{Q}}\) is presented in [24, Chapter 5], more specifically in [24, Theorem 5.7].

This paper provides a technique to compute a lower bound for the true optimal value \(\mathcal{Q}^*\) by giving a lower bound on \(\underline{\mathcal{Q}}\).

While (7) cannot be solved in general, problem (8) can in most cases. However, we are left with the dependency on the sample \((\hat{h}_1, \dots , \hat{h}_S)\), which does not appear in (7). Therefore, we would expect that if (8) indeed approximates (7), then the sensitivity of (8) with respect to \((\hat{h}_1, \dots , \hat{h}_S)\) becomes smaller as S grows big enough. This is indeed verified experimentally in Sect. 5.

Note that (8) resembles (5), which means that with the developments of this paper we can analyze empirically how well (8) approximates (7) by understanding how lower bounds on the optimal value of (8) change when the sample \((\hat{h}_1, \dots , \hat{h}_S)\) changes. The latter entails estimating the optimal values of a family of problems indexed by \(h = (h_1, \dots , h_S)\) as described in (5). This estimation has to be performed at reasonable computational cost and time.

3 Two-stage stochastic problems

The traditional two-stage stochastic problem (2TSP) has S scenarios in the second stage, represented by a sequence \((h_{21}, \dots , h_{2S})\) of RHS vectors. An actual sample of this RHS sequence is denoted by \((\hat{h}_{21}, \dots , \hat{h}_{2S})\). The 2TSP consists in solving

$$\begin{aligned} \min _{x} \quad c_1^\intercal x + S^{-1} \sum _{s=1}^S Q_s(x, \hat{h}_{2s}) \quad \text{ s.t. } \quad x \in D_1 := \lbrace x \ge 0 : W_1 x = h_1 \rbrace \end{aligned}$$
(9)

where for \(s=1,\dots ,S\) we take

$$\begin{aligned} Q_s(x, h_{2s}) := \min _{y_s \ge 0} \lbrace c_2^\intercal y_s : W_2 y_s = h_{2s} - B_2 x \rbrace . \end{aligned}$$

For convenience, it is useful to define the full cost-to-go function given by

$$\begin{aligned} Q(x, h_{21}, \dots , h_{2S}) := S^{-1} \sum _{s=1}^S Q_s(x, h_{2s}). \end{aligned}$$
(10)

The 2TSP with a continuous expectation in the objective is considered in [6, 7]. To solve the continuous problem, one often analyzes (9) for different samples \((\hat{h}_{21}, \dots , \hat{h}_{2S})\) and values of S. Our procedure consists in analyzing, quickly and empirically, how much the optimal value of (9) depends on the sample \((\hat{h}_{21}, \dots , \hat{h}_{2S})\) for a fixed S, because the expectation of the true problem, as shown in (7), is not sample-dependent.

The value function \(Q_s(x, h_{2s})\) is jointly convex with respect to x and \(h_{2s}\). The cut calculation is then performed as usual. More precisely, for subgradients \((\alpha _s, \beta _s) \in \partial Q_s(\hat{x} , \hat{h}_{2s})\), we have that

$$\begin{aligned} Q_s(x , h_{2s}) \ge Q_s\left( \hat{x} , \hat{h}_{2s}\right) + (\alpha _s)^\intercal \left( x - \hat{x}\right) + \beta _s^\intercal \left( h_{2s} - \hat{h}_{2s}\right) \quad \forall x, \forall h_{2s}, \forall h_{2s'} \text{ for } s' \ne s. \end{aligned}$$

The last inequality does not depend on \(h_{2s'}\) for \(s' \ne s\). Therefore, it also holds for all \(h_{2s'}\) with \(s' \ne s\). Then, adding the previous inequalities over all s after multiplying each one by \(S^{-1} > 0\), we obtain for all \(x, h_{21},\dots ,h_{2S}\) that

$$\begin{aligned} \begin{aligned} Q(x , h_{21}, \dots , h_{2S})&= S^{-1} \sum _{s=1}^S Q_s(x , h_{2s}) \\&\ge Q\left( \hat{x} , \hat{h}_{21}, \dots , \hat{h}_{2S}\right) + S^{-1} \sum _{s=1}^S (\alpha _s)^\intercal (x - \hat{x}) + S^{-1} \sum _{s=1}^S \beta _s^\intercal \left( h_{2s} - \hat{h}_{2s}\right) . \\ \end{aligned} \end{aligned}$$
(11)

Note that the last inequality justifies that Algorithm 1 below could use the averaged cut instead of one cutting planes model for each s. The transition from the scenario-wise cutting planes model to the averaged cutting planes model is done by adding the inequalities for each s after weighting them by their probabilities.

In what follows we consider \((c, W, B)\) independent of s. The algorithm for the sensitivity of 2TSPs with respect to the sample \((\hat{h}_{21}, \dots , \hat{h}_{2S})\), as discussed around (7) and (8), is shown below. It takes as input the matrices defining the value functions \(Q_s(x, h_{2s})\), and its output is a vector of estimates of the optimal value of problem (8) for other samples.

Algorithm 1

(Sensitivity of 2TSPs with respect to RHS vectors)

  • Initialization. Set \(k=1\) and take \(\varepsilon > 0\). Take S and a base sample \((\hat{h}_{21}, \dots , \hat{h}_{2S})\).

  • Step 1: Solve Sample Problem using Modified Cuts.

    • Initialization. Take \(x^k \in D_1\).

    • Step 1.1: Get Subgradient. For \(s=1,\dots ,S\) compute \((\alpha _s^k, \beta _s^k) \in \partial Q_s(x^k, \hat{h}_{2s})\) based on (3).

    • Step 1.2: Define the Scenario Model. Take

      $$\begin{aligned} Q_s^k(x, h_{2s}) := \max _{i=1,\dots ,k} \lbrace Q_s(x^i, \hat{h}_{2s}) + (\alpha _s^i)^\intercal (x - x^i) + (\beta _s^i)^\intercal (h_{2s} - \hat{h}_{2s}) \rbrace . \end{aligned}$$
      (12)
    • Step 1.3: Define the Aggregate Model. Take

      $$\begin{aligned} Q^k(x, h_{21}, \dots , h_{2S}) := S^{-1} \sum _{s=1}^S Q_s^k(x, h_{2s}). \end{aligned}$$
    • Step 1.4: Next Iterate. Compute \(x^{k+1} \in \arg \min _x \quad c_1^\intercal x + Q^k(x, \hat{h}_{21}, \dots , \hat{h}_{2S}) \quad \text{s.t.} \quad x \in D_1\).

    • Step 1.5: CPM Stopping Test. Go to Step 2 if

      $$\begin{aligned} Q(x^{k+1}, \hat{h}_{21}, \dots , \hat{h}_{2S}) - Q^k(x^{k+1}, \hat{h}_{21}, \dots , \hat{h}_{2S}) \le \varepsilon . \end{aligned}$$
    • Step 1.6: Loop. Set \(k = k+1\) and go back to Step 1.1.

  • Step 2: Evaluate Fast Lower Bounds. For \(l=1,\dots ,L\) sample \((\hat{h}_{21}^l, \dots , \hat{h}_{2S}^l)\) and compute

    $$\begin{aligned} \hat{v}^l := \min _x \quad c_1^\intercal x + Q^k(x, \hat{h}_{21}^l, \dots , \hat{h}_{2S}^l) \quad \text{s.t.} \quad x \in D_1. \end{aligned}$$
    (13)
  • Step 3: Output. Compute statistics of interest over \(\lbrace \hat{v}^l : l=1,\dots ,L \rbrace\).

Step 2 of Algorithm 1 consists in solving many LPs. By current standards, this is not a concern. Nonetheless, Step 2 can be further simplified by computing one lower bounding cut for the value function of the first stage problem as a function of the scenarios \((h_{21}, \dots , h_{2S})\), so that \(\hat{v}^l\) would be estimated by evaluating an affine expression, which is absolutely inexpensive.

Precisely, the value function of problem (13) is convex with respect to \((h_{21}, \dots , h_{2S})\) and has the same structure as problem (1). Denoting it by \(U^k(h_{21}, \dots , h_{2S})\), it follows that we can calculate subgradients

$$\begin{aligned} (\sigma ^k_1, \dots , \sigma ^k_S) \in \partial U^k(\hat{h}_{21}, \dots , \hat{h}_{2S}) \end{aligned}$$

such that

$$\begin{aligned} U^k(h_{21}, \dots , h_{2S}) \ge U^k\left( \hat{h}_{21}, \dots , \hat{h}_{2S}\right) + \sum _{s=1}^S \left( \sigma _s^k\right) ^\intercal \left( h_{2s} - \hat{h}_{2s}\right) \quad \forall h_{21}, \dots , h_{2S}. \end{aligned}$$
(14)

Therefore, instead of estimating \(\hat{v}^l\) as \(U^k(\hat{h}_{21}^l, \dots , \hat{h}_{2S}^l)\), we estimate it by the right-hand side of (14), which has the cost of evaluating an affine expression. The dependency of \(U^k(\hat{h}_{21}^l, \dots , \hat{h}_{2S}^l)\) on \(h_1\) can also enter (14). We should also remark that making the further simplification explained in (14) induces an additional (downward) bias in the estimates of \(\hat{v}^l\) described in Step 2 of Algorithm 1.

For completeness, let us explain how the subgradients \(\sigma ^k_1, \dots , \sigma ^k_S\) are calculated in practice. As currently stated in Algorithm 1, we employ the multi-cut version to represent the cutting planes models \(Q_s^k\). By writing problem (13) as an LP, we can obtain subgradients \(\sigma ^k_1, \dots , \sigma ^k_S\) associated with \({h}_{21}, \dots , {h}_{2S}\). This has the drawback of making the size of the LP depend on S, which is a feature of the multi-cut approach. The drawback is not very serious, since the multi-cut approach is also used in practice. Moreover, this dependency on S can be removed altogether, since (11) can be used to aggregate the cuts across scenarios, so that the total number of constraints in (13) does not depend on S.

Note that the actual cuts appearing in step 1.4 are equal to the traditional ones based on formula (2), because \(Q^k(x, h_{21}, \dots , h_{2S})\) is evaluated at \((\hat{h}_{21}, \dots , \hat{h}_{2S})\), which makes the free-floating part mentioned in (3) vanish, as we now explain. Because the sample \((\hat{h}_{21}, \dots , \hat{h}_{2S})\) is fixed for all k, we have

$$\begin{aligned} Q^k\left( x, \hat{h}_{21}, \dots , \hat{h}_{2S}\right) = S^{-1} \sum _{s=1}^S Q^k_s\left( x, \hat{h}_{2s}\right) \end{aligned}$$

and by (12) it follows that

$$\begin{aligned} \begin{aligned} Q^k_s\left( x, \hat{h}_{2s}\right)&= \max _{i=1,\dots ,k} \left\{ Q_s\left( x^i, \hat{h}_{2s}\right) + \left( \alpha _s^i\right) ^\intercal \left( x - x^i\right) + \left( \beta _s^i\right) ^\intercal \left( \hat{h}_{2s} - \hat{h}_{2s}\right) \right\} \\&= \max _{i=1,\dots ,k} \left\{ Q_s\left( x^i, \hat{h}_{2s}\right) + \left( \alpha _s^i\right) ^\intercal \left( x - x^i\right) \right\} . \end{aligned} \end{aligned}$$

Therefore, the inner CPM inside step 1 finishes in finitely many iterations under mild conditions [4, 25]. There is at least one modification of Algorithm 1 that one might be tempted to consider: improving the representation of \(Q(x, h_{21}, \dots , h_{2S})\) by using more base samples \((\hat{h}_{21}, \dots , \hat{h}_{2S})\). This would translate into repeating step 1 several times. However, at least in our experience, this does not help much because the statistical properties of \(\lbrace \hat{v}^l : l=1,\dots ,L \rbrace\) hardly change. The free-floating cuts we propose also allow re-sampling the base scenario inside step 1 and performing one forward-backward pass (Step 1.1 to Step 1.6) for each new base scenario. While all these modifications are possible, we find that the algorithm as presented provides better results.

As commented briefly in the introduction, the free-floating cuts require storing many more coefficients. For instance, in the two-stage case we would have to store \((\alpha _s^i, \beta _s^i)\) for all \(s=1,\dots ,S\) and \(i=1,\dots ,k\), which uses much more memory. This might indeed be a problem if only primary memory (RAM) is used. However, as we explain now, it is possible to break free of this limitation by storing cuts in secondary memory (files), which is much cheaper and available in greater quantity. Clearly, there is an overhead associated with using files, but it is very small compared to the current alternative, which is to solve many SDDP problems from scratch.

First, recall that LPs are solved in primary memory. For the sake of simplicity, assume that Algorithm 1 is implemented without any parallelization. In this case, the minimum memory required to run Algorithm 1 is the memory needed to solve the LPs. Now, notice that this amount of memory is not affected by the additional coefficients \(\beta _s^i\), because those influence only the free term of the cut, as can be noted by analogy with (6). In other words, proceeding as in (11) with the aggregate cut, for \(i=1,\dots ,k\) the cuts actually used in (13) are given by

$$\begin{aligned} Q\left( x, \hat{h}_{21}^l, \dots , \hat{h}_{2S}^l\right) \ge U\left( x^i, \hat{h}, \hat{h}^l\right) + {1 \over S} \sum _{s=1}^S \left( \alpha _s^i\right) ^\intercal \left( x - x^i\right) \quad \forall x \end{aligned}$$
(15)

where \(U(x^i, \hat{h}, \hat{h}^l)\) is a number calculated as

$$\begin{aligned} U\left( x^i, \hat{h}, \hat{h}^l\right) = Q\left( x^i, \hat{h}_{21}, \dots , \hat{h}_{2S}\right) + {1 \over S} \sum _{s=1}^S \left( \beta _s^i\right) ^\intercal \left( \hat{h}_{2s}^l - \hat{h}_{2s}\right) . \end{aligned}$$
(16)

Therefore, the coefficients \(\beta _s^i\) do not need to be stored in main memory, because they are not directly used during the solution of the LPs. It is enough to store them in a file as they are obtained during the iterations of Step 1 of Algorithm 1 and to read the respective file to calculate the intercepts of the cuts as needed along the process. Although this is possible, we use only main memory during our experiments, as they are intended only for the illustration of the methods. The same process applies to the multicut formulation, since the terms involving \(\beta _s^i\) for the s-th scenario are summarized into the free term of the i-th cut of the s-th scenario.
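A minimal sketch of this bookkeeping follows (file name, array shapes and stand-in data are our own illustrative choices). The \(\beta_s^i\) produced during Step 1 are written to a file once, and for each new sample only the intercepts (16) are recomputed from it, while the LP keeps the fixed slopes \(\alpha_s^i\):

```python
import numpy as np

S, k, dim_h = 100, 30, 5                 # scenarios, cuts, dimension of h_{2s}
rng = np.random.default_rng(0)

# Stand-ins for quantities produced by Step 1 of Algorithm 1.
beta = rng.normal(size=(k, S, dim_h))    # beta_s^i, one block per cut i
q_base = rng.normal(size=k)              # Q(x^i, h_hat_{21}, ..., h_hat_{2S})
h_base = rng.random(size=(S, dim_h))     # base sample

np.save("betas.npy", beta)               # secondary memory (a file)
del beta                                 # beta no longer occupies RAM

def intercepts(h_new):
    # Intercept update (16) for a new sample h^l, streaming beta from disk:
    # U(x^i, h_hat, h^l) = q_base[i] + (1/S) sum_s (beta_s^i)^T (h^l_s - h_hat_s).
    beta_disk = np.load("betas.npy", mmap_mode="r")  # memory-mapped, not in RAM
    shift = h_new - h_base                           # shape (S, dim_h)
    return q_base + np.einsum("isd,sd->i", beta_disk, shift) / S

h_new = rng.random(size=(S, dim_h))
print(intercepts(h_new))                 # k intercepts; the slopes are unchanged
```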

The purpose of the following example is to illustrate Step 1 of Algorithm 1 and to compare it against the analytical solution, which is available in this case because the problem at hand is simple enough. The purpose is not to show known convergence results for sample average approximations. The example also displays the same features that we observe in the experiments of Sect. 5.

Example 1

Take a real random variable h following a uniform distribution on [0, 1] and define

$$\begin{aligned} Q(x, h) = \min _{y \ge 0} \lbrace y : y \ge x - h, y \ge h - x \rbrace = |x-h|. \end{aligned}$$

By definition, we have \(\mathcal{Q}(x) = \mathbb{E}_h |x - h|\). For any \(x \in [0,1]\), it follows that

$$\begin{aligned} \mathcal{Q}(x) = \int _{0}^x (x - h) dh + \int _x^1 (h - x) dh = x^2 - x + {1 \over 2}. \end{aligned}$$

The solution to the problem

$$\begin{aligned} \min _{x \in [0,1]} \quad \mathcal{Q}(x) \end{aligned}$$

is \(x = 0.5\) and the optimal value is 0.25. Now, let us consider a replication of h as an independent and identically distributed sequence \(\lbrace h_s : s \in \mathbb{N}\rbrace\) such that \(h_s \sim h\). The sample average approximation is given by

$$\begin{aligned} \min _{x \in [0,1]} \quad {1 \over S} \sum _{s=1}^S Q\left( x, \hat{h}_s\right) \quad = \quad {1 \over S} \sum _{s=1}^S |x - \hat{h}_s|. \end{aligned}$$

Without loss of generality, let us assume that \(\hat{h}_s \le \hat{h}_{s+1}\) for all \(s=1,\dots ,S-1\). If S is odd, the median of the sequence \(\lbrace \hat{h}_s \rbrace\) solves the sample average problem. If S is even, the situation is a bit trickier. For instance, if \(S=2\), \(\hat{h}_1 = 0\) and \(\hat{h}_2 = 1\), then any \(x \in [0,1]\) solves the problem. In general, we take the unique s such that the median of \(\lbrace \hat{h}_s \rbrace\) lies inside \([\hat{h}_s, \hat{h}_{s+1}]\); then any \(x \in [\hat{h}_s, \hat{h}_{s+1}]\) solves the sample average problem. Therefore, the optimal x tends to be close to 0.5 when S is big.

Step 1 of Algorithm 1 starts with \(k=1\) and \(x^k=0\). The cut obtained by applying (3) for each scenario s at the first iteration is

$$\begin{aligned} h_s - x = \hat{h}_s + 1^\intercal \left( h_s - \hat{h}_s\right) - 1^\intercal (x - 0) = Q\left( 0, \hat{h}_s\right) + 1^\intercal \left( h_s - \hat{h}_s\right) - 1^\intercal (x - 0) \end{aligned}$$

because the active constraint is \(y \ge \hat{h}_s - x\) with associated multiplier 1. After averaging across s we obtain \({1 \over S} \sum _{s=1}^S h_s - x\). At the second iteration (\(k=2\)) we obtain \(x^k=1\) and for each s the cut is

$$\begin{aligned} x - h_s = 1 - \hat{h}_s - 1^\intercal \left( h_s - \hat{h}_s\right) + 1^\intercal (x - 1) = Q\left( 1, \hat{h}_s\right) - 1^\intercal \left( h_s - \hat{h}_s\right) + 1^\intercal (x - 1) \end{aligned}$$

because the active constraint is \(y \ge x - \hat{h}_s\) with multiplier 1. After averaging across s we obtain the cut \(x - {1 \over S} \sum _{s=1}^S h_s\). After applying the values of \(h_s = \hat{h}_s\), the master problem (Step 1.4) at the third iteration is

$$\begin{aligned} \min _{x, r} \quad r \quad : \quad x \in [0, 1], \quad r \ge {1 \over S} \sum _{s=1}^S \hat{h}_s - x, \quad r \ge x - {1 \over S} \sum _{s=1}^S \hat{h}_s. \end{aligned}$$

Therefore, the third iterate of x is the average of the \(\hat{h}_s\), which is already a good estimate of the optimal solution. However, the estimate of the optimal value is still bad. The next cuts added quickly make the gap (Step 1.5) small; however, they start showing complicated cancellation effects because, contrary to the first two iterations, for some scenarios the constraint \(y \ge x - \hat{h}_s\) is active while for other scenarios the constraint \(y \ge \hat{h}_s - x\) is active.

Finally, computing the histogram of the realizations of the optimal value with Step 2 of Algorithm 1 and \(L=2000\), we obtain Fig. 1. We perform 20 iterations in Step 1. Note that even for \(S=10\) (left-most panel), the average of the realizations is close to the true optimal value 0.25. While we do not explore this behavior mathematically, we also observe in Sect. 5 that the center of the histograms in Fig. 4 is stable across S. We also observe that the histograms collapse to a distribution concentrated at the point 0.25 as S grows, which is the distribution of the true optimal value (a deterministic value). Moreover, by [24, Theorem 5.7] we know that the spread (standard deviation) of the histogram goes to zero at rate \(1\over \sqrt{S}\).
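A self-contained sketch of this example follows (our own implementation of Algorithm 1 with the aggregated cuts of (11); the sample sizes and iteration count are illustrative). Step 1 builds the free-floating cuts for \(Q(x,h) = |x-h|\), and Step 2 re-evaluates the model on new samples by shifting only the intercepts, as in (16):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
S, L, n_iter = 50, 2000, 20
h_base = rng.random(S)                    # base sample of U[0, 1]

def Q(x, h):                              # true recourse value |x - h|
    return np.abs(x - h)

xs, q_vals, alphas, betas = [], [], [], []

def master(h):
    # min_{x in [0,1]} max_i { aggregated cut_i(x, h) } as an LP in (x, r).
    const = [q_vals[i].mean()
             + (betas[i] * (h - h_base)).mean()   # intercept shift, as in (16)
             - alphas[i].mean() * xs[i] for i in range(len(xs))]
    A_ub = [[alphas[i].mean(), -1.0] for i in range(len(xs))]
    res = linprog([0.0, 1.0], A_ub=A_ub, b_ub=[-c0 for c0 in const],
                  bounds=[(0, 1), (None, None)], method="highs")
    return res.x[0], res.fun

x = 0.0                                   # Step 1: build free-floating cuts
for _ in range(n_iter):
    a = np.sign(x - h_base)               # subgradient w.r.t. x per scenario
    xs.append(x); q_vals.append(Q(x, h_base))
    alphas.append(a); betas.append(-a)    # beta_s = -alpha_s for |x - h|
    x, _ = master(h_base)                 # Step 1.4 on the base sample

# Step 2: fast lower bounds for L new samples, with no new cut computations.
v = np.array([master(rng.random(S))[1] for _ in range(L)])
print(v.mean(), v.std())                  # center near 0.25 for large S
```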

Fig. 1 Histogram of the realizations of the optimal values of Step 2 of Algorithm 1 with \(L=2000\)

4 Multistage stochastic problems

In this section we show how to perform the sensitivity analysis of the results of the SDDP algorithm with respect to the RHS vectors. Naturally, the method is the extension of the two-stage one, applied recursively and backwards. In the previous section, the cost-to-go function at \(t=1\) also depends on \((h_{21}, \dots , h_{2S})\). Analogously, for the SDDP, the cost-to-go function at stage t also depends on all the RHS vectors of the forward problems. With S stage-wise independent scenarios per stage and a total of T stages, these vectors can be represented by a sequence denoted by

$$\begin{aligned} h[t] := (h_{\tau s} \quad : \quad \tau =t+1,\dots ,T, \quad s=1,\dots ,S). \end{aligned}$$
(17)

We understand h[t] as an empty tuple if \(t \ge T\). Again, the actual sample of these vectors is denoted by \(\hat{h}[t]\). The first stage feasible set is the same as in the two-stage case, and the feasible sets for \(t = 2, \dots , T\) and \(s=1,\dots ,S\) are given by

$$\begin{aligned} D_{ts}(x_{t-1}, {h}_{ts}) := \lbrace x_{ts} \ge 0 \quad : \quad W_{t} x_{ts} = h_{ts} - B_{t} x_{t-1}\rbrace . \end{aligned}$$

Note that the feasible set \(D_{ts}\) depends also on \({h}_{ts}\). The first stage problem is given by

$$\begin{aligned} \min _{x_1} \quad c_{1}^\intercal x_1 + Q_2(x_1, {h}[1]) \quad \text{s.t.} \quad x_1 \in D_1 \end{aligned}$$

where for \(t = 2, \dots , T\) and \(s=1,\dots ,S\) we have

$$\begin{aligned} Q_t(x_{t-1}, {h}[t-1]) := S^{-1} \sum _{r=1}^S Q_{tr}(x_{t-1}, {h}_{tr}, {h}[t]) \end{aligned}$$
(18)

and

$$\begin{aligned} Q_{ts}(x_{t-1}, {h}_{ts}, {h}[t]) := \min _{x_{ts}} \quad c_{t}^\intercal x_{ts} + Q_{t+1}(x_{ts}, {h}[t]) \quad \text{s.t.} \quad x_{ts} \in D_{ts}(x_{t-1}, {h}_{ts}) \end{aligned}$$

with

$$\begin{aligned} Q_{T+1}(\cdot , \cdot ) \equiv 0. \end{aligned}$$

For instance, note that (18) generalizes (10). As usual, our method is also based on building polyhedral approximations of \(Q_{ts}\), denoted by \(Q_{ts}^k\), where k is the iteration index. The aggregate value function used in the algorithm below is defined for all k by

$$\begin{aligned} Q^k_t(x_{t-1}, {h}[t-1]) := S^{-1} \sum _{r=1}^S Q^k_{tr}(x_{t-1}, {h}_{tr}, {h}[t]). \end{aligned}$$
(19)

Algorithm 2

(Sensitivity of SDDP problems with respect to RHS vectors)

  • Initialization. Take a tolerance \(\varepsilon \ge 0\). Set \(k=1\). Take S and a base sample \(\hat{h}[1]\). Take \(M > 0\) large. Set \(Q_{ts}^k(\cdot , \cdot , \cdot ) \equiv -M\) for all \(t=T,\dots ,2\) and all \(s=1,\dots ,S\). For all \(t=T,\dots ,2\) the functions \(Q_{t}^k(\cdot , \cdot )\) are given by formula (19).

  • Step 1: Solve Sample SDDP Problem using Modified Cuts.

    • Step 1.1: Sample Scenario. Sample a scenario \(\omega _t^k \in \lbrace 1, \dots , S \rbrace \quad \forall t=2,\dots ,T\).

    • Step 1.2: Modified Forward. Compute

      $$\begin{aligned} \hat{x}_{1}^k \in \arg \min _{x_1} \quad c_{1}^\intercal x_1 + Q_{2}^k\left( x_{1}, \hat{h}[1]\right) \quad \text{s.t.} \quad x_1 \in D_1 \end{aligned}$$

      and for \(t = 2, \dots , T\) and \(s= \omega _t^k\) compute

      $$\begin{aligned} \hat{x}_{t}^k \in \arg \min _{x_{ts}} \quad c_{t}^\intercal x_{ts} + Q_{t+1}^k\left( x_{ts}, \hat{h}[t]\right) \quad \text{s.t.} \quad x_{ts} \in D_{ts}\left( \hat{x}_{t-1}^k, \hat{h}_{ts}\right) . \end{aligned}$$
    • Step 1.3: Modified Backward. For all \(t=T,\dots ,2\) and all \(s=1,\dots ,S\) compute

      $$\begin{aligned} x_{ts}^k \in \arg \min _{x_{ts}} \quad c_{t}^\intercal x_{ts} + Q_{t+1}^{k+1}\left( x_{ts}, \hat{h}[t]\right) \quad \text{s.t.} \quad x_{ts} \in D_{ts}\left( \hat{x}_{t-1}^k, \hat{h}_{ts}\right) \end{aligned}$$

      where \(Q_{T+1}^{k+1}(\cdot , \cdot ) \equiv 0\) and for \(t=T-1, \dots , 1\) take \(Q_{t+1}^{k+1}(\cdot , \cdot )\) given by formula (19). For \(t=T-1, \dots , 1\) and \(s=1,\dots ,S\) take \(Q_{t+1, s}^{k+1}(\cdot , \cdot , \cdot )\) as a maximum between \(Q_{t+1, s}^{k}(\cdot , \cdot , \cdot )\) and a free-floating cut of \(Q_{t+1, s}(\cdot , \cdot , \cdot )\) at the point \((\hat{x}_t^k, \hat{h}_{t+1,s}, \hat{h}[t+1])\) as explained in (3). Precisely, for each \(t=T-1, \dots , 1\) and \(s=1,\dots ,S\) calculate subgradients

      $$\begin{aligned} (\alpha _{tsk}, \beta _{t+1,sk}, \gamma _{t+1,sk}) \in \partial Q_{t+1,s}\left( \hat{x}_t^k, \hat{h}_{t+1,s}, \hat{h}[t+1]\right) \end{aligned}$$

      and take \(Q_{t+1, s}^{k+1}(\cdot , \cdot , \cdot )\) as a maximum between \(Q_{t+1, s}^{k}(\cdot , \cdot , \cdot )\) and the affine function

      $$\begin{aligned}&Q_{t+1, s}\left( \hat{x}^k_t, \hat{h}_{t+1,s}, \hat{h}[t+1]\right) + \alpha _{tsk}^\intercal \left( x_{t} - \hat{x}_t^k\right) + \beta _{t+1,sk}^\intercal \left( h_{t+1,s} - \hat{h}_{t+1,s}\right) \nonumber \\&\qquad + \gamma _{t+1,sk}^\intercal \left( h[t+1] - \hat{h}[t+1]\right) . \end{aligned}$$
    • Step 1.4: Lower Bound. Compute

      $$\begin{aligned} x_{1}^k \in \arg \min _{x_1} \quad c_{1}^\intercal x_1 + Q_{2}^{k+1}\left( x_{1}, \hat{h}[1]\right) \quad \text{s.t.} \quad x_1 \in D_1. \end{aligned}$$
      (20)

      Set \(\underline{v}_k\) as the optimal value of (20) and \(\overline{v}_k = \sum _{t=1}^T c_t^\intercal \hat{x}_t^k\).

    • Step 1.5: Stopping Test. Go to Step 2 if the lower bound \(\underline{v}_k\) has stabilized across k or if the average forward cost \(\overline{v}_i\) for \(i=1,\dots ,k\) is close enough to the lower bound \(\underline{v}_k\), as expressed by

      $$\begin{aligned} {1 \over k} \sum _{i=1}^k \overline{v}_i \le \varepsilon + \underline{v}_k. \end{aligned}$$
    • Step 1.6: Loop. Set \(k = k+1\) and go back to Step 1.1.

  • Step 2: Evaluate Fast Lower Bounds. For \(l=1,\dots ,L\) sample a new sequence \(\hat{h}^l[1]\) and compute

    $$\begin{aligned} \hat{v}_1^l := \min _{x_1} \quad c_1^\intercal x_1 + Q_2^k\left( x_1, \hat{h}^l[1]\right) \quad \text{s.t.} \quad x_1 \in D_1. \end{aligned}$$
  • Step 3: Output. Compute statistics of interest over \(\lbrace \hat{v}_1^l : l=1,\dots ,L \rbrace\).

As explained previously for the two-stage case, some variations of Algorithm 2 could have been considered. Among the most obvious possibilities, we could run step 1 more times for more base samples \(\hat{h}[1]\). We could also have proposed to sample trees inside step 1 and run one forward-backward pass (Step 1.1 to Step 1.6) for each new tree. Note that this is all possible within the free-floating cuts approach. Because the number of possible discretizations of the true SDDP problem is far greater than our ability to solve even a minor portion of them, running step 1 more times while re-using the cuts already computed tends not to change the results in step 2. On the other hand, the sampling in step 2 can be quite exhaustive because L can be quite big. The problem with sampling trees inside step 1 is that the stopping test 1.5 would be arbitrary, as opposed to the current approach of solving the base tree sampled at the start and applying commonly used stopping tests for the traditional SDDP.

The convergence proof for the SDDP method inside step 1 is the same as in the traditional case [25] because the polyhedral approximations \(Q_{t+1}^k(x_{ts}, \hat{h}[t])\) are the same. This is so because the free-floating parts of the cuts are evaluated exactly at the base point \(h = \hat{h}\), using formula (3) for the analogy. The contribution of the free-floating part of the cuts appears only in step 2. More precisely, for any \(t \in \lbrace 1, \dots , T-1 \rbrace\), \(s \in \lbrace 1,\dots ,S \rbrace\) and \(i \in \lbrace 1,\dots , k \rbrace\) there are subgradients

$$\begin{aligned} (\alpha _{tsi}, \beta _{t+1,si}, \gamma _{t+1,si}) \in \partial Q_{t+1,s}(\hat{x}_t^i, \hat{h}_{t+1,s}, \hat{h}[t+1]) \end{aligned}$$

such that \(Q_{t+1, s}^{k}(x_t, h_{t+1,s}, h[t+1])\) is a maximum over \(i=1,\dots ,k\) of the affine functions

$$\begin{aligned}&Q_{t+1, s}\left( \hat{x}^i_t, \hat{h}_{t+1,s}, \hat{h}[t+1]\right) + \alpha _{tsi}^\intercal \left( x_{t} - \hat{x}_t^i\right) \nonumber \\&\qquad + \beta _{t+1,si}^\intercal \left( h_{t+1,s} - \hat{h}_{t+1,s}\right) + \gamma _{t+1,si}^\intercal \left( h[t+1] - \hat{h}[t+1]\right) . \end{aligned}$$
(21)

Since Step 1 of Algorithm 2 evaluates (21) at \(h_{t+1,s} = \hat{h}_{t+1,s}\) and \(h[t+1] = \hat{h}[t+1]\), we conclude that Step 1 behaves exactly as the traditional SDDP method. Therefore, the convergence properties indeed follow from [25].

In the multistage case, the issue with memory usage explained in Sect. 3 becomes critical because the memory complexity is quadratic (in the computer science sense). Note that the dimension of the cuts at stage t is about the same as that of h[t]. Because the dimension of h[t] increases as t runs backwards, we get an additive effect, as in an arithmetic progression, that leads to quadratic memory complexity. Precisely, as can be seen in (17), the dimension of h[t] is about \(S(T-t)H\), where H is the dimension of the samples \(\hat{h}_{ts}\). Therefore, the memory complexity for storing the cuts in main memory from time T to time t for one iteration is \(\mathcal{O}( S(T - t)^2 )\). Then, the total memory complexity for one iteration is \(\mathcal{O}( S T^2)\). These estimates assume that the cuts are aggregated across scenarios; if instead they are stored separately for each scenario, we get \(\mathcal{O}(S^2T^2)\) per iteration. This is the reason we use aggregated cuts instead of the multicut formulation.
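For completeness, the arithmetic behind these estimates is the following: summing the dimensions of the aggregated cuts over the stages of one iteration gives

$$\begin{aligned} \sum _{t=2}^{T} \dim h[t-1] \approx \sum _{t=2}^{T} S(T-t+1)H = SH \, {T(T-1) \over 2} = \mathcal{O}(SHT^2), \end{aligned}$$

which is \(\mathcal{O}(ST^2)\) when H is treated as a constant.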

In spite of such quadratic complexities, the algorithm is still useful. First, the other possibilities, based on fully solving an SDDP problem for each new tree, are in practice much more expensive when T or S is big. Second, a strategy based on using secondary memory as in Sect. 3 can also be put in place for multistage problems, for exactly the same reasons. Namely, the additional coefficients multiplying h[t] do not need to be stored in main memory because their influence on any given LP solved by Algorithm 2 reduces to the free term of each cut. The mathematical description is similar to Eqs. (15) and (16), so we leave the details to the reader.

As commented briefly in the introduction, fast upper bound estimates can also be obtained without having to run the backward pass for the new trees, which is the costly part of Algorithm 2. First, recall that

$$\begin{aligned} Q_{t+1}^k(x_{ts}, h[t]) \le Q_{t+1}(x_{ts}, h[t]) \quad \forall x, \quad \forall h[t], \quad \forall k, \quad \forall t=1,\dots ,T-1. \end{aligned}$$
(22)

Therefore, the approximations of the value functions obtained by Algorithm 2 are valid estimates of the cost-to-go function for all stages, trees and iterates. The key for computing fast upper bounds is to reuse the value functions already obtained, as described in Algorithm 3, which is presented separately from Algorithm 2 because the computation of upper bounds is not the focus of the paper; we just want to point out the possibility. Nonetheless, the trees sampled in Step 1.1 of Algorithm 3 should be the same as in Step 2 of Algorithm 2, so that the fast upper bound can be compared with the fast lower bound for a given \(\hat{h}^l[1]\).

Algorithm 3

(Sensitivity of upper bounds of SDDP problems with respect to RHS vectors)

  • Initialization. Take integers \(L > 0\) and \(K > 0\).

  • Step 1: Iterate New Trees. For \(l=1,\dots ,L\)

    • Step 1.1: Tree Sample. Sample a new sequence \(\hat{h}^l[1]\).

    • Step 1.2: Iterate Forward Paths. For \(k=1,\dots ,K\)

      • Step 1.2.1: Forward Sampling. Sample a scenario \(\omega _t^k \in \lbrace 1, \dots , S \rbrace \quad \forall t=2,\dots ,T\).

      • Step 1.2.2: Modified Forward. Compute

        $$\begin{aligned} \hat{x}_{1}^k \in \arg \min _{x_1} \quad c_{1}^\intercal x_1 + Q_{2}^k\left( x_{1}, \hat{h}^l[1]\right) \quad \text{s.t.} \quad x_1 \in D_1 \end{aligned}$$

        and for \(t = 2, \dots , T\) and \(s= \omega _t^k\) compute

        $$\begin{aligned} \hat{x}_{t}^k \in \arg \min _{x_{ts}} \quad c_{t}^\intercal x_{ts} + Q_{t+1}^k(x_{ts}, \hat{h}^l[t]) \quad \text{s.t.} \quad x_{ts} \in D_{ts}\left( \hat{x}_{t-1}^k, \hat{h}^l_{ts}\right) . \end{aligned}$$
      • Step 1.2.3: Forward Path Cost. Set \(\overline{v}^l_k = \sum _{t=1}^T c_t^\intercal \hat{x}_t^k\).

    • Step 1.3: Upper Bound Estimate. Set \(\overline{v}^l = {1 \over K} \sum _{k=1}^K \overline{v}^l_k\).

  • Step 2: Output. Compute statistics of interest over \(\lbrace \overline{v}^l : l=1,\dots ,L \rbrace\).

5 Experiments

All experiments are run on an Intel i7 1.90GHz machine with 15 GB of RAM, Ubuntu 18.04.3 LTS, CPLEX 12.10 and C++. For ease of explanation, the configuration of the experiments, given next, is fixed throughout the section.

The experiments consider the integrated management of a cascade with a total of \(G=3\) hydroplants over \(T=12\) months, as illustrated in Fig. 2. A hydroplant is a reservoir (a triangle in the figure) equipped with a turbine below it (rounded rectangle). The main idea is that the hydroplants are connected so that the water released from a hydroplant to produce energy ends up stored in the reservoirs of the hydroplants below. The natural objective for an agent seeking profit is then to produce energy in the months with high prices. However, production cannot be concentrated only in the month with the highest price, because of the risk of having to waste water when reservoirs become full and because there are lower and upper limits on the volume of water released per day due to environmental reasons.

Fig. 2 The water flows downwards. The volume of each hydroplant g at time t is \(v_{gt}\). The water that is used to produce energy is \(u_{gt}\) and the water that is released without producing energy is \(w_{gt}\)

The water flows from the top of the cascade to the bottom. The hydroplant at the top is identified with \(g=1\), the next with \(g=2\), and so on. Each hydroplant has at most one other hydroplant immediately upstream. The g-th hydroplant receives water from the rain and possibly also the water released from the hydroplant immediately upstream. The hydroplants are used to produce energy that is sold at deterministic prices, for simplicity and because the price does not enter the RHS. The inflows of water from the rain are stochastic. The objective is to maximize the expected profit of selling energy subject to constraints on the amount of water released and withheld. First we state the management problem in the deterministic setting and then explain the differences for the stochastic setting. The variables of the optimization problem are

  • the turbined water \(u_{gt}\) at hydroplant g and time t (water that passes through the turbine to produce energy),

  • the reservoir volume \(v_{gt}\) at hydroplant g and time t (raw volume of the reservoir),

  • the spilled water \(w_{gt}\) at hydroplant g and time t (water released without producing energy in case the reservoir is full).

The constants for each hydroplant are

  • the productivity \(\varvec{\rho }_g\) of hydroplant g (the fraction of water turbined that is transformed into energy),

  • the price \(\varvec{\pi }_t\) of energy at time t (price for selling energy),

  • the inflow of water from rain \(\varvec{a}_{gt}\) at hydroplant g and time t,

  • the max volume \(\overline{\varvec{v}}_g\) of hydroplant g,

  • the dead volume \(\underline{\varvec{v}}_g\) of hydroplant g (volume below which it is not possible to generate energy),

  • the maximum spillage capacity \(\overline{\varvec{w}}_g\) of hydroplant g (capacity for releasing water per month without producing energy in case the reservoir is full),

  • the maximum and minimum outflows for hydroplant g, respectively denoted by \(\overline{\varvec{e}}_g\) and \(\underline{\varvec{e}}_g\) (outflow is the sum of water turbined and spilled).

The units of \(u_{gt}, w_{gt}, {\varvec{a}_{gt}}, \overline{\varvec{w}}_g, \overline{\varvec{e}}_g, \underline{\varvec{e}}_g\) are cubic meters per month, while \(v_{gt}\) is in cubic meters. The deterministic analogue of the problem solved is shown below, where decisions with index \(g - 1 < 1\) are not present.

$$\begin{aligned} \begin{aligned} \underset{}{\max }\,\,&\sum _{t, g} \varvec{\rho }_g \varvec{\pi }_t u_{gt} \\ \text{subject to}\,\,&v_{g,t+1} = v_{gt} - u_{gt} - w_{gt} + u_{g-1,t} + w_{g-1,t} + \varvec{a}_{gt}, \\&u_{gT} + w_{gT} \le v_{gT} + \varvec{a}_{gT}, \\&u_{gt} \le v_{gt} - \underline{\varvec{v}}_g, \\&v_{gt} \in [0, \overline{\varvec{v}}_g], \quad w_{gt} \in [0, \overline{\varvec{w}}_g], \quad u_{gt} \in [0, \overline{\varvec{u}}_g], \quad u_{gt} + w_{gt} \in [\underline{\varvec{e}}_g, \overline{\varvec{e}}_g]. \end{aligned} \end{aligned}$$
(23)

Naturally, problem (23) suffers from the end-of-period effect. However, we use it in the experiments for illustrative purposes. The constraints \(u_{gt} + w_{gt} \in [\underline{\varvec{e}}_g, \overline{\varvec{e}}_g]\) are included for environmental reasons. Nonetheless, \(u_{gt} + w_{gt} \in [\underline{\varvec{e}}_g, \overline{\varvec{e}}_g]\) and \(w_{gt} \in [0, \overline{\varvec{w}}_g]\) are neglected in the experiments below. Therefore, there is no need to report the quantities \(\underline{\varvec{e}}_g, \overline{\varvec{e}}_g\) and \(\overline{\varvec{w}}_g\). The constraint \(u_{gt} \le v_{gt} - \underline{\varvec{v}}_g\) says that only the water exceeding the dead volume can be turbined. The constraint \(u_{gT} + w_{gT} \le v_{gT} + \varvec{a}_{gT}\) just says that at the end the reservoir level has to be non-negative, and it disregards the decisions of the other hydroplants.

We take the efficiencies \(\varvec{\rho }_g = 1\) for \(g=1,2,3\). The dead volumes are 10% of the maximum volumes and the initial volumes are 15% of the maximum volumes. The maximum turbining capacity per month is 50% of the maximum volume. The maximum volumes are \(\varvec{\overline{v}}_1 = 1.6, \varvec{\overline{v}}_2 = 1.0\) and \(\varvec{\overline{v}}_3 = 1.6\). The limits for spillage, max outflow and min outflow are not used. The deterministic profiles for prices and inflows are shown below.

Fig. 3 Note that the prices and inflows are “out-of-phase”: the dry months happen around \(t=6\), when the prices of energy are higher. After adding noises to these profiles, we truncate negative results to zero
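As an illustration of the model, the deterministic problem (23) can be assembled as a single LP. The sketch below uses the parameters above but hypothetical sinusoidal price and inflow profiles standing in for Fig. 3; the spillage and outflow limits are dropped, as in the experiments, and under this indexing the end-of-horizon requirement is enforced by the bound \(v_{gt} \ge 0\) in the last period. It is a minimal model, not our experimental code:

```python
import numpy as np
from scipy.optimize import linprog

G, T = 3, 12
vmax = np.array([1.6, 1.0, 1.6])
vdead, v0, umax = 0.10 * vmax, 0.15 * vmax, 0.50 * vmax
t_ax = np.arange(T)
price = 1.0 - 0.5 * np.cos(2 * np.pi * t_ax / T)            # hypothetical pi_t
inflow = np.maximum(0.2 + 0.2 * np.cos(2 * np.pi * t_ax / T), 0.0)
a = np.tile(inflow, (G, 1))                                 # hypothetical a_gt

n = 3 * G * T                                # variables u, v, w stacked
def I(kind, g, t):                           # kind: 0 = u, 1 = v, 2 = w
    return (kind * G + g) * T + t

cobj = np.zeros(n)                           # maximize sum rho_g pi_t u_gt
for g in range(G):
    cobj[[I(0, g, t) for t in range(T)]] = -price           # rho_g = 1

A_eq, b_eq = [], []                          # water balance dynamics
for g in range(G):
    for t in range(T):
        row = np.zeros(n)
        row[I(1, g, t)] = 1.0                                # v_{g,t}
        row[I(0, g, t)] = row[I(2, g, t)] = 1.0              # + u + w
        if t > 0:
            row[I(1, g, t - 1)] = -1.0                       # - v_{g,t-1}
        if g > 0:                            # releases from the upstream plant
            row[I(0, g - 1, t)] = row[I(2, g - 1, t)] = -1.0
        A_eq.append(row)
        b_eq.append(a[g, t] + (v0[g] if t == 0 else 0.0))

A_ub, b_ub = [], []                          # u_{gt} <= v_{gt} - vdead_g
for g in range(G):
    for t in range(T):
        row = np.zeros(n)
        row[I(0, g, t)], row[I(1, g, t)] = 1.0, -1.0
        A_ub.append(row); b_ub.append(-vdead[g])

bounds = ([(0, umax[g]) for g in range(G) for t in range(T)]
          + [(0, vmax[g]) for g in range(G) for t in range(T)]
          + [(0, None)] * (G * T))
res = linprog(cobj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
print(-res.fun)                              # optimal deterministic profit
```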

Let us now explain the differences in the stochastic setting. The inflows \(\varvec{a}_{gt}\) are changed to \(\varvec{a}_{gts}\) for \(s=1,\dots ,S\) because they are stochastic and assumed to be stage-wise independent. Still inside the SDDP framework, the inflows could also be modeled as an auto-regressive process (such as an ARIMA) where the noise is stage-wise independent. The stage-wise and scenario-wise decomposition of problem (23) is given below.

$$\begin{aligned} Q_{ts}(\varvec{v}_{g,t-1}, \varvec{a}_{gts}, \varvec{a}[t]) = \left\{ \begin{aligned} \underset{}{\max }\,\,&\sum _{g=1}^G \varvec{\rho }_g \varvec{\pi }_t u_{gt} + {1 \over S} \sum _{r=1}^S Q_{t+1,r}( v_{gt}, \varvec{a}_{g,t+1,r}, \varvec{a}[t+1] ) \\ \text{subject to}\,\,&v_{g,t} = \varvec{v}_{g,t-1} - u_{gt} - w_{gt} + u_{g-1,t} + w_{g-1,t} + \varvec{a}_{gts}, \\&u_{gT} + w_{gT} \le v_{gT} + \varvec{a}_{gT}, \\&u_{gt} \le v_{gt} - \underline{\varvec{v}}_g, \\&v_{gt} \in [0, \overline{\varvec{v}}_g], \quad w_{gt} \in [0, \overline{\varvec{w}}_g], \quad u_{gt} \in [0, \overline{\varvec{u}}_g], \quad u_{gt} + w_{gt} \in [\underline{\varvec{e}}_g, \overline{\varvec{e}}_g]. \end{aligned} \right. \end{aligned}$$

Indeed, the ARIMA model mentioned above is one of the motivations of this work, because its noise follows a continuous distribution, in which case the sensitivities are important. For instance, the tendency is that we would need more scenarios per stage to get a good representation of the uncertainty if G is big. Nonetheless, the resulting problem could be computationally too hard, and then we would like at least to be aware of the sensitivity of the cutting planes model to the sample of the inflows. Instead of fitting an ARIMA model, we run the experiments taking a deterministic profile and then adding different types of noises around it. We expect that distributions with heavier tails need more scenarios to be represented properly.

Now, all the information about the problem has been given, except for the type of noise around the profiles in Fig. 3 and the total number of scenarios used in the discretization. These last two items are the only ones that vary in the experiments that follow, which are organized into parts, each one illustrating an important aspect or application of the RHS sensitivity of MSOP. In particular, we want to

  • illustrate numerically the lower bounding property for the approximations of value functions, and

  • measure how the discretized SDDP problem relates to the true problem.

The motivation for the first item is that it is the fundamental building block of the entire paper; therefore, it is a worthwhile validation to report. The second item includes two applications, as we describe now. The probability distribution of the scenarios of the true SDDP problem can be continuous or discrete. In the continuous case, the scenarios can follow a multivariate normal distribution, for instance. In the discrete case, the scenarios can be finitely many, but way too many to put into a computer. In both cases, the discretization of the true uncertainty comes into play strongly.

It turns out that the discretization of an already discrete distribution is known as scenario reduction. We further split the scenario reduction problem into (i) the selection of the discretization itself and (ii) the evaluation of the discretization. Scenario reduction algorithms usually solve problem (i), which is not our objective. We want to solve problem (ii), given that the discretization has already been made. To achieve this objective, we can take any discretization, since it does not matter which one. Our point is to show that the computation of sensitivities is possible and effective, not to obtain the discretization with the least sensitivity.

The objectives of the experiments can be further detailed as:

  • Validate numerically the lower bounding property of the approximations of the value functions when iterates, trees or stages are changed, as claimed in Eq. (22). These checks are reported in Sect. 5.1. Column LBSTEP2 reports lower bounds computed as in Step 2 of Algorithm 2, compared against an independent lower bound using the traditional SDDP method (column LBSDDP). Moreover, the lower bounds from reusing all previous value function approximations obtained during the experiments of Sect. 5.1 serve to illustrate Eq. (22) and are reported in the columns LBSDDP* and UBSDDP*. The objective of this subsection is not to prove that the LBSTEP2 bounds are weaker than those obtained independently (LBSDDP, UBSDDP), all of which are weaker than those reusing cuts (LBSDDP*, UBSDDP*). Instead, we use only \(S=10\) or \(S=20\) scenarios there (a small number compared to the other sections) because each independent SDDP run used to validate LBSTEP2 already takes around 10 minutes, which confirms that a fast sensitivity technique is helpful. Moreover, as is clear from the other subsections, \(S=20\) is not enough to make the sample average problem come close to the true problem, which implies that a bigger sensitivity is justified and therefore a weaker fast lower bound is expected.

  • Validate numerically that when we discretize an already discrete distribution by selecting a subset of scenarios, the sensitivities of the optimal values converge to zero as the discretized distribution converges to the original one. This test is shown in Sect. 5.2 in Tables 5 and 6 as S approaches \(\overline{S}\). It validates that the sensitivities reproduce our statistical intuition. This test is related to scenario reduction techniques; however, our objective is only to evaluate a possible scenario reduction method. When the original distribution is discrete, convergence of the discretized distribution to the original one can be achieved exactly. This is in contrast to the tests performed in Sect. 5.3, where the original distribution is continuous and can only be approximated.

  • Validate numerically that, even for a continuous distribution with unbounded support, the sensitivities obtained by Algorithm 2 decrease as S grows. These tests are shown in Sect. 5.3. Comparing Tables 7 and 8 to the results of Sect. 5.2 suggests that sensitivities for continuous distributions still get small, but “convergence” is harder to achieve. We note that the histograms shown in Fig. 4 are qualitatively similar to the ones in Fig. 1.

5.1 Validity of lower bounds across trees

In this subsection we illustrate that the lower bounds we obtain are indeed valid. This is done by comparing the lower bounds of Step 2 of Algorithm 2 with the lower and upper bounds obtained by solving the new trees separately with the usual SDDP. For this purpose, the distributions of the noises applied to Fig. 3 do not matter: as is clear from Algorithms 1 and 2, the lower bound is valid for all trees. The distribution only influences the rate of convergence of the sample average to the continuous problem. We also report in the LBSDDP* and UBSDDP* columns the solution of the new trees reusing previous free-floating cuts. The results are shown in Tables 1 and 2, and the corresponding running times in Tables 3 and 4, respectively.

To make the acronyms easier to remember, we recall the three variants considered in this section:

  • Results for Algorithm 2 are shown in column LBSTEP2 (Tables 1 and 2) and in column STEP2 (Tables 3 and 4). The column LBSTEP2 reports the lower bounds obtained with Algorithm 2 and column STEP2 reports the respective running times.

  • Results for the traditional SDDP algorithm are reported in columns LBSDDP and UBSDDP (Tables 1 and 2), with the respective running times in column “SDDP” (Tables 3 and 4). Column LBSDDP reports the lower bound and UBSDDP the upper bound.

  • The SDDP algorithm cannot reuse the approximation of the cost-to-go functions when the tree changes. The “star” variant reuses the cuts calculated when solving the previous trees, via application of the transportation formulas developed in this paper, so that the cuts are valid for the new trees. Columns LBSDDP* and UBSDDP* (Tables 1 and 2) report the lower and upper bounds, and column SDDP* (Tables 3 and 4) the running times. The “star” variant performs the same number of forward–backward passes as the traditional SDDP for any given tree.

Notice that the value functions in Algorithm 2 are initialized with a negative constant; this is the initialization used for the base tree. For the index \(l \ge 1\) below, the value functions for columns LBSDDP* and UBSDDP* are initialized with the value functions from experiment \(l-1\) instead of that negative constant. The values in columns LBSDDP and UBSDDP, on the other hand, are obtained by solving the problem from scratch, that is, initializing the value functions with the negative constant.
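
To fix ideas, the bookkeeping behind these two initializations can be organized as in the following minimal sketch. This is not the implementation used in the experiments; the names Cut, ValueFunctionModel and init_value_functions are ours, and the transport of the cuts to the new tree is omitted.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cut:
    # free-floating cut: V(x, h) >= value + slope_x @ (x - x_hat) + slope_h @ (h - h_hat)
    value: float
    slope_x: list
    x_hat: list
    slope_h: list
    h_hat: list

@dataclass
class ValueFunctionModel:
    lower_const: float            # the negative constant used for the base tree
    cuts: List[Cut] = field(default_factory=list)

def init_value_functions(T, lower_const, previous=None):
    """Cold start (base tree) or warm start ('star' variant, l >= 1)."""
    if previous is None:
        # solve from scratch: every stage starts at the trivial constant bound
        return [ValueFunctionModel(lower_const) for _ in range(T)]
    # 'star' variant: keep all cuts from experiment l-1; transporting them to
    # the new tree (the paper's transportation formulas) is omitted here
    return [ValueFunctionModel(lower_const, list(m.cuts)) for m in previous]
```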

For Table 1, we use \(S=10\) scenarios per stage. The scenarios are generated by adding a uniform noise with range \([0,\eta ]\) around the inflow profile, where \(\eta\) is the average inflow of the respective hydroplant. For each tree we perform 70 forward–backward iterations (Step 1 of Algorithm 2). First we solve the base tree, finding a lower bound of −724.37 and an upper bound of −706.58. The SDDP* columns show smaller gaps at the expense of longer running times, because the subproblems grow with the cuts accumulated from smaller l. The cuts from the solution of the base tree are passed to the \(l=1\) tree, the resulting cuts to \(l=2\), and so on; at \(l=5\) all the previous cuts are being reused.

Generally speaking, we observe in both Tables 1 and 2 that LBSTEP2 is the weakest bound, followed by LBSDDP and then LBSDDP*. However, despite LBSTEP2 being weaker in absolute terms, its difference to the other bounds is less than 2% in all tests, while the gap obtained by the independent SDDP run (UBSDDP − LBSDDP) is greater than 2%. This is a satisfactory result, since the fast lower bounds are obtained at much lower computational cost.
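
To make the Step 2 evaluation concrete, the sketch below computes a fast lower bound for a new tree by shifting the intercept of each transported free-floating cut to the new RHS data and solving a single first-stage LP. It only mimics the paper's transport formula (3), assuming cuts stored as in the previous snippet and at least one cut available; it is not the authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def fast_lower_bound(c1, W1, h1, cuts, h_new):
    """One small LP per new tree: min c1 @ x + theta subject to W1 x = h1,
    x >= 0 and theta >= intercept_k + slope_k @ x for every transported cut k,
    whose intercept is shifted by slope_h @ (h_new - h_hat)."""
    n = len(c1)
    c = np.append(c1, 1.0)                    # variables (x, theta), theta free
    A_ub, b_ub = [], []
    for cut in cuts:                          # assumes cuts is nonempty
        shift = np.dot(cut.slope_h, np.asarray(h_new) - np.asarray(cut.h_hat))
        intercept = cut.value + shift - np.dot(cut.slope_x, cut.x_hat)
        # theta >= intercept + slope_x @ x  <=>  slope_x @ x - theta <= -intercept
        A_ub.append(np.append(cut.slope_x, -1.0))
        b_ub.append(-intercept)
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.hstack([W1, np.zeros((W1.shape[0], 1))]), b_eq=h1,
                  bounds=bounds, method="highs")
    return res.fun                            # a valid lower bound for the new tree
```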

Table 1 Here we use \(S=10\)
Table 2 Here we use \(S=20\) and the same setting of Table 1

Regarding the computational costs themselves, the individual SDDP runs shown in columns LBSDDP and UBSDDP of Tables 1 and 2 take on average around \(M = 10\) minutes. Therefore, performing sequential sampling by solving each SDDP problem from scratch would take ML minutes. The columns LBSDDP* and UBSDDP* are obtained by running SDDP for a fixed number of iterations, in this case 70, which is why they take a similar amount of time to columns LBSDDP and UBSDDP. Note, however, that because columns LBSDDP* and UBSDDP* use all previous approximations of the future costs, their running times increase consistently as a function of l.

On the other hand, Step 2 of Algorithm 2 takes a negligible amount of time, call it \(\epsilon\), for each linear problem, so Step 2 takes \(\epsilon L\) minutes in total. Therefore, the ratio between the running time of our algorithm and that of traditional sequential sampling solving each SDDP from scratch is

$$\begin{aligned} {{M + \epsilon L} \over {ML}}. \end{aligned}$$

Now, note that the solution time M for one SDDP grows with the number of scenarios and stages (S and T, respectively), while \(\epsilon\) is fixed for varying values of S and T. Therefore, for big L, S and T we have

$$\begin{aligned} {{M + \epsilon L} \over {ML}} \approx {\epsilon \over M} \approx 0. \end{aligned}$$

Therefore, the computational effort to calculate the lower bounds using the proposed approach is negligible relative to calculating them by re-solving the SDDP problems from scratch.
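
Plugging in illustrative magnitudes (M ≈ 10 minutes as reported above, L = 400 as in Sect. 5.2, and a hypothetical \(\epsilon = 0.01\) minutes) shows the scale of the saving:

```python
M, L, eps = 10.0, 400, 0.01   # minutes per SDDP, number of trees, minutes per Step-2 LP
ratio = (M + eps * L) / (M * L)
print(ratio)                  # 0.0035: Step 2 costs about 0.35% of re-solving from scratch
```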

Finally, the actual running times for Tables 1 and 2 are shown respectively in Tables 3 and 4. Notice that running times in Table 4 are roughly twice the ones in Table 3, due to the difference in the number of scenarios.

Note that column SDDP* in Tables 3 and 4 reports larger running times because (i) SDDP* also uses all the cuts calculated for column SDDP, which makes the subproblems more time consuming, and (ii) SDDP* performs the same number of forward–backward passes as column SDDP. Hence SDDP* takes more time simply because its subproblems take more time.

Table 3 Running times (in min) for experiments shown in Table 1
Table 4 Running times (in min) for experiments shown in Table 2

5.2 Evaluation of scenario reduction techniques

In the context of scenario reduction [26, 27] we are given a large number \(\overline{S}\) of scenarios and want to find a subset of size S that represents the problem with \(\overline{S}\) scenarios well. The motivating assumption is that solving the problem with \(\overline{S}\) scenarios is too hard or impossible. The techniques proposed in [26, 27] are based on heuristics, because selecting the subset of size S cannot take longer than solving the problem of size \(\overline{S}\). Our technique of cut-sharing across trees can be used to evaluate a subset of size S once it is selected and solved. We are not proposing a scenario reduction technique; rather, we propose a fast way to evaluate a scenario reduction algorithm.

Precisely, we can (i) use a scenario reduction technique to select a subset of size S out of the \(\overline{S}\) scenarios and (ii) sample, without replacement, other sets of S scenarios from the \(\overline{S}\) and evaluate a fast lower bound. The sampling has to be without replacement so that sampling \(S = \overline{S}\) scenarios recovers exactly the original set. If the set of S scenarios is well selected by the scenario reduction method, the cutting planes model will not show significant sensitivity to other samples of the same size. This subsection tests this idea, but performs the scenario reduction by selecting a subset of size S at random. The results are shown in Tables 5 and 6.
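
The out-of-sample loop of item (ii) can be sketched as follows; the layout of scenarios_bar and the function fast_lb are hypothetical stand-ins for the data structures and the Step 2 bound of Algorithm 2.

```python
import numpy as np

def resample_lower_bounds(scenarios_bar, S, L, fast_lb, seed=0):
    """Draw L subsets of S scenarios without replacement from the S_bar
    available ones and score each subtree with the fast lower bound of the
    base-tree cutting plane model (Step 2 of Algorithm 2).
    Assumed layout: scenarios_bar has shape (S_bar, T, dim)."""
    rng = np.random.default_rng(seed)
    S_bar = scenarios_bar.shape[0]
    values = np.empty(L)
    for l in range(L):
        idx = rng.choice(S_bar, size=S, replace=False)  # S = S_bar recovers the tree
        values[l] = fast_lb(scenarios_bar[idx])
    return values

# statistics of the kind reported in Tables 5 and 6:
# v = resample_lower_bounds(...); v.mean(), v.std(), np.abs(v - v.mean()).max()
```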

Clearly, if a scenario reduction technique is already known to work well, there is no need to evaluate it further. However, declaring that a scenario reduction method works well would, in general, involve calculating the optimal value of the underlying optimization problem for other sets of scenarios and comparing the results against the selected set. This is definitely not practical in the context of multistage stochastic optimization if one needs to solve the multistage problems from scratch. We need instead fast ways to estimate the optimal values for those other sets of scenarios, which is the proposal of this section.

Our algorithm can also be used to find a number of scenarios that yields small variations of the optimal values when new trees of the same size are resampled. In other words, it can be used to find the cardinality of the tree such that the optimal value estimates for trees of that size are very close to each other, as sketched below.
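
Building on the previous sketch, this search could be automated as follows (again a sketch with hypothetical names, not a prescription):

```python
def smallest_stable_S(sizes, L, resample, tol=0.01):
    """Return the first tree size S whose resampled fast lower bounds have a
    relative standard deviation below tol; 'resample' is assumed to wrap
    resample_lower_bounds for a fixed base tree."""
    for S in sizes:
        v = resample(S, L)
        if v.std() / abs(v.mean()) < tol:
            return S
    return None                 # no tested size was stable enough
```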

For Table 5, the scenarios are generated by adding a uniform noise with range \([0,\eta ]\) around the inflow profile, where \(\eta\) is the average inflow of the respective hydroplant. We take \(\overline{S} = 70\) and draw a base tree with S scenarios per stage at random. The first and second columns report the lower and upper bounds for the base tree after 70 forward–backward iterations. We then perform an out-of-sample analysis of the resulting cutting planes model taking \(L=400\) in Step 2 of Algorithm 2; the new groups of S scenarios inside Step 2 are sampled without replacement. We observe that the closer S is to \(\overline{S}\), the more representative the results of the base tree are. Note that the expectation of \(\hat{v}_1^l\) is within 3% of the lower bound for the base tree, while the gaps for the base tree (UB−LB) are also around 3%.

For Table 6, we use the same settings as Table 5, but with a uniform noise with 40% of the original range. Note that for \(S=\overline{S}\) the out-of-sample test is equivalent to randomly reordering the scenarios, which makes numerical errors from the multiplier \(\hat{\lambda }\) in (3) more apparent. This is why, for \(S=70\), the lower bound differs from the expectation of \(\hat{v}_1^l\). Notice that for \(S=10\) the standard deviation of \(\hat{v}_1^l\) is already around 2%, showing that adding many more scenarios does not change the results much.

Selecting a subset of size S randomly helps to verify the effectiveness of Algorithm 2 independently of any scenario reduction algorithm. Tables 5 and 6 show exactly the behavior we would expect from pure statistical reasoning. For instance, out of the \(\overline{S}=70\) scenarios per stage, we expect that half already contain most of the overall statistical information. Moreover, the closer S is to \(\overline{S}\), the less sensitivity we expect the optimal value to show with respect to the scenarios left out by the scenario reduction method, which is indeed observed in the standard deviation of \(\hat{v}_1^l\) and in its maximum deviation from the mean. Therefore, the results of Algorithm 2 are consistent with overall statistical intuition.

Table 5 The average of \(\hat{v}_1^l\) across S is around 740.85, with standard deviation less than 1%, in spite of possibly big variations of \(\hat{v}_1^l\) for each S
Table 6 Note that the average of \(\hat{v}_1^l\) is stable across S and reports the true lower bound obtained with \(S = \overline{S}\)

5.3 Continuous distributions with unbounded supports

The goal in this section is to find the number S of sampled scenarios that suffices to yield little variation in the optimal value, which is harder in this case because one starts from a distribution with continuous unbounded support.

We used noises with compact support in Sects. 5.1 and 5.2. In this subsection we use our technique to evaluate the effect of sampling from continuous distributions with unbounded supports or heavy tails on the sensitivity of the cutting plane models.

For our particular application, the occurrence of extreme or mildly extreme values corresponds to either a lot of rain flooding the reservoirs or no rain at all. For some other applications, an extreme event is ultimately also characterized by high costs; in other words, one would need to know the cost associated with the realizations to decide whether they are extreme, because the joint effects of the random variables could change the judgment. From an algorithmic perspective, our technique might be helpful because one could (i) solve a base tree, (ii) get cost estimates for many trees, and (iii) select the trees with the worst costs as extreme realizations, possibly improving the cost estimates for the selected trees and iterating back to (ii), as sketched below.
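
A sketch of this (i)–(iii) loop, with all names hypothetical, could read:

```python
def screen_extreme_trees(fast_lb, sample_tree, n_trees=2000, k_worst=20):
    """Steps (ii)-(iii): cheaply score many sampled trees with the fast lower
    bound of an already solved base tree (step (i)) and keep the k most
    expensive ones as candidate extreme realizations for refinement."""
    trees = [sample_tree() for _ in range(n_trees)]
    costs = [fast_lb(tree) for tree in trees]
    ranked = sorted(range(n_trees), key=costs.__getitem__, reverse=True)
    return [trees[i] for i in ranked[:k_worst]]   # refine these, then iterate to (ii)
```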

Note that for some applications where “rare” events are considered, the extreme values we refer to might never appear even after sampling a large number of trees. However, this specific situation does not invalidate the approach of sampling a large number of trees and calculating the probability distribution of costs, which would already be useful for a probabilistic analysis of mildly extreme events.

Next we show the sensitivities of cutting plane models to the tree when the noise is normal. We take \(L=2000\) trees for the out-of-sample analysis and perform 18 forward–backward iterations. This is much less than in the previous experiments due to memory usage, because the number of scenarios in this section reaches almost triple that of the previous experiments (\(S=200\)). However, such an early stop does not compromise the analysis, as can be seen from the magnitude of the sensitivities, because 18 forward–backward passes for \(S=200\) scenarios already yield enough lower bounding information.

Tables 7 and 8 report the statistical significance of the lower bounds obtained. The difference between them is that the normal distribution of Table 8 has 50% of the standard deviation of that of Table 7. Again, the normal noise is discretized with S samples to build the sample average problem. The first two columns of Tables 7 and 8 refer to the base tree solved. We observe that “convergence” for Table 8 is faster than for Table 7 because the standard deviation of \(\hat{v}_1^l\) is smaller relative to the lower bounds. For Table 7 the standard deviation of \(\hat{v}_1^l\) is around 1.3% and the maximum deviation around 4.7%, while for Table 8 these numbers are 1.0% and 3.7%, respectively.

Again we note that the averages of \(\hat{v}_1^l\) are stable across S, as also observed in Fig. 1. As explained before, for big \(S \ge 100\) the lower bound is slightly different from the expectation of \(\hat{v}_1^l\) because of numerical errors in the free-floating part of the cut. The quantity \(\max _l |\hat{v}_1^l - \mathbb {E}\hat{v}_1^l|\) goes to zero because the problem has bounded feasible sets and is always feasible; otherwise, sampling long enough would make it diverge. The evolution of the histograms of \(\hat{v}_1^l\) as a function of S for both Tables 7 and 8 is shown in Fig. 4.

Table 7 This table shows the sensitivity of the cutting planes for a normal noise with zero mean
Table 8 The normal noise for these experiments has 50% of the standard deviation of the one for Table 7
Fig. 4 Histograms for the realizations of \(\hat{v}_1^l\) for Table 7 (left) and Table 8 (right). The results are similar to Fig. 1. Note that the histogram for \(S=200\) on the right has a range of about 13 units around the mean, as reported in Table 8 as the maximum deviation, which represents about 4% of the optimal value

Tables 7 and 8 and Fig. 4 suggest that taking \(S=200\) is enough to reduce the variance of the optimal values associated with different trees to small values, such as the 1% shown in Tables 7 and 8. This conclusion may nevertheless depend on the application, since the variance of the optimal values for \(S=30\) is also small (around 3%). The maximum deviation from the empirical mean of \(\hat{v}^l_1\) varies substantially more, however: around 10% of the optimal value for \(S = 30\) versus about 3% for \(S=200\), which indicates that \(S \ge 200\) might be a safe choice.

6 Extensions via duality

In this section we replicate the dual problems found in [11] and explain how our strategy for cut-sharing across trees applies. The dynamic programming equations (DPE) for the dual SDDP method [11, 12] can be derived by (i) writing the dual of the deterministic equivalent primal problem and (ii) identifying a stage-wise decomposition of the dual using concave value functions. Once this is done, and in spite of feasibility issues that must be handled carefully [11], the solution procedures for the dual look familiar because the structure of the problem is familiar.

Note that the derivation of the dynamic programming equations for the dual SDDP and the proper treatment of the underlying feasibility issues is a delicate topic, which we do not intend to address here. For the application of our strategy, it is enough to examine the structure of the DPE and recognize that it is similar to the primal equations; once that is realized, it suffices to transform the maximization problems into minimization problems and apply the cut-sharing as described before.

The data process is denoted by \(\xi _t = (c_t, W_t, B_t, h_t)\) and assumed to have finitely many realizations at every stage, denoted by \(\xi _{ts}\) for \(s=1,\dots ,S\). We also write \(c_t = c_t(\xi _t)\) and \(B_t = B_t(\xi _t)\); in this section \(h_t\) and \(W_t\) are not random. The history of the process is denoted by \(\xi _{[t]} = (\xi _1, \dots , \xi _t)\), and \(\xi _1\) is deterministic. Stage-wise independence translates into assuming that \(\xi _t\) is independent of \(\xi _{[t-1]}\). The primal and dual multistage stochastic deterministic equivalent problems are respectively given by

$$\begin{aligned} \min _{x_t(\cdot )} \quad \mathbb {E} \left[ \sum _{t=1}^T c_t^\intercal x_t \right] \quad \text{s.t.} \quad x_t \ge 0, \quad W_1 x_1 = h_1, \quad W_t x_t = h_t - B_t x_{t-1} \quad \forall t=2,\dots ,T \end{aligned}$$
(24)

and

$$\begin{aligned} \max _{\pi _t(\cdot )} \quad \mathbb {E} \left[ \sum _{t=1}^T h_t^\intercal \pi _t \right] \quad \text{s.t.} \quad W_T^\intercal \pi _T \le c_T, \quad W_{t-1}^\intercal \pi _{t-1} + \mathbb {E}_{|\xi _{[t-1]}} \left[ B_t^\intercal \pi _t \right] \le c_{t-1} \quad \forall t=2,\dots ,T \end{aligned}$$
(25)

where the optimization is performed over policies \(x_t = x_t(\xi _{[t]})\) and \(\pi _t = \pi _t(\xi _{[t]})\), so that the nonanticipativity constraints are implicitly enforced and the probability space underlying the expectations is properly constructed. Standard duality theory applies to the pair (24)–(25).

As is widely known, the upper bound for the primal SDDP is calculated via Monte Carlo simulation. This is considered a weakness of the method because the process is slow and stochastic. One of the motivations of [11, 12] was to generate deterministic upper bounds for the optimal value of an SDDP problem. The first stage problem of the dual SDDP is given by

$$\begin{aligned} \max _{\pi _1} \quad h_1^\intercal \pi _1 + V_{2}(\pi _1, \xi _1). \end{aligned}$$
(26)

For \(t=2,\dots ,T-1\) the definition of \(V_{t}(\pi _{t-1}, \xi _{t-1})\) is given by

$$\begin{aligned} \max _{\pi _{t1}, \dots , \pi _{tS}} \quad S^{-1} \sum _{s=1}^S h_{ts}^\intercal \pi _{ts} + S^{-1} \sum _{s=1}^S V_{t+1}(\pi _{ts}, \xi _{ts}) \quad \text{s.t.} \quad W_{t-1}^\intercal \pi _{t-1} + S^{-1} \sum _{s=1}^S B_{ts}^\intercal \pi _{ts} \le c_{t-1}. \end{aligned}$$
(27)

Finally, the definition of \(V_T(\pi _{T-1}, \xi _{T-1})\) is

$$\begin{aligned} \max _{\pi _{T1}, \dots , \pi _{TS}} \quad S^{-1} \sum _{s=1}^S h_{Ts}^\intercal \pi _{Ts} \quad \text{s.t.} \quad W_{T-1}^\intercal \pi _{T-1} + S^{-1} \sum _{s=1}^S B_{Ts}^\intercal \pi _{Ts} \le c_{T-1}, \quad W_{Ts}^\intercal \pi _{Ts} \le c_{Ts} \quad \forall s. \end{aligned}$$

First of all, note that (26) is unconstrained, so we may run into bad iterates if the first forward passes are performed without care; this is because we use upper bounding cutting plane models for these value functions. Secondly, the subproblems can be bigger because they involve the variables \(\pi _{t1}, \dots , \pi _{tS}\) for all scenarios.

For our purposes, we only have to note that in (27) the cost vectors of the primal appear in the RHS, as does \(\pi _{t-1}\). Therefore, we can generate free-floating cuts approximating \(V_t\) not only in terms of \(\pi _{t-1}\) but also in terms of the realizations of the cost. This information is passed backwards by replicating the RHS parameters in (27), analogously to what we have done before. The two-stage case is written below to illustrate. The first stage problem is still (26) and the second stage value function \(V_2(\pi _1, c_{21}, \dots , c_{2S})\) is

$$\begin{aligned} \max _{\pi _{21}, \dots , \pi _{2S}} \quad S^{-1} \sum _{s=1}^S h_{2s}^\intercal \pi _{2s} \quad \text{s.t.} \quad W_1^\intercal \pi _1 + S^{-1} \sum _{s=1}^S B_{2s}^\intercal \pi _{2s} \le c_1, \quad W_{2s}^\intercal \pi _{2s} \le c_{2s} \quad \forall s. \end{aligned}$$

Assume we are given an iterate \((\hat{\pi }_1, \hat{c}_{21}, \dots , \hat{c}_{2S})\), where \(\hat{c}_{21}, \dots , \hat{c}_{2S}\) are obtained via sampling. Using formulas similar to (3), we can generate supergradient vectors \((\hat{\alpha }_1, \hat{\beta }_{21}, \dots , \hat{\beta }_{2S})\) such that

$$\begin{aligned} V_2(\pi _1, c_{21}, \dots , c_{2S}) \le V_2(\hat{\pi }_1, \hat{c}_{21}, \dots , \hat{c}_{2S}) + \hat{\alpha }_1^\intercal (\pi _1 - \hat{\pi }_1) + \sum _{s=1}^S \hat{\beta }_{2s}^\intercal (c_{2s} - \hat{c}_{2s}) \quad \forall \pi _1, \quad \forall c_{2s}. \end{aligned}$$
(28)

Inside a cutting planes method, formula (28) is used to improve a polyhedral approximation of the first stage problem (26), so that upper bounds for new cost vectors can be computed quickly once one base tree is solved.
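
The following sketch illustrates how such a supergradient cut could be generated in practice for the two-stage dual above. It is an assumption-laden illustration, not the authors' code: we assume the LP is solvable, solve the maximization as a minimization with SciPy's HiGHS interface, and read the constraint marginals (whose sign convention is handled below); \(\hat{\alpha }_1\) then follows from the chain rule through the RHS \(c_1 - W_1^\intercal \hat{\pi }_1\).

```python
import numpy as np
from scipy.optimize import linprog

def dual_second_stage_cut(W1, c1, pi1_hat, h2, B2, W2, c2_hat):
    """Solve the second-stage dual at (pi1_hat, c2_hat) and return
    (V2 value, alpha_1, [beta_21, ..., beta_2S]) for the cut (28).
    Assumptions: h2, B2, W2, c2_hat are length-S lists of per-scenario numpy
    data, every pi_{2s} has the same dimension m = len(h2[s]) and every
    c_{2s} the same length, and SciPy >= 1.7 (HiGHS exposes dual values
    via res.ineqlin.marginals)."""
    S, m = len(h2), len(h2[0])
    obj = -np.concatenate([np.asarray(h) / S for h in h2])  # max -> min of negation
    # first block: W1^T pi1_hat + S^{-1} sum_s B2s^T pi_2s <= c1
    A1 = np.hstack([B.T / S for B in B2])
    b1 = np.asarray(c1) - W1.T @ pi1_hat
    # per-scenario blocks: W2s^T pi_2s <= c2s (block diagonal in the pi_2s)
    rows = [np.hstack([W2[s].T if j == s else np.zeros((W2[s].T.shape[0], m))
                       for j in range(S)]) for s in range(S)]
    A = np.vstack([A1] + rows)
    b = np.concatenate([b1] + [np.asarray(c) for c in c2_hat])
    res = linprog(obj, A_ub=A, b_ub=b,
                  bounds=[(None, None)] * (S * m), method="highs")
    lam = -res.ineqlin.marginals        # multipliers >= 0 of the <= constraints
    lam1, lam2 = lam[:len(b1)], lam[len(b1):]
    alpha = -W1 @ lam1                  # chain rule through b1 = c1 - W1^T pi1
    betas = np.split(lam2, S)           # one beta_{2s} per scenario block
    return -res.fun, alpha, betas
```

Here betas plays the role of \((\hat{\beta }_{21}, \dots , \hat{\beta }_{2S})\) in (28), and the returned value and alpha give the cut's constant term and \(\hat{\alpha }_1\).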

7 Conclusions

In this paper we present algorithms to compute valid lower bounds for RHS perturbations of MSOP at minor computational cost. The approach is based on the extensive use of certain extended cutting planes, which we call free-floating cuts. We present numerical experiments supporting the intuition that such cutting planes can be used to measure the convergence of discretized problems to the true problem. A natural future development would be a statistical analysis of the proposed algorithms. Regarding extensions, the approach applies as soon as all the parameters of a value function appear in the RHS and the optimization problem defining the value function is convex, via extensions of the Benders cut. Therefore, it can also be applied if prox-centers are added to the objective at each stage.