
1 Introduction

The early approaches to optimal control of ordinary differential equations (ODEs) were mostly based on Dynamic Programming [3] or Pontryagin's Maximum Principle [16]. Due to the curse of dimensionality, Dynamic Programming is not applicable to problems with more than a few state variables. The Maximum Principle is an indirect method in the sense that the controls are given in closed form in terms of adjoint states, the so-called co-states, which satisfy an adjoint differential equation and suitable boundary constraints. The resulting boundary value problems in the states and co-states can then be discretized by numerical methods for boundary value problems such as Multiple Shooting [15] or Collocation [21]. The Maximum Principle is a typical example of an optimize-then-discretize approach. It was later found that ideas from numerical methods for boundary value problems can also be applied to the optimal control problem itself instead of to its optimality conditions, thus dispensing with the need to formulate adjoint equations, which can be a time-consuming affair (discretize-then-optimize). Furthermore, this approach lends itself quite naturally to a direct formulation, in which the discretized controls are not eliminated in favor of adjoint variables. For Direct Collocation, we refer the reader to [4, 25] and for Direct Multiple Shooting to [6]. Direct formulations have been found to reflect the real conditioning of the optimal control problem better, in particular when the mapping from the adjoint to the control has a large Lipschitz constant. We want to remark that there are also direct optimize-then-discretize methods, in which the control is kept in the system of optimality conditions and in its discretization.

In optimal control for partial differential equations (PDEs), indirect optimize-then-discretize methods prevail (see, e.g., [24]). This is mostly due to the fact that indirect methods usually yield more accurate representations of the control (see also [11, Chap. 3]).

The contribution of this article is to summarize a Direct Multiple Shooting method for parabolic PDE constrained problems from Potschka [17, 18] in a concise form and to present numerical results for problems that could also be treated well with indirect methods. The main challenge in this approach is to tame the large-scale nature of the discretized Nonlinear Programming problems (NLPs), which can be mastered by the use of inexact Sequential Quadratic Programming (SQP) with two-grid Newton-Picard preconditioning.

The main strengths of the Direct Multiple Shooting approach unfold in particular for practitioners who want to solve parabolic PDE constrained optimization problems and need an accurate representation of the system dynamics, but can live with a lower resolution of the optimal control. This situation is not unusual due to physical restrictions of the manipulator hardware. These practitioners benefit from time savings not only in the problem setup, because no adjoint equations need to be derived, but also in the problem solution, because the method can easily be parallelized on the basis of the Multiple Shooting structure. Furthermore, the solutions delivered by Direct Multiple Shooting can serve as initial guesses for other methods with higher accuracy.

2 Problem Formulation

Let \(\varOmega \subset \mathbb{R}^{d}\) be a bounded d-dimensional spatial domain with sufficiently regular boundary ∂ Ω. Regarding time, we only consider the fixed interval (0, 1) here for simplicity. We assume that the control satisfies

$$\displaystyle{q \in L^{2}((0,1);Q),\quad \text{where}\quad Q \subseteq L^{2}(\varOmega )^{n_{q}^{\mathrm{d}} } \times L^{2}(\partial \varOmega )^{n_{q}^{\mathrm{b}} },}$$

with \(n_{q}^{\mathrm{d}}\) distributed and \(n_{q}^{\mathrm{b}}\) boundary controls. We would already like to point out that the pure boundary control case \(n_{q}^{\mathrm{d}} = 0\) is especially advantageous in a Multiple Shooting approach, as we describe in Sect. 3.

We consider two kinds of coupled state variables: ODE states \(v \in C^{1}([0,1]; \mathbb{R}^{n_{v}})\), which are not spatially distributed, and PDE states \(u \in W(0,1)^{n_{u}}\), for which we use the standard Hilbert space for parabolic PDEs

$$\displaystyle{W(0,1) =\{ u \in L^{2}((0,1);V )\ \vert \ \partial _{ t}u \in L^{2}((0,1);V ^{{\ast}})\},}$$

where \((V,L^{2}(\varOmega ),V^{{\ast}})\) is a Gelfand triple and \(\partial _{t}u\) denotes the time derivative of u in the sense of vectorial distributions (see, e.g., [29, Chap. IV]). We assume that the spatial regularity of u is \(V \subseteq H^{1}(\varOmega )\). According to [7, Chap. XVIII, Theorem 1], we can trade some spatial regularity for higher temporal regularity, because W(0, 1) is continuously embedded in \(C^{0}([0,1];L^{2}(\varOmega ))\). Thus, we have existence of the trace \(u(t) \in L^{2}(\varOmega )^{n_{u}}\) for all t ∈ [0, 1], which is important for the formulation of boundary constraints in time.

We assume that the time derivative \(\dot{v}\) equals a sufficiently regular nonlinear function \(f^{\mathrm{ODE}}: Q \times L^{2}(\varOmega )^{n_{u}} \times \mathbb{R}^{n_{v}} \rightarrow \mathbb{R}^{n_{v}}\) and that \(-\partial _{t}u\) equals a possibly nonlinear elliptic differential operator \(A: Q \times V \times \mathbb{R}^{n_{v}} \rightarrow V ^{{\ast}}\). Furthermore, we allow for generalized temporal boundary constraints, mixed control-ODE-state path constraints and final ODE state constraints via the functions

$$\displaystyle\begin{array}{rcl} r^{\mathrm{b}}: L^{2}(\varOmega )^{n_{u} } \times \mathbb{R}^{n_{v} } \rightarrow L^{2}(\varOmega )^{n_{u} } \times \mathbb{R}^{n_{v} },\quad r^{\mathrm{c}}: Q \times \mathbb{R}^{n_{v} } \rightarrow \mathbb{R}^{n_{\mathrm{r}}^{\mathrm{c}} },\quad r^{\mathrm{e}}: \mathbb{R}^{n_{v} } \rightarrow \mathbb{R}^{n_{\mathrm{r}}^{\mathrm{e}} }.& & {}\\ \end{array}$$

The objective function \(\varPhi: L^{2}(\varOmega )^{n_{u}} \times \mathbb{R}^{n_{v}} \rightarrow \mathbb{R}\) depends on the final values of the ODE and PDE states. Finally, we can state the problem of interest of this article as

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{ \begin{array}{c}q\in L^{2}((0,1);Q) \\ u\in W(0,1)^{n_{u}} \\ v\in C^{1}([0,1];\mathbb{R}^{n_{v}})\end{array}}\ & \varPhi (u(1),v(1))&{}\end{array}$$
(1a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \quad \partial _{t}u = -A(q(t),u(t),v(t)),\quad & t \in (0,1),&{}\end{array}$$
(1b)
$$\displaystyle\begin{array}{rcl} \dot{v} = f^{\mathrm{ODE}}(q(t),u(t),v(t)),\quad t \in (0,1),& &{}\end{array}$$
(1c)
$$\displaystyle\begin{array}{rcl} (u(0),v(0)) = r^{\mathrm{b}}(u(1),v(1)),\ & &{}\end{array}$$
(1d)
$$\displaystyle\begin{array}{rcl} r^{\mathrm{c}}(q(t),v(t)) \geq 0,\quad t \in (0,1),& &{}\end{array}$$
(1e)
$$\displaystyle\begin{array}{rcl} r^{\mathrm{e}}(v(1)) \geq 0.& &{}\end{array}$$
(1f)

More general problems with free initial and final time or integral-type objectives can be equivalently reformulated using extra v variables to fit problem class (1). Formulating further regularity requirements on the functions above is a delicate issue. From a practical point of view, we require that (1) is well-posed and that there exists an appropriate discretization for the involved variables and functions such that the solution of the resulting finite dimensional optimization problem is consistent with (1).
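To illustrate the first point, a Lagrange-type objective with a generic running cost \(\phi\) (a symbol introduced here for illustration only) can be absorbed into the Mayer form (1a) by one additional ODE state:

$$\displaystyle{\int _{0}^{1}\phi (q(t),u(t),v(t))\,\mathrm{d}t = v_{n_{v}+1}(1)\quad \text{for}\quad \dot{v}_{n_{v}+1} =\phi (q(t),u(t),v(t)),\quad v_{n_{v}+1}(0) = 0,}$$

so that \(\varPhi (u(1),v(1)) = v_{n_{v}+1}(1)\) fits (1a). Free initial and final times can be handled analogously by a time transformation with an extra state for the horizon length.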

3 Discretization

In Potschka [17], problem (1) is discretized in space first and then in time. We explain here a different route, where time is discretized before space. In both cases, though, the resulting discretized problem (3) exhibits the same structure.

3.1 Time Discretization

We start out by discretizing the controls in time. To this end, let \(0 = t^{0} <\ldots < t^{n_{\text{MS}}} = 1\) denote a partition of the interval [0, 1]. On the shooting intervals \(I^{i}:= (t^{i},t^{i+1}),i = 0,\mathop{\ldots },n_{\text{MS}} - 1\), we perform a piecewise discretization of the controls in time. The piecewise nature is important for decoupling properties to be exploited in the numerics. We restrict ourselves here to the simplest case of a piecewise constant discretization in time such that

$$\displaystyle{\tilde{q} (t) =\sum _{ j=0}^{n_{\text{MS}}-1 }q^{j}\chi _{ I^{j}}(t),\quad \text{with }q^{j} \in Q,j = 0,\mathop{\ldots },n_{ \text{MS}} - 1,}$$

where \(\chi _{I^{j}}\) denotes the characteristic function of \(I^{j}\).

In the next step, we parametrize the state variables by local initial values \(u^{i},\boldsymbol{v}^{i},i = 0,\mathop{\ldots },n_{\text{MS}}\): We assume that the local initial value problems of the form (1b)–(1c) on I i with initial values \(u^{i},\boldsymbol{v}^{i}\) admit a unique solution denoted by

$$\displaystyle{\left (\begin{array}{*{10}c} \overline{u}^{i}(t;q^{i},u^{i},\boldsymbol{v}^{i}) \\ \overline{v}^{i}(t;q^{i},u^{i},\boldsymbol{v}^{i}) \end{array} \right ) \in L^{2}(\varOmega )^{n_{u} }\times \mathbb{R}^{n_{v} },\quad \text{for all }t \in I^{i},i = 0,\mathop{\ldots },n_{ \text{MS}}-1.}$$

Continuity of the states on the full time horizon (0, 1) must then be enforced with matching conditions of the form

$$\displaystyle{ \overline{u}^{i}(t^{i+1};q^{i},u^{i},\boldsymbol{v}^{i}) = u^{i+1},\quad \overline{v}^{i}(t^{i+1};q^{i},u^{i},\boldsymbol{v}^{i}) =\boldsymbol{ v}^{i+1},\quad i = 0,\mathop{\ldots },n_{ \text{MS}} - 1. }$$
(2)

At this stage of the discretization procedure, we have arrived at a problem depending only on variables that are either real vectors or functions distributed in space only, namely \(q^{i},u^{i},\boldsymbol{v}^{i},i = 0,\mathop{\ldots },n_{\text{MS}}\). Note that we have en passant added an additional control variable \(q^{n_{\text{MS}}}\) to lend a common structure to the optimization variables corresponding to each shooting node \(t^{i}\). We have to eliminate these additional degrees of freedom later via (3f).
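The parametrization and the matching conditions (2) translate directly into code. The following minimal Python sketch is illustrative only: it assumes a generic black-box integrator `integrate` (a stand-in for a solver such as DAESOL-II, our naming) and flattens each pair \((u^{i},\boldsymbol{v}^{i})\) into one array.

```python
import numpy as np

def shooting_residuals(integrate, t, q, s):
    """Matching conditions (2) as residuals for Direct Multiple Shooting.

    integrate(t0, t1, q_i, s_i) -> solution of the local IVP on (t0, t1),
                                   started at s_i and evaluated at t1
    t : shooting grid 0 = t^0 < ... < t^{n_MS} = 1
    q : piecewise constant controls q^0, ..., q^{n_MS - 1}
    s : local initial values s^0, ..., s^{n_MS}
    """
    res = []
    for i in range(len(t) - 1):
        s_end = integrate(t[i], t[i + 1], q[i], s[i])  # decoupled per interval
        res.append(s_end - s[i + 1])                   # continuity residual
    return np.concatenate(res)
```

Because the loop iterations are independent of each other, they can be distributed over several cores, which is the source of the parallelism mentioned in the introduction.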

3.2 Space Discretization

Now, we can discretize \(q^{i}\) and \(u^{i}\) in space, e.g., by Finite Differences, Finite Elements, or Finite Volumes. We denote the corresponding vectors by \(\boldsymbol{q}^{i}\) and \(\boldsymbol{u}^{i}\). The quantities \(\overline{u}^{i}\) and \(\overline{v}^{i}\) can then be discretized by an appropriate time stepping method. We denote the result by \(\overline{\boldsymbol{u}}^{i}(t^{i+1};\boldsymbol{q}^{i},\boldsymbol{u}^{i},\boldsymbol{v}^{i})\) and \(\overline{\boldsymbol{v}}^{i}(t^{i+1};\boldsymbol{q}^{i},\boldsymbol{u}^{i},\boldsymbol{v}^{i})\). It is not mandatory to use the same grids for the controls or states on all shooting intervals, or even, in the case of a Rothe-type time stepping scheme, within one shooting interval for the time steps of the states. According to [10], a good compromise between good mesh adaptivity and few Finite Element matrix assembly calls is to fix the state mesh on each shooting interval and to discretize the matching conditions (2) in a variational manner. In this case, a Method of Lines discretization seems to be a flexible and efficient approach, because existing ODE solvers can be used. For a more detailed discussion of discretization issues for parabolic PDEs, we refer the reader to [23].

For simplicity, we assume here that \(\boldsymbol{q}^{i}\) and \(\boldsymbol{u}^{i}\) are based on the same spatial discretization at each shooting node. However, we assume that there is a hierarchy of nested discretizations \(V _{h}^{1} \subset V _{h}^{2} \subset \cdots \subset V\) for the states, which in turn yields discretized states and shooting solutions on each level \(\ell= 1,2,\mathop{\ldots }\). The path constraint is discretized on the shooting grid only. For more sophisticated methods, see Potschka [19].

All remaining functions of problem (1) need to be appropriately discretized as well; in particular, \(\boldsymbol{r}^{\mathrm{i}}\) below denotes the discretized path constraint function \(r^{\mathrm{c}}\). Finally, we arrive at a highly structured, finite dimensional optimization problem on each spatial discretization level

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{(\boldsymbol{q}^{i },\boldsymbol{u}^{i},\boldsymbol{v}^{i})_{i=0}^{n_{ \text{MS}}}}\ \boldsymbol{\varPhi }(\boldsymbol{u}^{n_{\text{MS}}},\boldsymbol{v}^{n_{\text{MS}}})& &{}\end{array}$$
(3a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \quad \boldsymbol{r}_{u}^{\mathrm{b}}(\boldsymbol{u}^{n_{\text{MS}}},\boldsymbol{v}^{n_{\text{MS}}}) -\boldsymbol{ u}^{0} = 0,\ & &{}\end{array}$$
(3b)
$$\displaystyle\begin{array}{rcl} \overline{\boldsymbol{u}}^{i}(t^{i};\boldsymbol{q}^{i-1},\boldsymbol{u}^{i-1},\boldsymbol{v}^{i-1}) -\boldsymbol{ u}^{i} = 0,\quad & i = 1,\mathop{\ldots },n_{ \text{MS}},&{}\end{array}$$
(3c)
$$\displaystyle\begin{array}{rcl} \boldsymbol{r}_{v}^{\mathrm{b}}(\boldsymbol{u}^{n_{\text{MS}}},\boldsymbol{v}^{n_{\text{MS}}}) -\boldsymbol{ v}^{0} = 0,\ & &{}\end{array}$$
(3d)
$$\displaystyle\begin{array}{rcl} \overline{\boldsymbol{v}}^{i}(t^{i};\boldsymbol{q}^{i-1},\boldsymbol{u}^{i-1},\boldsymbol{v}^{i-1}) -\boldsymbol{ v}^{i} = 0,\quad i = 1,\mathop{\ldots },n_{ \text{MS}},& &{}\end{array}$$
(3e)
$$\displaystyle\begin{array}{rcl} \boldsymbol{q}^{n_{\text{MS}}} -\boldsymbol{ q}^{n_{\text{MS}}-1 } = 0,& &{}\end{array}$$
(3f)
$$\displaystyle\begin{array}{rcl} \boldsymbol{r}^{\mathrm{i}}(\boldsymbol{q}^{i-1},\boldsymbol{v}^{i-1}) \geq 0,\qquad i = 1,\mathop{\ldots },n_{ \text{MS}},& &{}\end{array}$$
(3g)
$$\displaystyle\begin{array}{rcl} r^{\mathrm{e}}(\boldsymbol{v}^{n_{\text{MS}}}) \geq 0.& &{}\end{array}$$
(3h)

The numerical evaluation of (3c) and (3e) is expensive, because it comprises the solution of an initial value problem for the spatially discretized PDE. For this task, we use an adaptive backward differentiation formula method with a monitoring strategy [1, 2], which is implemented in the C++ code DAESOL-II. It also supplies efficient first and second order forward and backward directional derivatives of the solutions of the initial value problems on the basis of Internal Numerical Differentiation [5] and Algorithmic Differentiation [9, 28]; the backward mode of Algorithmic Differentiation delivers an automated and efficient way to compute gradients in the style of adjoint equations.

We want to remark that special attention has to be paid to the question of consistency of the discretized problem (3) with the original problem (1) for fine discretizations. For instance, it is necessary to use correct weighting matrices for the discrete approximations of Hilbert space inner products and their corresponding norms. Their inverses must also be used for correct discrete Riesz representations of discrete variables that really live in the dual space, for example the Lagrange multipliers of (3b) and (3c). Then, it is typically possible to prove consistency of (3) and (1) for the particular problem at hand.

4 Newton-Picard Inexact SQP

When disregarding its special structure for a moment, we see that problem (3) is a Nonlinear Programming (NLP) problem of the form

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{x\in \mathbb{R}^{n}}\qquad f(x)& &{}\end{array}$$
(4a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \qquad g_{i}(x) = 0,\qquad i \in \mathcal{E},& &{}\end{array}$$
(4b)
$$\displaystyle\begin{array}{rcl} g_{i}(x) \geq 0,\qquad i \in \mathcal{I},& &{}\end{array}$$
(4c)

with \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) and \(g: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\). We assume throughout that f and g are twice continuously differentiable functions and that the sets \(\mathcal{E}\) and \(\mathcal{I}\) form a partition of \(\{1,\mathop{\ldots },m\} = \mathcal{E}\ \dot{\cup }\ \mathcal{I}\). For \(z = (x,y) \in \mathbb{R}^{n+m}\), we can then define the Lagrangian function

$$\displaystyle{\mathcal{L}(z) = f(x) -\sum _{i=1}^{m}y_{ i}g_{i}(x).}$$

For the theory of NLP we refer the reader to [14].

4.1 The Equality Constrained Case

We first consider the equality constrained case \(\mathcal{I} =\{\}\). If \(x^{{\ast}}\in \mathbb{R}^{n}\) satisfies a constraint qualification and is a local minimizer of (4), then there exists a Lagrange multiplier \(y^{{\ast}}\in \mathbb{R}^{m}\) such that with \(z^{{\ast}} = (x^{{\ast}},y^{{\ast}})\)

$$\displaystyle{ F(z^{{\ast}}) = \left (\begin{array}{*{10}c} F_{1}(z^{{\ast}}) \\ F_{2}(z^{{\ast}}) \end{array} \right ) = \left (\begin{array}{*{10}c} \nabla _{x}\mathcal{L}(z^{{\ast}}) \\ g(x^{{\ast}}) \end{array} \right ) = 0. }$$
(5)

In practice, a Newton-type method is used to solve the necessary optimality condition (5) in order to obtain a critical point \(z^{{\ast}}\). In general we must then check (second order) sufficient conditions to ensure that \(x^{{\ast}}\) is really a minimizer and not a maximizer or saddle point.

Because of the large-scale nature of (3), it is a good idea to employ iterative methods for the solution of the linearized subproblems within the Newton-type method. We resort to a rather simple iterative method here, namely a Linear Iterative Splitting Approach (LISA), because we can generalize it from linear systems to Quadratic Programming (QP) problems in Sect. 4.4.

Let \(N = n + m,J(z) = \frac{\mathrm{d}F} {\mathrm{d}z} (z),\) and \(\hat{M}: \mathbb{R}^{N} \rightarrow \mathbb{R}^{N\times N}\) be given such that \(\hat{M} (z)J(z) \approx \mathbb{I}_{N}\). Then, the LISA-Newton iterates are defined via

$$\displaystyle{ z^{k+1} = z^{k} +\alpha _{ k}\varDelta z^{k},\quad z^{0}\text{ given}, }$$
(6)

where \(\alpha _{k} \in (0,1]\) is an appropriately chosen damping factor to ensure global convergence and \(\varDelta z^{k}\) is computed via the inner LISA iteration

$$\displaystyle{ \varDelta z_{i+1}^{k} =\varDelta z_{ i}^{k} -\hat{M} (z^{k})\left [J(z^{k})\varDelta z_{ i}^{k} + F(z^{k})\right ],\quad \varDelta z_{ 0}^{k} = 0. }$$
(7)

The iteration (7) converges for all values of \(F(z^{k})\) and \(\varDelta z_{0}^{k}\) if and only if the spectral radius \(\kappa _{\mathrm{lin}}\) of \(A_{k}:= \mathbb{I}_{N} -\hat{M} (z^{k})J(z^{k})\) is less than 1, see [22, Theorem 4.1].

The choice of the damping factors \(\alpha _{k}\) is beyond the scope of this article. The interested reader is referred to the discussion in [8] and its specialization for the LISA-Newton method [17].
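As an illustration, the following dense Python sketch performs the inner iteration (7) for a given Jacobian J, residual F, and a fixed preconditioner \(\hat{M}\), all stored as plain arrays; in the actual method, \(\hat{M}\) is never formed explicitly but only applied through the Newton-Picard structures of Sects. 4.3 and 5, and the explicit eigenvalue check below is for illustration only.

```python
import numpy as np

def lisa(J, F, M_hat, l):
    """l steps of the inner LISA iteration (7), starting at dz_0 = 0."""
    N = len(F)
    # Convergence requires the spectral radius of A_k = I - M_hat J to be < 1.
    kappa_lin = max(abs(np.linalg.eigvals(np.eye(N) - M_hat @ J)))
    assert kappa_lin < 1.0, "splitting does not contract"
    dz = np.zeros(N)
    for _ in range(l):
        dz = dz - M_hat @ (J @ dz + F)  # dz_{i+1} = dz_i - M_hat (J dz_i + F)
    return dz
```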

4.2 A-Posteriori κ-Estimators

Based on [17, Lemma 4.28], we observe that there is a representation of \(\varDelta z_{l}^{k}\) for l ≥ 1 in terms of a truncated Neumann series

$$\displaystyle{\varDelta z_{l}^{k} = -\left [\sum \nolimits _{ i=0}^{l-1}\left (\mathbb{I}_{ N} -\hat{M} (z^{k})J(z^{k})\right )^{i}\right ]\hat{M} (z^{k})F(z^{k}) =: -M(z^{k})F(z^{k}),}$$

where \(M(z^{k})\) depends on l. On the basis of this M, we can observe the connection

$$\displaystyle{\kappa _{\varepsilon } \in [(\kappa _{\mathrm{lin}})^{l},(\kappa _{\mathrm{ lin}})^{l}+\varepsilon ]\quad \text{for all }\varepsilon > 0}$$

between the linear convergence rate \(\kappa _{\mathrm{lin}}\) of the LISA iteration (7) and the linear convergence rate \(\kappa _{\varepsilon }\) (which depends on an \(\varepsilon\)-dependent norm) of the nonlinear LISA-Newton iteration (6), if each \(\varDelta z^{k}\) is computed from l LISA iterations [17, Theorem 4.29]. This implies that under the assumption \(\kappa _{\mathrm{lin}} < 1\), l = 1 is already enough for local convergence and, moreover, if we perform l > 1 LISA iterations per outer step, then this extra effort is compensated for by fewer outer LISA-Newton iterations, at least asymptotically for \(k \rightarrow \infty \).

Furthermore, this result is the basis for a-posteriori κ-estimators [17]. To explain the different estimators, we fix k and define \(\zeta _{l}:=\varDelta z_{l+1}^{k} -\varDelta z_{l}^{k}\). We immediately observe that \(\zeta _{l+1} = A_{k}\zeta _{l}\). The Rayleigh κ-estimator is based on the Rayleigh quotient

$$\displaystyle{\kappa _{l}^{\mathrm{Rayleigh}} = \frac{\zeta _{l}^{\mathrm{T}}A_{ k}\zeta _{l}} {\zeta _{l}^{\mathrm{T}}\zeta _{l}} = \frac{\zeta _{l}^{\mathrm{T}}\zeta _{l+1}} {\zeta _{l}^{\mathrm{T}}\zeta _{l}}.}$$

The sequence \((\kappa _{l}^{\mathrm{Rayleigh}})\) converges to \(\kappa _{\mathrm{lin}}\) for \(l \rightarrow \infty \) if \(A_{k}\) is diagonalizable, has a single eigenvalue of largest modulus, and \(\zeta _{1}\) has a nonzero component in the corresponding eigenspace. For non-diagonalizable \(A_{k}\), the Root κ-estimator

$$\displaystyle{\kappa _{l}^{\mathrm{Root}} = (\left \Vert \zeta _{ l+1}\right \Vert /\left \Vert \zeta _{l}\right \Vert )^{1/l}}$$

can be used. It converges if \(\kappa _{\mathrm{lin}} > 0\) and \(\zeta _{1}\) has a nonzero component in the dominant invariant subspace of \(A_{k}\). However, convergence to \(\kappa _{\mathrm{lin}}\) can be quite slow. In contrast, the Ritz κ-estimator converges in a finite number of iterations. It is based on the largest Ritz value \(\kappa _{l}^{\mathrm{Ritz}}\) of \(A_{k}\) on the order-l Krylov subspace generated by \(A_{k}\) and \(\zeta _{1}\). The disadvantage is that an orthonormal basis of the Krylov subspace needs to be maintained, which can be prohibitive due to excessive memory consumption. In our numerical experience, this is not a problem because the Newton-Picard based preconditioners \(\hat{M}\) described in Sect. 4.3 require only a few LISA iterations.

The κ-estimators can also be used to assess the error of the LISA iteration (7) in the following sense: For all \(\varepsilon \in (0,1 -\kappa _{\mathrm{lin}})\), there exists a vector-norm \(\left \Vert.\right \Vert _{{\ast},k}\) with corresponding matrix-norm \(\left \Vert A_{k}\right \Vert _{{\ast},k} \leq \kappa _{\mathrm{lin}}+\varepsilon < 1\) such that (compare [17, Lemma 4.30])

$$\displaystyle{\left \Vert \varDelta z_{l}^{k} -\varDelta z^{k}\right \Vert _{ {\ast},k} \leq \frac{(\kappa _{\mathrm{lin}}+\varepsilon )^{l}} {1 - (\kappa _{\mathrm{lin}}+\varepsilon )}\left \Vert \varDelta z_{1}^{k} -\varDelta z_{ 0}^{k}\right \Vert _{ {\ast},k}.}$$

In our implementation, we have l ≥ 2 and terminate the LISA iterations as soon as either the κ-estimate is lower than \(\kappa _{\mathrm{max}} = \sqrt{1/2}\) or a maximum number of LISA iterations, in our case \(l_{\mathrm{max}} = 7\), is reached. In the latter case, we improve the preconditioner \(\hat{M}\) in order to reduce the contraction rate \(\kappa _{\mathrm{lin}}\), as discussed in the following section. In most outer iterations only l = 2 LISA iterations are required. We used the Ritz κ-estimator for the computations in Sect. 6.
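In the same illustrative vein, the Rayleigh and Root estimators can be computed from the increments \(\zeta _{l}\) of the stored LISA iterates alone, without access to \(A_{k}\); the indexing below is shifted to zero-based Python lists, and the Ritz estimator is omitted because it additionally requires an orthonormal Krylov basis.

```python
import numpy as np

def kappa_estimators(dz):
    """Rayleigh and Root kappa-estimators from LISA iterates dz_0, ..., dz_m.

    Uses zeta_l = dz_{l+1} - dz_l and the identity zeta_{l+1} = A_k zeta_l,
    so A_k is never needed. Requires at least three iterates (m >= 2).
    """
    zeta = [dz[i + 1] - dz[i] for i in range(len(dz) - 1)]
    l = len(zeta) - 1  # index of the last available increment
    rayleigh = (zeta[l - 1] @ zeta[l]) / (zeta[l - 1] @ zeta[l - 1])
    root = (np.linalg.norm(zeta[l]) / np.linalg.norm(zeta[l - 1])) ** (1.0 / l)
    return rayleigh, root
```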

4.3 Two-Grid Newton-Picard Preconditioning

The requirement κ lin < 1 for convergence of the LISA iteration (7) is a strong requirement, which calls for specially tailored preconditioners. In our case, Newton-Picard preconditioning is the method of choice for computing \(\hat{M}\), see [20]. Although grid-independent convergence has so far only been proven for a linear quadratic model problem with single shooting, our numerical experience on nonlinear problems with Multiple Shooting is completely satisfying.

The general paradigm of Newton-Picard preconditioners is the following: Under reasonable assumptions, \(\overline{u}^{i}\) has a compact Fréchet derivative with respect to q i and u i. Thus, even though the discretized Jacobian matrices \(G_{q}^{i} = \partial \overline{\boldsymbol{u}}^{i}/\partial \boldsymbol{q}^{i}\) and \(G_{u}^{i} = \partial \overline{\boldsymbol{u}}^{i}/\partial \boldsymbol{u}^{i}\) are in general large dense matrices, their eigenvalues and singular values cluster at 0 and they can thus be approximated well by low-rank matrices.

From a geometrical point of view, the compactness is a result of a smoothing property of \(\overline{u}^{i}\), which can alternatively be exploited to form a low-rank approximation by a two-grid approach. To this end, we approximate the Jacobian matrices \(G_{q}^{i}\) and \(G_{u}^{i}\) on coarse grids with suitable prolongation and restriction matrices P and R. This is the preferred way because no large eigenvalue and singular value problems have to be solved. Furthermore, this approach can be extended in a straightforward fashion to Multiple Shooting approaches.
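A minimal sketch of this paradigm, under the assumption that a coarse-grid integrator `integrate_c` is available (our naming): the coarse sensitivity \(\hat{G}_{u}\) is assembled column by column, here by finite differences as a mere placeholder for the exact directional derivatives that DAESOL-II provides, and fine-grid Jacobian matvecs are then approximated as \(P\hat{G}_{u}R\) (cf. the approximated blocks (12) in Sect. 5.2).

```python
import numpy as np

def coarse_sensitivity(integrate_c, t0, t1, q, u_c, eps=1e-6):
    """Assemble G_hat_u = d ubar / d u on the coarse grid, column by column."""
    base = integrate_c(t0, t1, q, u_c)
    G_hat = np.empty((len(base), len(u_c)))
    for j in range(len(u_c)):
        du = np.zeros(len(u_c))
        du[j] = eps
        G_hat[:, j] = (integrate_c(t0, t1, q, u_c + du) - base) / eps
    return G_hat

def newton_picard_matvec(G_hat, P, R, w):
    """Two-grid approximation of the fine-grid matvec G_u @ w as P G_hat R w."""
    return P @ (G_hat @ (R @ w))
```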

4.4 The Inequality Constrained Case

We now consider the inequality constrained case when \(\mathcal{I}\neq \{\}\). The Jacobian

$$\displaystyle{J(z) = \left (\begin{array}{*{10}c} J_{1}(z)&-J_{2}(z)^{\mathrm{T}} \\ J_{2}(z)& 0 \end{array} \right ) = \left (\begin{array}{*{10}c} \nabla _{xx}^{2}\mathcal{L}(z)&-\nabla g(x) \\ \nabla g(x)^{\mathrm{T}} & 0 \end{array} \right )}$$

is a saddle-point matrix in unsymmetric form. It is singular in general, e.g., due to rank deficiency of \(J_{2}(z)\) if lower and upper variable bounds are present. Thus, we need to resort to a pseudo-inverse approach. Let us first consider the case of exact derivatives in the SQP method. In this case, we can define a suitable pseudo-inverse

$$\displaystyle{\varDelta z^{k} = J^{\oplus }(z^{k},-F(z^{k}))\quad \text{instead of }\varDelta z^{k} = -M(z^{k})F(z^{k}),}$$

by computing the primal-dual solution \(\tilde{z} = (\tilde{x}, \tilde{y} ) \in \mathbb{R}^{n+m}\) of the QP

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{\tilde{x}\in \mathbb{R}^{n}}\qquad \frac{1} {2} \tilde{x}^{\mathrm{T}}J_{ 1}(z^{k})\tilde{x}+ \left (F_{ 1}(z^{k}) - J_{ 1}(z^{k})x^{k} + J_{ 2}(z^{k})^{\mathrm{T}}y^{k}\right )^{\mathrm{T}}\tilde{x}& &{}\end{array}$$
(8a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \qquad \left (J_{2}(z^{k})\tilde{x}+ \left (F_{ 2}(z^{k}) - J_{ 2}(z^{k})x^{k}\right )\right )_{ i} = 0,\qquad i \in \mathcal{E},& &{}\end{array}$$
(8b)
$$\displaystyle\begin{array}{rcl} \left (J_{2}(z^{k})\tilde{x}+ \left (F_{ 2}(z^{k}) - J_{ 2}(z^{k})x^{k}\right )\right )_{ i} \geq 0,\quad i \in \mathcal{I},& &{}\end{array}$$
(8c)

and then the step \(\varDelta z^{k} = \tilde{z} - z^{k}\). One can show that if QP (8) has a unique solution at \(z^{k}\) and if some \(\hat{z} = (\hat{x},\hat{y}) \in \mathbb{R}^{N}\) satisfies \(\hat{y}_{i} \geq -y_{i}\) for all \(i \in \mathcal{I}\), then \(J^{\oplus }\) acts linearly on the second argument like a pseudo-inverse in the sense that \(J^{\oplus }(z^{k},J(z^{k})\hat{z}) = \hat{z}\) [17, Lemma 4.36].

Furthermore, if the active set of QP (8) does not change from iteration k − 1 to k and the second order sufficient condition and strong complementarity hold, then we can construct a matrix \(M^{k}\) such that for a neighborhood U of \(F(z^{k})\) it holds that

$$\displaystyle{-M^{k}\hat{F } = J^{\oplus }(z^{k},-\hat{F } )\quad \text{for all }\hat{F }\in U.}$$

In words, the pseudo-inverse \(J^{\oplus }\) acts locally like a matrix [17, Theorem 4.37]. In addition, if the SQP method with \(J^{\oplus }\) converges to \(z^{{\ast}}\), then \(z^{{\ast}}\) is a critical point of NLP (4). Moreover, the second order sufficiency condition and strong complementarity transfer from QP (8) at \(z^{{\ast}}\) to NLP (4) at \(z^{{\ast}}\) [17, Theorem 4.41].

In the case of inexact derivatives on the basis of the Newton-Picard approximation from Sect. 4.3 we use a structured approximation of the Jacobian

$$\displaystyle{J(z^{k}) = \left (\begin{array}{*{10}c} \nabla _{xx}^{2}\mathcal{L}(z^{k})&-\nabla g(x^{k}) \\ \nabla g(x^{k})^{\mathrm{T}} & 0 \end{array} \right ) \approx \left (\begin{array}{*{10}c} \tilde{B} ^{k}&-(\tilde{C} ^{k})^{\mathrm{T}} \\ \tilde{C} ^{k}& 0 \end{array} \right ) =: \hat{J} ^{k}.}$$

In analogy to the construction of \(J^{\oplus }(z^{k},.)\) from \(J(z^{k})\), we can construct a QP preconditioner \(\hat{M} (z^{k})\) for a QP-LISA iteration. We first compute the primal-dual solution \(\tilde{z} = (\tilde{x},\tilde{y})\) of the QP

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{\tilde{x}\in \mathbb{R}^{n}}\qquad \frac{1} {2} \tilde{x}^{\mathrm{T}}\tilde{B}^{k}\tilde{x}+ \left (F_{ 1}(z^{k}) -\tilde{B} ^{k}(x^{k} +\varDelta x_{ l-1}^{k}) + (\tilde{C} ^{k})^{\mathrm{T}}(y^{k} +\varDelta y_{ l-1}^{k})\right )^{\mathrm{T}}\tilde{x}& &{}\end{array}$$
(9a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \qquad \left (\tilde{C} ^{k}\tilde{x}+ \left (F_{ 2}(z^{k}) -\tilde{C} ^{k}(x^{k} +\varDelta x_{ l-1}^{k})\right )\right )_{ i} = 0,\quad i \in \mathcal{E},& &{}\end{array}$$
(9b)
$$\displaystyle\begin{array}{rcl} \left (\tilde{C} ^{k}\tilde{x}+ \left (F_{ 2}(z^{k}) -\tilde{C} ^{k}(x^{k} +\varDelta x_{ l-1}^{k})\right )\right )_{ i} \geq 0,\quad i \in \mathcal{I},& &{}\end{array}$$
(9c)

and then \(\varDelta z_{l}^{k} = \tilde{z}- z^{k} -\varDelta z_{l-1}^{k}\).

In order to have valid κ-estimators, the working set of QP (9) should be fixed for l > 1, effectively deferring further changes of the active set to the next outer iteration.

5 Numerical Solution of the Large-Scale QPs

In order to solve the large-scale QP (9) efficiently, we need to exploit the specific block-sparse and low-rank structures generated in the Multiple Shooting and Newton-Picard approaches. To this end, we first regroup the \(n_{1}\) PDE variables and \(n_{2}\) non-PDE variables of NLP (3) in the order

$$\displaystyle{(\boldsymbol{u}^{0},\mathop{\ldots },\boldsymbol{u}^{n_{\text{MS}}}\,\vert \,\boldsymbol{v}^{0},\mathop{\ldots },\boldsymbol{v}^{n_{\text{MS}}},\boldsymbol{q}^{0},\mathop{\ldots },\boldsymbol{q}^{n_{\text{MS}}}) = (x_{ 1}\,\vert \,x_{2}) \in \mathbb{R}^{n_{1}+n_{2} }.}$$

Then we observe that we can also divide the constraints of NLP (3) to obtain the structured NLP form

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{(x_{1},x_{2})\in \mathbb{R}^{n_{1}+n_{2}}}\ \qquad f(x_{1},x_{2})& &{}\end{array}$$
(10a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \qquad g_{i}(x_{1},x_{2}) = 0,\quad i \in \mathcal{E}_{1},& &{}\end{array}$$
(10b)
$$\displaystyle\begin{array}{rcl} g_{i}(x_{1},x_{2}) = 0,\quad i \in \mathcal{E}_{2},& &{}\end{array}$$
(10c)
$$\displaystyle\begin{array}{rcl} g_{i}(x_{1},x_{2}) \geq 0,\quad i \in \mathcal{I},& &{}\end{array}$$
(10d)

where \(\vert \mathcal{E}_{1}\vert = n_{1}\). The constraints (10b) comprise the PDE boundary constraints (3b) and PDE matching conditions (3c).

5.1 Multiple Shooting Structure Exploitation

The QP constraint matrix C has the form

$$\displaystyle\begin{array}{rcl} C& =& \left (\begin{array}{*{10}c} C_{11} & C_{12} \\ C_{21} & C_{22} \\ C_{31} & C_{32} \end{array} \right ) {}\\ & =& \left (\begin{array}{cccc|cccccccc} - \mathbb{I}_{n_{u}} & & & R_{uu}^{\mathrm{b}} & & & & R_{uv}^{\mathrm{b}} \\ G_{u}^{1} & - \mathbb{I}_{n_{u}} & & & G_{v}^{1} & & & & G_{q}^{1}\\ & \ddots & \ddots & & & \ddots & & & & \ddots \\ & & G_{u}^{n_{\text{MS}}} & - \mathbb{I}_{ n_{u}} & & & G_{v}^{n_{\text{MS}}} & & && G_{ q}^{n_{\text{MS}}} \\ \hline & & & R_{vu}^{\mathrm{b}} & - \mathbb{I}_{n_{v}} & & & R_{vv}^{\mathrm{b}} \\ H_{u}^{1} & & & & H_{v}^{1} & - \mathbb{I}_{n_{v}} & & & H_{q}^{1}\\ & \ddots & & & & \ddots & \ddots & & & \ddots \\ & & H_{u}^{n_{\text{MS}}}& & & & H_{ v}^{n_{\text{MS}}} & - \mathbb{I}_{ n_{v}} & & & H_{q}^{n_{\text{MS}}} \\ & & & & & & & & && \mathbb{I}_{n_{q}} & - \mathbb{I}_{n_{q}} \\ \hline & & & & R_{v}^{\mathrm{i},1} & & & & R_{q}^{\mathrm{i},1}\\ & & & & & \ddots & & & & \ddots \\ & & & & & & R_{v}^{\mathrm{i},n_{\text{MS}}}& & &&R_{ q}^{\mathrm{i},n_{\text{MS}}} \\ & & & & & & &R_{v}^{\mathrm{e},n_{\text{MS}}} \end{array} \right ),{}\\ \end{array}$$

with the derivative abbreviations

$$\displaystyle\begin{array}{rcl} R_{uu}^{\mathrm{b}}& =& \frac{\partial \boldsymbol{r}_{u}^{\mathrm{b}}} {\partial \boldsymbol{u}^{n_{\text{MS}}}},\quad R_{uv}^{\mathrm{b}} = \frac{\partial \boldsymbol{r}_{u}^{\mathrm{b}}} {\partial \boldsymbol{v}^{n_{\text{MS}}}},\quad R_{vu}^{\mathrm{b}} = \frac{\partial \boldsymbol{r}_{v}^{\mathrm{b}}} {\partial \boldsymbol{u}^{n_{\text{MS}}}},\quad R_{vv}^{\mathrm{b}} = \frac{\partial \boldsymbol{r}_{v}^{\mathrm{b}}} {\partial \boldsymbol{v}^{n_{\text{MS}}}},\quad R^{\mathrm{e}} = \frac{\partial r^{\mathrm{e}}} {\partial \boldsymbol{v}^{n_{\text{MS}}}}, {}\\ G_{q}^{i}& =& \frac{\partial \overline{\boldsymbol{u}}^{i}} {\partial \boldsymbol{q}^{i-1}},\quad G_{u}^{i} = \frac{\partial \overline{\boldsymbol{u}}^{i}} {\partial \boldsymbol{u}^{i-1}},\quad G_{v}^{i} = \frac{\partial \overline{\boldsymbol{u}}^{i}} {\partial \boldsymbol{v}^{i-1}},\quad R_{q}^{\mathrm{i},i} = \frac{\partial \boldsymbol{r}^{\mathrm{i}}} {\partial \boldsymbol{q}^{i-1}}, {}\\ H_{q}^{i}& =& \frac{\partial \overline{\boldsymbol{v}}^{i}} {\partial \boldsymbol{q}^{i-1}},\quad H_{u}^{i} = \frac{\partial \overline{\boldsymbol{v}}^{i}} {\partial \boldsymbol{u}^{i-1}},\quad H_{v}^{i} = \frac{\partial \overline{\boldsymbol{v}}^{i}} {\partial \boldsymbol{v}^{i-1}},\quad R_{v}^{\mathrm{i},i} = \frac{\partial \boldsymbol{r}^{\mathrm{i}}} {\partial \boldsymbol{v}^{i-1}}, {}\\ \end{array}$$

at the current iterate \(z^{k}\). We want to stress that, contrary to appearance, the block \(C_{11}\) is several orders of magnitude larger than the blocks \(C_{22}\) and \(C_{32}\) on fine spatial discretization levels.

Let \(M_{B} = \mathbb{I}_{n_{u}} - (\prod _{i=1}^{n_{\text{MS}}}G_{ u}^{i})R_{ uu}^{\mathrm{b}}\). It is easy to see that if \(M_{B}\) is invertible, then \(C_{11}\) is also invertible and its inverse is given by

$$\displaystyle\begin{array}{rcl} C_{11}^{-1} = \left (\begin{array}{*{10}c} -\mathbb{I}& & & -(\prod _{i=1}^{0}G_{u}^{i})R_{uu}^{\mathrm{b}}\\ & \ddots & & \vdots \\ & & -\mathbb{I}&-(\prod _{i=1}^{n_{\text{MS}}-1 }G_{u}^{i})R_{uu}^{\mathrm{b}} \\ & & & -\mathbb{I} \end{array} \right )\left (\begin{array}{*{10}c} \mathbb{I}\\ &\ddots\\ &&\mathbb{I} \\ & & &M_{B}^{-1} \end{array} \right )\left (\begin{array}{*{10}c} \mathbb{I} \\ \prod _{i=1}^{1}G_{u}^{i} & \mathbb{I}\\ \vdots & \ddots & \ddots \\ \prod _{i=1}^{n_{\text{MS}}}G_{ u}^{i}&\cdots &\prod _{ i=n_{\text{MS}}}^{n_{\text{MS}}}G_{ u}^{i}&\mathbb{I} \\ \end{array} \right ).& & {}\\ \end{array}$$

This decomposition into block-triangular matrices yields a procedural recipe for evaluating matrix-vector products with \(C_{11}^{-1}\) that does not require explicitly forming either \(C_{11}\) or any part of \(C_{11}^{-1}\), except for the \(M_{B}^{-1}\) block. In the following, we shall see that a two-grid Newton-Picard approximation of \(C_{11}\) also enables us to use a simple procedural form for matrix-vector products with the approximation of \(M_{B}^{-1}\).
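A Python sketch of this recipe, formulated entirely in terms of user-supplied matrix-vector callbacks (our naming); `MB_solve` stands for the solve with \(M_{B}\), which Sect. 5.2 will in turn reduce to the coarse grid.

```python
def apply_C11_inv(Gu_mv, Ruu_mv, MB_solve, w):
    """y = C_11^{-1} w via the block-triangular factorization above.

    Gu_mv[i](x) : matvec with G_u^{i+1}, i = 0, ..., n_MS - 1
    Ruu_mv(x)   : matvec with R_uu^b
    MB_solve(x) : solve with M_B
    w           : list of the n_MS + 1 block vectors of w
    """
    n = len(w) - 1
    z = [w[0]]
    for j in range(1, n + 1):            # lower block-triangular factor
        z.append(Gu_mv[j - 1](z[-1]) + w[j])
    z[n] = MB_solve(z[n])                # block-diagonal middle factor
    y, p = [], Ruu_mv(z[n])
    for j in range(n):                   # upper block-triangular factor
        y.append(-z[j] - p)
        p = Gu_mv[j](p)
    y.append(-z[n])
    return y
```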

5.2 Two-Grid Newton-Picard Structure Exploitation

To this end, we need to establish grid transfer operators between a coarse and a fine grid. For two grid levels \(\ell_{\mathrm{c}} \leq \ell_{\mathrm{f}}\), we denote the numbers of PDE-state degrees of freedom by \(n_{\mathrm{c}} = n_{u}^{\ell_{\mathrm{c}}}\) and \(n_{\mathrm{f}} = n_{u}^{\ell_{\mathrm{f}}}\). Typically, \(n_{\mathrm{c}} \approx 100\), while \(n_{\mathrm{f}} > 10{,}000\). We then assume that the prolongation operator \(P \in \mathbb{R}^{n_{\mathrm{f}}\times n_{\mathrm{c}}}\) and the restriction operator \(R \in \mathbb{R}^{n_{\mathrm{c}}\times n_{\mathrm{f}}}\) satisfy

$$\displaystyle{ RP = \mathbb{I}_{n_{\mathrm{c}}}. }$$
(11)

On nested grids, it is rather simple to construct P via interpolation. The restriction of \(v_{\mathrm{f}} \in \mathbb{R}^{n_{\mathrm{f}}}\) can then be computed as the unique coarse grid vector \(v_{\mathrm{c}} \in \mathbb{R}^{n_{\mathrm{c}}}\) whose prolongation minimizes the \(L^{2}\)-distance to \(v_{\mathrm{f}}\) on the fine grid. In case of a variational spatial discretization, we can use the coarse and fine grid mass matrices \(M_{\mathrm{c}} \in \mathbb{R}^{n_{\mathrm{c}}\times n_{\mathrm{c}}}\) and \(M_{\mathrm{f}} \in \mathbb{R}^{n_{\mathrm{f}}\times n_{\mathrm{f}}}\) in order to obtain the form

$$\displaystyle{R = M_{\mathrm{c}}^{-1}P^{\mathrm{T}}M_{\mathrm{ f}},}$$

which satisfies (11) due to \(P^{\mathrm{T}}M_{\mathrm{f}}P = M_{\mathrm{c}}\). Again, the matrix R should not be formed explicitly if \(M_{\mathrm{c}}^{-1}\) is a dense matrix.
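For concreteness, the following self-contained sketch builds P, \(M_{\mathrm{c}}\), and \(M_{\mathrm{f}}\) for nested one-dimensional P1 Finite Element grids (a simple setting chosen for illustration) and verifies property (11) numerically; the dense inverse of \(M_{\mathrm{c}}\) is avoided by a solve.

```python
import numpy as np

def mass_1d(n):
    """P1 finite element mass matrix on [0, 1] with n elements (n + 1 nodes)."""
    h, M = 1.0 / n, np.zeros((n + 1, n + 1))
    for e in range(n):  # element mass matrix h/6 * [[2, 1], [1, 2]]
        M[e:e + 2, e:e + 2] += h / 6.0 * np.array([[2.0, 1.0], [1.0, 2.0]])
    return M

def prolongation_1d(nc):
    """Linear interpolation from nc + 1 coarse nodes to 2 nc + 1 fine nodes."""
    P = np.zeros((2 * nc + 1, nc + 1))
    for j in range(nc + 1):
        P[2 * j, j] = 1.0               # coincident nodes
    for j in range(nc):
        P[2 * j + 1, j:j + 2] = 0.5     # midpoints
    return P

nc = 8
P = prolongation_1d(nc)
Mc, Mf = mass_1d(nc), mass_1d(2 * nc)
R = np.linalg.solve(Mc, P.T @ Mf)           # R = Mc^{-1} P^T Mf via a solve
assert np.allclose(R @ P, np.eye(nc + 1))   # property (11): R P = I
```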

We then approximate the blocks of C via a two-grid approach. To this end, we denote all matrices on the coarse grid with hats (\(\hat{\ }\)) and approximate C by a matrix \(\tilde{C}\) of the same block pattern, but consisting of the approximated blocks

$$\displaystyle\begin{array}{rcl} \tilde{R}_{uu}^{\mathrm{b}} = P \hat{R} _{ uu}^{\mathrm{b}}R,\qquad \tilde{R} _{ uv}^{\mathrm{b}} = P \hat{R} _{ uv}^{\mathrm{b}},\qquad \tilde{R} _{ vu}^{\mathrm{b}} = \hat{R} _{ vu}^{\mathrm{b}}R,\qquad \tilde{R} _{ vv}^{\mathrm{b}} = \hat{R} _{ vv}^{\mathrm{b}},& &{}\end{array}$$
(12a)
$$\displaystyle\begin{array}{rcl} \tilde{G}_{q}^{i} = P \hat{G} _{ q}^{i},\qquad \tilde{G} _{ u}^{i} = P \hat{G} _{ u}^{i}R,\qquad \tilde{G} _{ v}^{i} = P \hat{G} _{ v}^{i},& &{}\end{array}$$
(12b)
$$\displaystyle\begin{array}{rcl} \tilde{H}_{q}^{i} = \hat{H} _{ q}^{i},\qquad \tilde{H} _{ u}^{i} = \hat{H} _{ u}^{i}R,\qquad \tilde{H} _{ v}^{i} = \hat{H} _{ v}^{i},& &{}\end{array}$$
(12c)
$$\displaystyle\begin{array}{rcl} \tilde{R}_{q}^{\mathrm{i},i} = \hat{R} _{ q}^{\mathrm{i},i},\qquad \tilde{R} _{ v}^{\mathrm{i},i} = \hat{R} _{ v}^{\mathrm{i},i},\qquad \tilde{R} ^{\mathrm{e}} = \hat{R} ^{\mathrm{e}}.& &{}\end{array}$$
(12d)

Along the same lines, we use the approximations

$$\displaystyle\begin{array}{rcl} \hat{G}_{B}&:=& \left (\prod _{i=1}^{n_{\text{MS}}}\hat{G} _{ u}^{i}\right )\hat{R}_{ uu}^{\mathrm{b}},\qquad \hat{M} _{ B}:= \mathbb{I}_{n_{\mathrm{c}}} -\hat{G} _{B}, {}\\ \tilde{G}_{B}&:=& \left (\prod _{i=1}^{n_{\text{MS}}}\tilde{G} _{ u}^{i}\right )\tilde{R}_{ uu}^{\mathrm{b}},\qquad \tilde{M} _{ B}:= \mathbb{I}_{n_{\mathrm{f}}} -\tilde{G} _{B}. {}\\ \end{array}$$

We can then observe that if \(\hat{M} _{B}\) is invertible, then \(\tilde{M} _{B}\) is also invertible and we obtain the procedural recipe [17, Lemma 7.2]

$$\displaystyle{\tilde{M} _{B}^{-1} = \left (\mathbb{I}_{ n_{\mathrm{f}}} - P \hat{G} _{B}R\right )^{-1} = \mathbb{I}_{ n_{\mathrm{f}}} - P\left (\mathbb{I}_{n_{\mathrm{c}}} -\hat{M} _{B}^{-1}\right )R.}$$

Thus, we do not need to explicitly form any matrix but \(\hat{M}_{B}\) on the coarse grid in order to evaluate matrix-vector products with \(\tilde{M}_{B}^{-1}\) and thus also with \(\tilde{C}_{11}^{-1}\). Furthermore, this form of \(\tilde{M} _{B}^{-1}\) allows for an even more concise representation of \(\tilde{C}_{11}^{-1}\). To this end, we define the projectors

$$\displaystyle\begin{array}{rcl} \varPi ^{\mathrm{slow}} = \mathbb{I}_{ n_{\text{MS}}} \otimes (PR),\quad \varPi ^{\mathrm{fast}} = \mathbb{I}_{ n_{\text{MS}}n_{\mathrm{f}}} -\varPi ^{\mathrm{slow}},& & {}\\ \end{array}$$

where ⊗ denotes the Kronecker product of matrices. Thus \(\varPi ^{\mathrm{slow}}\) is a block-diagonal matrix of \(n_{\text{MS}}\) blocks PR. Then, we can show [17, Theorem 7.3] that

$$\displaystyle\begin{array}{rcl} \tilde{C}_{11}^{-1}\varPi ^{\mathrm{slow}} = (\mathbb{I}_{ n_{\text{MS}}} \otimes P)\hat{C} _{11}^{-1}(\mathbb{I}_{ n_{\text{MS}}} \otimes R),\quad \tilde{C} _{11}^{-1}\varPi ^{\mathrm{fast}} = -\varPi ^{\mathrm{fast}}.& & {}\\ \end{array}$$

This equation illustrates the name Newton-Picard: On the slowly converging modes, we approximate the inverse of \(C_{11}\) by its coarse grid counterpart and perform a Newton iteration on this subspace, while we only use a Picard (or fixed point) iteration on the fast converging modes by approximating the Jacobian with a negative identity. Moreover, we can now easily derive [17, Corollary 7.4] that the Newton-Picard approximation \(\tilde{C}_{11}\) of the block \(C_{11}\), if invertible, has the inverse

$$\displaystyle{\tilde{C} _{11}^{-1} = (\mathbb{I}_{ n_{\text{MS}}} \otimes P)\left (\hat{C} _{11}^{-1} + \mathbb{I}_{ n_{\text{MS}}n_{\mathrm{c}}}\right )(\mathbb{I}_{n_{\text{MS}}} \otimes R) - \mathbb{I}_{n_{\text{MS}}n_{\mathrm{f}}}.}$$
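Both recipes are one-liners once coarse-grid solves are available. A sketch (our naming; `MB_hat_solve` and `C11_hat_solve` stand for pre-factorized coarse-grid solves):

```python
def MB_tilde_solve(MB_hat_solve, P, R, w):
    """Solve with M_tilde_B via M_tilde_B^{-1} = I - P (I - M_hat_B^{-1}) R."""
    Rw = R @ w
    return w - P @ (Rw - MB_hat_solve(Rw))

def C11_tilde_inv_apply(C11_hat_solve, P, R, w_blocks):
    """Apply C_tilde_11^{-1} = (I x P)(C_hat_11^{-1} + I)(I x R) - I blockwise.

    C11_hat_solve maps a list of coarse shooting blocks to C_hat_11^{-1}
    times that block vector.
    """
    w_c = [R @ w for w in w_blocks]          # restrict every shooting block
    y_c = C11_hat_solve(w_c)                 # coarse-grid solve with C_hat_11
    return [P @ (yc + wc) - w for yc, wc, w in zip(y_c, w_c, w_blocks)]
```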

5.3 QP Condensing

We now consider again QPs with a structure inherited from NLP (10)

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{(x_{1},x_{2})\in \mathbb{R}^{n_{1}+n_{2}}}\qquad \frac{1} {2}\left (\begin{array}{*{10}c} x_{1} \\ x_{2}\end{array} \right )^{\mathrm{T}}\left (\begin{array}{*{10}c} B_{11} & B_{12} \\ B_{21} & B_{22} \end{array} \right )\left (\begin{array}{*{10}c} x_{1} \\ x_{2}\end{array} \right ) + \left (\begin{array}{*{10}c} b_{1} \\ b_{2} \end{array} \right )^{\mathrm{T}}\left (\begin{array}{*{10}c} x_{1} \\ x_{2}\end{array} \right )& &{}\end{array}$$
(13a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \qquad C_{11}x_{1} + C_{12}x_{2} = c_{1},& &{}\end{array}$$
(13b)
$$\displaystyle\begin{array}{rcl} C_{21}x_{1} + C_{22}x_{2} = c_{2},& &{}\end{array}$$
(13c)
$$\displaystyle\begin{array}{rcl} C_{31}x_{1} + C_{32}x_{2} \geq c_{3}.& &{}\end{array}$$
(13d)

We show that we can employ a partial null-space approach called condensing in order to efficiently solve the large-scale, structured QP (13). Condensing is the key linear algebra ingredient for an efficient Direct Multiple Shooting method (see, e.g., [6]), but must be extended to also exploit the two-grid Newton-Picard structures. For this purpose, we prefer the following presentation of the concept of condensing.

Theorem 1 (Condensing)

Assume that C 11 in QP  (13) is invertible and define

$$\displaystyle\begin{array}{rcl} Z& =& \left (\begin{array}{*{10}c} -C_{11}^{-1}C_{12} \\ \mathbb{I}_{n_{2}} \end{array} \right ),\quad B^{{\prime}} = Z^{\mathrm{T}}BZ, {}\\ c_{1}^{{\prime}}& =& C_{ 11}^{-1}c_{ 1},\quad b^{{\prime}} = B_{ 21}c_{1}^{{\prime}} + b_{ 2} - C_{12}^{\mathrm{T}}C_{ 11}^{\mathrm{-T}}(B_{ 11}c_{1}^{{\prime}} + b_{ 1}), {}\\ c_{2}^{{\prime}}& =& c_{ 2} - C_{21}c_{1}^{{\prime}},\quad C_{ 2}^{{\prime}} = C_{ 22} - C_{21}C_{11}^{-1}C_{ 12}, {}\\ c_{3}^{{\prime}}& =& c_{ 3} - C_{31}c_{1}^{{\prime}},\quad C_{ 3}^{{\prime}} = C_{ 32} - C_{31}C_{11}^{-1}C_{ 12}. {}\\ \end{array}$$

Let furthermore \((x_{2}^{{\ast}},y_{\mathcal{E}_{2}}^{{\ast}},y_{\mathcal{I}}^{{\ast}}) \in \mathbb{R}^{n_{2}+m_{2}+m_{3}}\) be a primal-dual solution of the QP

$$\displaystyle{ \mathop{\mathrm{minimize}}\limits _{x_{2}\in \mathbb{R}^{n_{2}}}\quad \frac{1} {2}x_{2}^{\mathrm{T}}B^{{\prime}}x_{ 2} + b^{{\prime}\mathrm{T}}x_{ 2}\quad \mathop{ \mathrm{s.t.}}\limits \quad C_{2}^{{\prime}}x_{ 2} = c_{2}^{{\prime}},\quad C_{ 3}^{{\prime}}x_{ 2} \geq c_{3}^{{\prime}}. }$$
(14)

If we choose

$$\displaystyle\begin{array}{rcl} x_{1}^{{\ast}}& =& C_{ 11}^{-1}(c_{ 1} - C_{12}x_{2}^{{\ast}}),{}\end{array}$$
(15a)
$$\displaystyle\begin{array}{rcl} y_{\mathcal{E}_{1}}^{{\ast}} = C_{ 11}^{\mathrm{-T}}\left ((B_{ 12} - B_{11}C_{11}^{-1}C_{ 12})x_{2}^{{\ast}} + B_{ 11}c_{1}^{{\prime}} + b_{ 1} - C_{21}^{\mathrm{T}}y_{ \mathcal{E}_{2}}^{{\ast}}- C_{ 31}^{\mathrm{T}}y_{ \mathcal{I}}^{{\ast}}\right )& &{}\end{array}$$
(15b)

then \((x^{{\ast}},y^{{\ast}}):= (x_{1}^{{\ast}},x_{2}^{{\ast}},y_{\mathcal{E}_{1}}^{{\ast}},y_{\mathcal{E}_{2}}^{{\ast}},y_{\mathcal{I}}^{{\ast}})\) is a primal-dual solution of QP  (13) .

Proof

See, e.g., [17, Theorem 7.6].

Theorem 1 shows how the solution of the structured, large-scale QP (13) can be boiled down to a medium-scale QP (14), which only contains the non-PDE variables \(x_{2}\). The pre- and post-processing steps only require matrix-vector products with the large-scale blocks, e.g., \(n_{2}\) matrix-vector products for the computation of \(C_{11}^{-1}C_{12}\).
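For illustration, a dense Python sketch of the pre-processing step of Theorem 1 on explicitly stored blocks (our naming); the actual implementation never forms \(C_{11}^{-1}\) but evaluates the required products with the procedural recipes of Sects. 5.1 and 5.2.

```python
import numpy as np

def condense(B, b, C, c, n1):
    """Build the condensed QP (14) from QP (13); blocks partitioned at n1.

    B, b : Hessian and gradient of (13a)
    C    : dict with blocks C11, C12, C21, C22, C31, C32
    c    : dict with right-hand sides c1, c2, c3
    """
    C11, C12, C21, C31 = C["C11"], C["C12"], C["C21"], C["C31"]
    X = np.linalg.solve(C11, C12)                 # n2 solves for C11^{-1} C12
    c1p = np.linalg.solve(C11, c["c1"])           # c1' = C11^{-1} c1
    Z = np.vstack([-X, np.eye(C12.shape[1])])     # partial null-space basis
    Bp = Z.T @ B @ Z                              # condensed Hessian B'
    B11, B21 = B[:n1, :n1], B[n1:, :n1]
    b1, b2 = b[:n1], b[n1:]
    bp = B21 @ c1p + b2 - C12.T @ np.linalg.solve(C11.T, B11 @ c1p + b1)
    C2p, c2p = C["C22"] - C21 @ X, c["c2"] - C21 @ c1p
    C3p, c3p = C["C32"] - C31 @ X, c["c3"] - C31 @ c1p
    return Bp, bp, C2p, c2p, C3p, c3p
```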

5.4 Two-Grid Hessian Approximation

If we now use the approximated version of QP (13), we see that based on Sects. 5.1 and 5.2, the partial null-space basis can be evaluated purely on the coarse grid due to

$$\displaystyle{\tilde{Z} = \left (\begin{array}{*{10}c} -\tilde{C} _{11}^{-1}\tilde{C} _{ 12} \\ \mathbb{I} \end{array} \right ) = \left (\begin{array}{*{10}c} -(\mathbb{I} \otimes P)\hat{C} _{11}^{-1}(\mathbb{I} \otimes R)(\mathbb{I} \otimes P)\hat{C} _{ 12} \\ \mathbb{I} \end{array} \right ) = \left (\begin{array}{*{10}c} -(\mathbb{I} \otimes P)\hat{C} _{11}^{-1}\hat{C} _{ 12} \\ \mathbb{I} \end{array} \right ).}$$

This observation suggests a projected Newton-Picard approximation of the Hessian matrix via

$$\displaystyle\begin{array}{rcl} \tilde{B}^{\mathrm{fast}}& =& \left (\begin{array}{*{10}c} (\varPi ^{\mathrm{fast}})^{\mathrm{T}}B_{11}\varPi ^{\mathrm{fast}} & 0 \\ 0 &0 \end{array} \right ), {}\\ \tilde{B}^{\mathrm{slow}}& =& \left (\begin{array}{*{10}c} (\mathbb{I}_{n_{\text{MS}}} \otimes R)^{\mathrm{T}}\hat{B} _{ 11}(\mathbb{I}_{n_{\text{MS}}} \otimes R)&(\mathbb{I}_{n_{\text{MS}}} \otimes R)^{\mathrm{T}}\hat{B} _{ 12} \\ \hat{B} _{21}(\mathbb{I}_{n_{ \text{MS}}} \otimes R) & \hat{B} _{22} \end{array} \right ), {}\\ \tilde{B}& =& \tilde{B}^{\mathrm{fast}} + \tilde{B} ^{\mathrm{slow}}. {}\\ \end{array}$$

Consequently, we have

$$\displaystyle{\tilde{Z} ^{\mathrm{T}}\tilde{B} ^{\mathrm{fast}}\tilde{Z} = 0}$$

and thus we can also compute the condensed two-grid Newton-Picard Hessian matrix purely on the coarse grid according to

$$\displaystyle{\tilde{B}^{{\prime}} = \tilde{Z} ^{\mathrm{T}}\tilde{B}\tilde{Z} = \hat{Z} ^{\mathrm{T}}\hat{B}\hat{Z} \quad \text{with }\hat{Z} = \left (\begin{array}{*{10}c} -\hat{C} _{11}^{-1}\hat{C} _{12} \\ \mathbb{I}_{n_{\mathrm{c}}} \end{array} \right ).}$$

Thus, it is possible to set up QP (14) efficiently with only a few fine-grid operations.

6 Numerical Examples

We now report on numerical results for Direct Multiple Shooting applied to tracking problems for an advection-diffusion equation in 2D and a bacterial chemotaxis example in 1D. Further examples can be found in [17], including a real-world separation process from chemical engineering with economic objective and constraint functions. All computations were performed on a Linux workstation with four 2.67 GHz Intel Core i7 cores and 24 GB of RAM.

6.1 Advection-Diffusion Equation

We consider an advection-diffusion equation on \(\varOmega \subset (-1,1)^{2}\), whose boundary ∂ Ω is partitioned into disjoint sets \(\varGamma _{i}\), i = 0, 1, 2, 3. Let ν denote the outwards pointing normal on ∂ Ω and let \(\boldsymbol{U} \in L^{2}(\varOmega )^{2}\) be a divergence-free velocity field. For \(\gamma,\varepsilon,T > 0\), inflow profiles \(u_{i}^{\mathrm{in}} \in L^{2}(\varGamma _{i})\), i = 1, 2, and a desired profile \(\hat{u}\in L^{2}(\varOmega )\), we apply the proposed method to the following periodic tracking-type boundary-control problem with control gradient constraints

$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{\begin{array}{c}u\in W(0,T) \\ q\in H^{1}(0,T)^{2}\end{array}}\qquad \frac{1} {2}\left \Vert u(T,.) -\hat{u} \right \Vert _{L^{2}(\varOmega )}^{2} + \frac{\gamma } {2}\big(\left \Vert q_{1}\right \Vert _{H^{1}(0,T)}^{2} + \left \Vert q_{ 2}\right \Vert _{H^{1}(0,T)}^{2}\big){}\end{array}$$
(16a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \qquad \partial _{t}u =\varepsilon \nabla \cdot \nabla u -\boldsymbol{ U} \cdot \nabla u,\qquad \text{in }\varOmega,& &{}\end{array}$$
(16b)
$$\displaystyle\begin{array}{rcl} u(0,.) = u(T,.),\qquad \text{in }\varOmega,& &{}\end{array}$$
(16c)
$$\displaystyle\begin{array}{rcl} \nu \cdot \left (\varepsilon \nabla u -\boldsymbol{ U}u\right ) = 0,\qquad \text{on }\varGamma _{0},& &{}\end{array}$$
(16d)
$$\displaystyle\begin{array}{rcl} \nu \cdot \left (\varepsilon \nabla u -\boldsymbol{ U}u\right ) = u_{i}^{\mathrm{in}}q_{ i} - u,\qquad \text{on }\varGamma _{i},i = 1,2,& &{}\end{array}$$
(16e)
$$\displaystyle\begin{array}{rcl} \nu \cdot \left (\varepsilon \nabla u -\boldsymbol{ U}u\right ) = -u,\qquad \text{on }\varGamma _{3},& &{}\end{array}$$
(16f)
$$\displaystyle\begin{array}{rcl} q_{i}(0) = q_{i}(T),& & \text{for }i = 1,2,{}\end{array}$$
(16g)
$$\displaystyle\begin{array}{rcl} q_{i}(t) \geq 0,& & \text{for }t \in [0,T],i = 1,2,{}\end{array}$$
(16h)
$$\displaystyle\begin{array}{rcl} \frac{\mathrm{d}q_{i}} {\mathrm{d}t} (t) \in [-20,20],& & \text{for a.a. }t \in [0,T],i = 1,2.{}\end{array}$$
(16i)

The setting \(q_{i} \in H^{1}(0,T)\) can be reformulated with \(L^{2}(0,T)\) controls by introducing ODE states \(v_{i}\), whose time derivatives we control, and adequately adapting all occurrences of \(q_{i}\) in (16).
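A sketch of this standard reformulation, writing \(w_{i}\) for the new \(L^{2}(0,T)\) control (a name introduced here for illustration only):

$$\displaystyle{q_{i} = v_{i},\qquad \dot{v}_{i} = w_{i},\qquad \left \Vert q_{i}\right \Vert _{H^{1}(0,T)}^{2} =\int _{ 0}^{T}\left (v_{ i}(t)^{2} + w_{ i}(t)^{2}\right )\mathrm{d}t,}$$

so that the gradient constraint (16i) turns into the control bounds \(w_{i}(t) \in [-20,20]\) and the periodicity constraint (16g) becomes a boundary constraint \(v_{i}(0) = v_{i}(T)\) of the form (1d).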

We obtain a suitable velocity field \(\boldsymbol{U}\) as the solution of a Stokes problem with parabolic inflow velocities on \(\varGamma _{1} =\{ -1\} \times (-1,-1/3)\) and \(\varGamma _{2} =\{ 1\} \times (-1,-1/3)\) and free outflow velocity on \(\varGamma _{3} = (-1/3,1/3) \times \{ 1\}\) (compare Fig. 1)

$$\displaystyle\begin{array}{rcl} -\nabla \cdot \nabla \boldsymbol{U} + \nabla p& =& 0,\quad \text{in }\varOmega, {}\\ \nabla \cdot \boldsymbol{ U}& =& 0,\quad \text{in }\varOmega, {}\\ \boldsymbol{U}(x)& =& 0,\quad \text{for }x \in \varGamma _{0}, {}\\ \boldsymbol{U}(x)& =& (9(x_{2} + 1/3)(x_{2} + 1),0),\quad \text{for }x \in \varGamma _{1}, {}\\ \boldsymbol{U}(x)& =& (-9(x_{2} + 1/3)(x_{2} + 1),0),\quad \text{for }x \in \varGamma _{2}. {}\\ \end{array}$$

We use Finite Differences on equidistant, staggered grids for the discretization of the Stokes problem. This discretization is compatible with an upwind-flux Finite Volume method for the advection-diffusion equation, because the velocity degrees of freedom align with the center of the volume interfaces.

Fig. 1 Divergence-free velocity field obtained from the solution of a Stokes problem on an inverted-T-shaped domain with inflow from the left on \(\varGamma _{1}\), inflow from the right on \(\varGamma _{2}\) and outflow on top on \(\varGamma _{3}\)

We compute optimal controls for \(T = 10\), \(\hat{u}(x) = 1 + x_{1}\), \(u_{i}^{\mathrm{in}}(x) =\exp (-10(x_{2} + 2/3)^{2})\), i = 1, 2, and for varying values of \(\varepsilon = 10^{0},10^{-1},10^{-2},5 \cdot 10^{-3}\) and \(\gamma = 10^{-2},10^{-3}\) on \(n_{\text{MS}} = 20\) shooting intervals. We use a 5-level hierarchy of grids with mesh sizes h = 2∕15, 1∕15, 1∕30, 1∕60, 1∕120, giving rise to 125, 500, 2000, 8000, and 32,000 degrees of freedom on the respective levels. As initial values, we chose \(\boldsymbol{u}^{i} = 0,\boldsymbol{v}^{i} = (0.1,0.1)^{\mathrm{T}},\boldsymbol{q}^{i} = (0.01,0.01)^{\mathrm{T}}\) for \(i = 0,\mathop{\ldots },20\).

One purpose of treating an advection-diffusion problem here is to fathom the limits of the Newton-Picard approach, which exploits diffusion numerically. We can expect the problem to become more difficult to solve as we approach the hyperbolic limit \(\varepsilon \rightarrow 0\). Because the upwind discretization achieves numerical stability by introducing spurious numerical diffusion, the discretized problem exhibits more diffusion on coarser grid levels, where the effect of numerical diffusion is more pronounced. Thus, the efficiency of the diffusion-exploiting Newton-Picard preconditioners is slightly better for coarser fine grids. We also want to remark that problem (16) is a linear-quadratic problem. We tackle it with a nonlinear solver, and thus the reader should keep in mind that numerical linear algebra approaches (e.g., along the lines of Potschka et al. [20]) are more efficient, because they exploit linearity explicitly.

From the computational results in Table 1, we can observe the expected trend that the number of overall inexact SQP iterations grows for decreasing values of the diffusion coefficient \(\varepsilon\). The increase is higher for lower regularization parameters γ. The number of inexact SQP iterations with the fine grid on the finest level grows only slightly. For high diffusion \(\varepsilon = 10^{0}\), level \(\ell_{\mathrm{c}} = 1\) for the coarse grid is already enough for sufficient local contraction on all fine grid levels \(\ell_{\mathrm{f}} \leq 5\). For \(\varepsilon = 10^{-1}\), \(\ell_{\mathrm{c}} = 1\) is still sufficiently fine for the high regularization parameter γ = 10<sup>−2</sup>, but not for γ = 10<sup>−3</sup>. For \(\varepsilon = 10^{-2},5 \cdot 10^{-3}\), the κ-estimators trigger a refinement of the coarse level to \(\ell_{\mathrm{c}} = 2\) in order to obtain local contraction of the method on all fine grid levels \(\ell_{\mathrm{f}} \leq 5\). For \(\varepsilon = 10^{-3}\), the coarse grid needs to be refined to \(\ell_{\mathrm{c}} = 3\). In this case, the memory consumption of DAESOL-II on the coarse and fine grids exceeds the available RAM. This problem, however, is a momentary technical restriction of the software implementation and not a fundamental restriction of the method.

Table 1 The number of inexact SQP iterations and the computation times depend on the diffusion coefficient \(\varepsilon\) and the regularization parameter γ

We can also observe in Table 1 that the number of adaptive time steps for the initial value problem solver increases with decreasing diffusion coefficient \(\varepsilon\). From the number of required time steps we see that the time discretization has a higher resolution than the spatial discretization. The reason for this imbalance in accuracy lies in the stability requirements for the solution of the initial value problems in (3c) and of its first and second order derivatives, as discussed in [17, Chap. 9, Sect. 3]. If we take into account the intermediate time steps, we obtain on the finest level between \(2.1 \cdot 10^{7}\) and \(3.0 \cdot 10^{7}\) state variables, although NLP (3) really only has \(6.7 \cdot 10^{5}\) degrees of freedom in the PDE state variables \(\boldsymbol{u}^{i},i = 0,\mathop{\ldots },n_{\text{MS}}\) (independently of \(\varepsilon\)).

If we compare the serial simulation time for the numerical solution of the initial value problems for fixed initial values and fixed controls with the wall-clock time for the inexact SQP method parallelized on four cores, we obtain a ratio of 27–95. This is quite remarkable, because already the forward problem of finding a time-periodic solution for fixed controls would require several Newton-Picard iterations (in the sense of Lust et al. [13]).

For completeness, we depict the optimal control and the corresponding optimal final state for the case \(\varepsilon = 10^{-2}\) and γ = 10<sup>−3</sup> in Figs. 2 and 3. We can observe that most of the control action happens at \(\varGamma _{2}\), where high values of u are to be tracked. The control gradient constraint (16i) is active at the slopes of the initial peak of \(q_{2}\). Furthermore, the optimal state at the end of the period matches the desired profile \(\hat{u}\) well except for the area around the inflow boundary \(\varGamma _{2}\).

Fig. 2 The optimal controls for problem (16) with \(\varepsilon = 10^{-2},\gamma = 10^{-3}\) work with different magnitudes. The control gradient constraint is active for \(q_{2}\) at the slopes of the initial peak

Fig. 3 For \(\varepsilon = 10^{-2},\gamma = 10^{-3}\), the black isolines of the optimal final state u(T, .) match the gray isolines of the target well except for the south-east part of the domain. A boundary layer is visible at the northern outflow \(\varGamma _{3}\)

6.2 Bacterial Chemotaxis

On the basis of a chemotaxis model by Tyson et al. [26, 27], we consider on Ω = (0, 1) with \(\partial \varOmega =\varGamma _{1} \cup \varGamma _{2} =\{ 0\} \cup \{ 1\}\) the nonlinear tracking-type boundary control problem (compare also [12])

$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{\begin{array}{c}z,c\in W(0,1) \\ q_{1},q_{2}\in L^{2}(0,1)\end{array}}\quad \frac{1} {2}\int _{\varOmega }(z(1,\cdot ) -\hat{z} )^{2} + \frac{\gamma _{c}} {2}\int _{\varOmega }(c(1,\cdot ) -\hat{c} )^{2} + \frac{\gamma _{q}} {2}\int _{0}^{1}\left (q_{ 1}^{2} + q_{ 2}^{2}\right ){}\end{array}$$
(17a)
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{s.t.}}\limits \qquad \partial _{t}z = \nabla \cdot \left (D_{z}\nabla z -\alpha \frac{z} {(1 + c)^{2}}\nabla c\right ),\qquad \text{in }(0,1)\times \varOmega,& &{}\end{array}$$
(17b)
$$\displaystyle\begin{array}{rcl} \partial _{t}c = \nabla \cdot \nabla c + w \frac{z^{2}} {(\mu +z^{2})} -\rho c\qquad \text{in }(0,1)\times \varOmega,& &{}\end{array}$$
(17c)
$$\displaystyle\begin{array}{rcl} & & \partial _{\nu }z = \frac{\alpha \beta (q_{i} - c)z} {D_{z}(1 + c)^{2}},\quad \partial _{\nu }c =\beta (q_{i} - c),\qquad \text{in }(0,1) \times \varGamma _{i},i = 1,2,{}\end{array}$$
(17d)
$$\displaystyle\begin{array}{rcl} z(0,.) = z_{0},\quad c(0,.) = c_{0},& &{}\end{array}$$
(17e)
$$\displaystyle\begin{array}{rcl} q_{i}(t) \in [0,1],\qquad \text{for a.a. }t \in (0,1),& &{}\end{array}$$
(17f)

where \(\partial _{\nu }\) denotes the derivative in the direction of the outwards pointing normal ν on ∂Ω. The parameters considered here are \(D_{z} = 0.3\), α = 4, β = 1, w = 0.1, μ = 10, ρ = 0.1, \(\gamma _{c} = 10^{-2}\), \(\gamma _{q} = 10^{-5}\). The targets to be tracked are \(\hat{z}(x) = 2x\), \(\hat{c}(x) = 0\) and the initial states are given as \(z_{0}(x) = 1\), \(c_{0}(x) = 0\).

On the basis of a Finite Difference discretization on a six-level hierarchy of nested grids with 322, 642, 1282, 2562, 5122, and 10,242 degrees of freedom for the PDE states, we can use the proposed Direct Multiple Shooting method to obtain the optimal controls and states depicted in Fig. 4. Level \(\ell_{\mathrm{c}} = 1\) for the coarse grid is sufficient to yield good local contraction for the inexact SQP method. In the solution, DAESOL-II uses 889 adaptive integration steps for the solution of the initial value problems on all 20 shooting intervals together. In this case, the spatial and temporal discretization errors are well balanced, because a high-order method is used in time.

Fig. 4 Left: the desired cell distribution \(\hat{z}\) (dotted line) can be reached well with the optimal final cell distribution z(1, .) (solid line). The final chemoattractor concentration c(1, .) is almost affine-linear (dashed line). Right: the control bound constraints are active most of the time for \(q_{2}\). Not shown: the optimal control \(q_{1}\) is always 0

In Table 2 we sum up the cumulative CPU time needed for all computations concerning the initial value problems on each spatial discretization level. This is the main part of the computational burden; the solution of all the condensed QPs (14), for instance, only takes 0.6 s. We observe that most of the effort is spent on levels 1 and 6. The effort on level 1 is due to the coarse grid computations in each iteration. Due to DAESOL-II's memory requirements, we could not refine the fine grid further. Were we able to do so, the numerical effort spent on the then-finest grid level \(\ell= 7\) would dominate the overall effort even more.

Table 2 Cumulative CPU time in seconds for the chemotaxis example on each spatial discretization level for computations concerning the dynamic system, including system integration, forward and adjoint derivatives, and matrix-vector products with Hessian matrices

7 Conclusions

We elaborated on the challenges and a numerical approach for a Direct Multiple Shooting method for optimal control problems with coupled ODE and parabolic PDE constraints. Even difficult boundary conditions like time periodicity of the states can be treated efficiently in this framework. The large-scale nature of the resulting NLPs can be tamed by a structure exploiting two-grid Newton-Picard inexact SQP method. Encouraging numerical results indicate that challenging PDE constrained optimal control problems can be solved efficiently with a Direct Multiple Shooting method. The proposed method does not require the often time-consuming derivation of adjoint equations and can be easily parallelized. It is in particular suited for practitioners who need an accurate resolution of the system dynamics and can live with lower resolution of the optimal control.

We would like to thank the anonymous reviewer for the constructive comments on the initial manuscript. Support by the EU through S. Engell's and H.G. Bock's ERC Advanced Investigator Grant MOBOCON (291 458) is gratefully acknowledged.