
1 Preface

It was in the year 1988. My contract as a research assistant at the Sonderforschungsbereich 123 of the University of Heidelberg was about to expire and could not be prolonged. My supervisor at that time was W. Jäger. While I was searching for a new position, aiming at the possibility of earning a habilitation, he arranged a fellowship at the research center of IBM in Heidelberg. I seized this opportunity and signed a contract for 9 months. On the first day of the new job, I met two other colleagues starting on the same day. One of them had a permanent contract. The other one was Volker Mehrmann, who was on leave from the University of Bielefeld to spend the same 9 months at the research center of IBM. Our common head R. Janßen put the three of us into the same office. This was the beginning of Volker’s and my joint venture. I therefore want to express my sincere thanks to W. Jäger and R. Janßen for their support, which brought me into contact with Volker.

2 Introduction

Differential-algebraic equations (DAEs) arise in the modeling of physical systems that contain constraints restricting the possible states of the system. Moreover, in modern hierarchical modeling tools like [5], even if the submodels are ordinary differential equations (ODEs), the equations describing how the submodels are linked yield DAEs as overall models.

The general form of a DAE is given by

$$\displaystyle{ F(t,x,\dot{x}) = 0, }$$
(16.1)

with \(F \in C(\mathbb{I} \times \mathbb{D}_{x} \times \mathbb{D}_{\dot{x}}, \mathbb{R}^{m})\) sufficiently smooth, \(\mathbb{I} \subseteq \mathbb{R}\) a (compact) interval, and \(\mathbb{D}_{x}, \mathbb{D}_{\dot{x}} \subseteq \mathbb{R}^{n}\) open. In this paper, we will not assume any further structure of the equations. It should, however, be emphasized that additional structure should, if possible, be exploited in the numerical treatment when efficiency is an issue. On the other hand, a general approach is advantageous when it is desirable to have no restrictions on the applicability of the numerical procedure.

It is the aim of the present paper to give an overview of the relevant theory of general unstructured nonlinear DAEs with arbitrary index and its impact on the design of numerical techniques for their approximate solution. We will concentrate mainly on the square case, i.e., on the case m = n, but also address the overdetermined case m ≥ n, assuming consistency of the equations. The attractiveness of the latter case lies in the fact that we may add known properties of the solution, such as first integrals, to the system, thus enforcing that the generated numerical solution respects these properties as well. In the discussion of numerical techniques, we focus on two families of Runge-Kutta type one-step methods and the development of appropriate techniques for the solution of the arising nonlinear systems. Besides the mentioned issues of DAE techniques for treating first integrals, we include a discussion of numerical path following and turning point determination in the area of parametrized nonlinear equations, which can also be treated in the context of DAEs combined with root finding. Several examples demonstrate the performance of the presented numerical approaches.

The paper is organized as follows. In Sect. 16.3, we give an overview of the analysis of unstructured regular nonlinear DAEs of arbitrary index. In particular, we present existence and uniqueness results. We discuss how these results can be extended to overdetermined consistent DAEs, thus allowing for the treatment of known first integrals. Section 16.4 is then dedicated to various computational issues. We first present possible one-step methods and then develop Gauß-Newton-like processes for the treatment of the arising nonlinear systems, including a modification to stabilize the numerical solution. After some remarks on the use of automatic differentiation, we show how problems with first integrals and parametrized nonlinear equations can be treated in the context of DAEs. We close with some conclusions in Sect. 16.5.

3 Theory of Nonlinear DAEs

Dealing with nonlinear problems, the first step is to require a suitable kind of regularity. In the special case of an ODE \(\dot{x} = f(t,x)\), obviously no additional properties besides smoothness must be required to obtain (local) existence and uniqueness of solutions for the corresponding initial value problem. In the special case of a purely algebraic (parametrized) system F(t, x) = 0, the typical requirement is that \(F_{x}(t,x)\), the Jacobian of F with respect to x, is nonsingular for all relevant arguments. Regularity then corresponds to the applicability of the implicit function theorem, allowing us to (locally) solve for x in terms of t. In the general case of DAEs, we of course want to include these extreme cases in the definition of a regular problem. Moreover, we want to keep the conditions as weak as possible. The following example gives an idea of what the conditions for regularity should look like.

Example 1

The system

$$\displaystyle{\begin{array}{rl} \dot{x}_{1} & = x_{4},\quad \dot{x}_{4} = 2x_{1}x_{7},\qquad \qquad \qquad \qquad \\ \dot{x}_{2} & = x_{5},\quad \dot{x}_{5} = 2x_{2}x_{7}, \\ \dot{x}_{3} & = x_{6},\quad \dot{x}_{6} = -1 - x_{7}, \\ 0& = x_{3} - x_{1}^{2} - x_{2}^{2},\end{array} }$$

see [16], describes the movement of a mass point on a paraboloid under the influence of gravity.

Differentiating the constraint twice and eliminating the arising derivatives of the unknowns yields

$$\displaystyle{\begin{array}{rl} 0& = x_{6} - 2x_{1}x_{4} - 2x_{2}x_{5}, \\ 0& = -1 - x_{7} - 2x_{4}^{2} - 4x_{1}^{2}x_{7} - 2x_{5}^{2} - 4x_{2}^{2}x_{7}. \end{array} }$$

In particular, the three constraints collected so far can be solved for \(x_{3}\), \(x_{6}\), and \(x_{7}\) in terms of the other unknowns; eliminating them leaves ODEs for these other unknowns. Hence, we may replace the original problem by

$$\displaystyle{\begin{array}{rl} \dot{x}_{1} & = x_{4},\quad \dot{x}_{4} = 2x_{1}x_{7},\qquad \qquad \qquad \qquad \\ \dot{x}_{2} & = x_{5},\quad \dot{x}_{5} = 2x_{2}x_{7}, \\ 0& = x_{3} - x_{1}^{2} - x_{2}^{2}, \\ 0& = x_{6} - 2x_{1}x_{4} - 2x_{2}x_{5}, \\ 0& = -1 - x_{7} - 2x_{4}^{2} - 4x_{1}^{2}x_{7} - 2x_{5}^{2} - 4x_{2}^{2}x_{7}. \end{array} }$$

From this example, we deduce the following. The solution process may require differentiating part of the equations, so that the solution may depend on derivatives of the data. Without assuming structure, it is not known in advance which equations should be differentiated. By the differentiation process, we obtain additional constraints that must be satisfied by a solution.
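These differentiation and elimination steps can be reproduced mechanically. The following minimal sketch (Python with sympy; the variable names are ours and merely illustrative) recovers the two hidden constraints of Example 1:

```python
import sympy as sp

t = sp.symbols('t')
x1, x2, x3, x4, x5, x6, x7 = [sp.Function(f'x{i}')(t) for i in range(1, 8)]

# right-hand sides of the six differential equations of Example 1
rhs = {x1.diff(t): x4, x2.diff(t): x5, x3.diff(t): x6,
       x4.diff(t): 2*x1*x7, x5.diff(t): 2*x2*x7, x6.diff(t): -1 - x7}

g = x3 - x1**2 - x2**2                  # constraint 0 = x3 - x1^2 - x2^2
g1 = sp.expand(g.diff(t)).subs(rhs)     # first hidden constraint
g2 = sp.expand(g1.diff(t)).subs(rhs)    # second hidden constraint
print(sp.expand(g1))   # x6 - 2*x1*x4 - 2*x2*x5
print(sp.expand(g2))   # -1 - x7 - 2*x4**2 - 4*x1**2*x7 - 2*x5**2 - 4*x2**2*x7
```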

3.1 A Hypothesis

In order to include differentiated data, we follow an idea of Campbell, see [1], and define the so-called derivative array equations

$$\displaystyle{ F_{\ell}(t,x,\dot{x},\ddot{x},\ldots,x^{(\ell+1)}) = 0, }$$
(16.2)

where the functions \(F_{\ell} \in C(\mathbb{I} \times \mathbb{D}_{x} \times \mathbb{D}_{\dot{x}} \times \mathbb{R}^{n} \times \cdots \times \mathbb{R}^{n}, \mathbb{R}^{(\ell+1)m})\) are defined by stacking the original function F together with its formal time derivatives up to order \(\ell\), i.e.,

$$\displaystyle{ F_{\ell}(t,x,\dot{x},\ddot{x},\ldots,x^{(\ell+1)}) = \left [\begin{array}{c} F(t,x,\dot{x}) \\ \frac{d} {\mathit{dt}}F(t,x,\dot{x})\\ \vdots \\ ( \frac{d} {\mathit{dt}})^{\ell}F(t,x,\dot{x}) \end{array} \right ]. }$$
(16.3)

Jacobians of \(F_{\ell}\) with respect to selected variables x, y will be denoted by \(F_{\ell;x,y}\) in the following. A similar notation will be used for other functions.
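The stacking in (16.3) is equally mechanical. A small sketch (sympy again; the toy DAE and the helper name derivative_array are ours, chosen only for illustration):

```python
import sympy as sp

t = sp.symbols('t')
x1, x2 = sp.Function('x1')(t), sp.Function('x2')(t)

# toy DAE F(t, x, xdot) = 0 for illustration: xdot1 - x2 = 0, x1 - t = 0
F = sp.Matrix([x1.diff(t) - x2, x1 - t])

def derivative_array(F, ell):
    """Stack F with its formal time derivatives up to order ell, cf. (16.3)."""
    blocks = [F]
    for _ in range(ell):
        blocks.append(blocks[-1].diff(t))
    return sp.Matrix.vstack(*blocks)

F1 = derivative_array(F, 1)   # four equations: F and (d/dt) F
print(F1)
```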

The desired regularity condition should include that the original DAE implies a certain number of constraints, that these constraints are independent, and that a given initial value satisfying these constraints can always be extended to a local solution. In the case m = n, this leads to the following hypothesis.

Hypothesis 1

There exist (nonnegative) integers μ, a, and d such that the set

$$\displaystyle{ \qquad \mathbb{L}_{\mu } =\{ (t,x,y) \in \mathbb{R}^{(\mu +2)n+1}\mid F_{\mu }(t,x,y) = 0\} }$$
(16.4)

associated with F is nonempty and such that for every point \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\) , there exists a (sufficiently small) neighborhood  \(\mathbb{V}\) in which the following properties hold:

  1.

    We have \(\mathop{\mathrm{rank}}\nolimits F_{\mu;y} = (\mu +1)n - a\) on  \(\mathbb{L}_{\mu } \cap \mathbb{V}\) such that there exists a smooth matrix function Z 2 of size ((μ + 1)n,a) and pointwise maximal rank, satisfying \(Z_{2}^{T}F_{\mu;y}^{} = 0\) on \(\mathbb{L}_{\mu } \cap \mathbb{V}\) .

  2.

    We have \(\mathop{\mathrm{rank}}\nolimits Z_{2}^{T}F_{\mu;x} = a\) on  \(\mathbb{V}\) such that there exists a smooth matrix function T 2 of size (n,d), \(d = n - a\) , and pointwise maximal rank, satisfying \(Z_{2}^{T}F_{\mu;x}T_{2} = 0\) .

  3.

    We have \(\mathop{\mathrm{rank}}\nolimits F_{\dot{x}}T_{2} = d\) on  \(\mathbb{V}\) such that there exists a smooth matrix function Z 1 of size (n,d) and pointwise maximal rank, satisfying \(\mathop{\mathrm{rank}}\nolimits Z_{1}^{T}F_{\dot{x}}T_{2} = d\) .

Note that the local existence of the functions \(Z_{2}\), \(T_{2}\), \(Z_{1}\) is guaranteed by the following theorem, see, e.g., [13, Theorem 4.3]. Moreover, the theorem shows that we may assume that they possess (pointwise) orthonormal columns.

Theorem 1

Let \(E \in C^{\ell}(\mathbb{D}, \mathbb{R}^{m,n})\) , \(\ell\in \mathbb{N}_{0} \cup \{\infty \}\) , and assume that \(\mathop{\mathrm{rank}}\nolimits E(x) = r\) for all \(x \in \mathbb{M} \subseteq \mathbb{D}\) , \(\mathbb{D} \subseteq \mathbb{R}^{k}\) open. For every \(\hat{x} \in \mathbb{M}\) there exists a sufficiently small neighborhood \(\mathbb{V} \subseteq \mathbb{D}\) of  \(\hat{x}\) and matrix functions \(T \in C^{\ell}(\mathbb{V}, \mathbb{R}^{n,n-r})\) , \(Z \in C^{\ell}(\mathbb{V}, \mathbb{R}^{m,m-r})\) , with pointwise orthonormal columns such that

$$\displaystyle{ \mathit{ET} = 0,\quad Z^{T}E = 0 }$$
(16.5)

on \(\mathbb{M}\) .
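Pointwise, such bases can be computed numerically, e.g., from a singular value decomposition. A minimal sketch (numpy; the helper name nullspace_bases and the rank tolerance are our own choices):

```python
import numpy as np

def nullspace_bases(E, tol=1e-12):
    """Pointwise realization of Theorem 1 via the SVD: orthonormal bases
    T of kernel(E) (so E @ T = 0) and Z of kernel(E^T) (so Z.T @ E = 0)."""
    U, s, Vt = np.linalg.svd(E)
    r = int(np.sum(s > tol * max(s[0], 1.0))) if s.size else 0  # numerical rank
    T = Vt[r:].T        # n x (n - r)
    Z = U[:, r:]        # m x (m - r)
    return T, Z

E = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])                          # rank 1
T, Z = nullspace_bases(E)
print(np.allclose(E @ T, 0), np.allclose(Z.T @ E, 0))    # True True
```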

The quantity μ denotes how often we must differentiate the original DAE in order to be able to draw conclusions about existence and uniqueness of solutions. Typically, such a quantity is called an index. To distinguish it from other indices, the quantity μ, if chosen minimally, is called the strangeness index of the given DAE.

For linear DAEs, the above hypothesis is equivalent (for sufficiently smooth data) to the assumption of a well-defined differentiation index and thus to regularity of the given linear DAE, see [13]. In the nonlinear case, the hypothesis, of course, should imply some kind of regularity of the given problems.

In the following, we say that F satisfies Hypothesis 1 with (μ, a, d), if Hypothesis 1 holds with the choice μ, a, and d for the required integers.

3.2 Implications

In order to show that Hypothesis 1 implies a certain kind of regularity for the given DAE, we review the approach first given in [12], see also [13].

Let \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\) and

$$\displaystyle{T_{2,0} = T_{2}(t_{0},x_{0},y_{0}),\quad Z_{1,0} = Z_{1}(t_{0},x_{0},y_{0}),\quad Z_{2,0} = Z_{2}(t_{0},x_{0},y_{0}).}$$

Furthermore, let \(Z_{2,0}^{{\prime}}\) be chosen such that \([ Z_{2,0}^{{\prime}} Z_{2,0}^{} ]\) is orthogonal. By Hypothesis 1, the matrices \(Z_{2,0}^{T}F_{\mu;x}(t_{0},x_{0},y_{0})\) and \(Z_{2,0}^{{\prime}T}F_{\mu;y}(t_{0},x_{0},y_{0})\) have full row rank. Thus, we can split the variables x and y, without loss of generality according to x = (x 1, x 2) and y = (y 1, y 2), such that \(Z_{2,0}^{T}F_{\mu;x_{2}}(t_{0},x_{0},y_{0})\) and \(Z_{2,0}^{{\prime}T}F_{\mu;y_{2}}(t_{0},x_{0},y_{0})\) are nonsingular. Because of

$$\displaystyle{\mathop{\mathrm{rank}}\nolimits F_{\mu;x_{2},y_{2}} =\mathop{ \mathrm{rank}}\nolimits \left [\begin{array}{cc} Z_{2,0}^{{\prime}T}F_{\mu;x_{2}} & Z_{2,0}^{{\prime}T}F_{\mu;y_{2}} \\ Z_{2,0}^{T}F_{\mu;x_{2}} & Z_{2,0}^{T}F_{\mu;y_{2}} \end{array} \right ]}$$

and \(Z_{2,0}^{T}F_{\mu;y_{2}}(t_{0},x_{0},y_{0}) = 0\), this implies that \(F_{\mu;x_{2},y_{2}}(t_{0},x_{0},y_{0})\) is nonsingular. The implicit function theorem then yields that the equation \(F_{\mu }(t,x_{1},x_{2},y_{1},y_{2}) = 0\) is locally solvable for x 2 and y 2. Hence, there are locally defined functions \(\mathcal{G}\) and \(\mathcal{H}\) with

$$\displaystyle{ F_{\mu }(t,x_{1},\mathcal{G}(t,x_{1},y_{1}),y_{1},\mathcal{H}(t,x_{1},y_{1})) \equiv 0, }$$
(16.6)

implying the following structure of \(\mathbb{L}_{\mu }\).

Theorem 2

The set  \(\mathbb{L}_{\mu }\) forms a manifold of dimension n + 1 that can be locally parametrized by variables (t,x 1 ,y 1 ), where x 1 consists of d variables from x and y 1 consists of a variables from y.

In order to examine the implicitly defined functions in more detail, we consider the system of nonlinear equations H(t, x, y, α) = 0 with \(\alpha \in \mathbb{R}^{a}\) given by

$$\displaystyle{ H(t,x,y,\alpha ) = \left [\begin{array}{c} F_{\mu }(t,x,y) - Z_{2,0}\alpha \\ T_{1,0}^{T}(y - y_{0}) \end{array} \right ], }$$
(16.7)

where the columns of T 1, 0 form an orthonormal basis of \(\mathop{\mathrm{kernel}}\nolimits F_{\mu;y}(t_{0},x_{0},y_{0})\). Obviously, we have that H(t 0, x 0, y 0, 0) = 0. Choosing \(T_{1,0}^{{\prime}}\) such that \([ T_{1,0}^{{\prime}} T_{1,0}^{} ]\) is orthogonal, we get

$$\displaystyle{\begin{array}{l} \mathop{\mathrm{rank}}\nolimits H_{y,\alpha } =\mathop{ \mathrm{rank}}\nolimits \left [\begin{array}{cc} F_{\mu;y} & - Z_{2,0} \\ T_{1,0}^{T}& 0 \end{array} \right ] =\mathop{ \mathrm{rank}}\nolimits \left [\begin{array}{ccc} Z_{2,0}^{{\prime}T}F_{\mu;y}T_{1,0}^{{\prime}}&Z_{2,0}^{{\prime}T}F_{\mu;y}T_{1,0}^{}& {\ast} \\ Z_{2,0}^{T}F_{\mu;y}T_{1,0}^{{\prime}}&Z_{2,0}^{T}F_{\mu;y}T_{1,0}^{}& - I_{a} \\ {\ast} & I_{d} & 0 \end{array} \right ], \end{array} }$$

where here and in the following I k denotes the identity matrix in \(\mathbb{R}^{k,k}\) and its counterpart as constant matrix function. It follows that

$$\displaystyle{\mathop{\mathrm{rank}}\nolimits H_{y,\alpha }(t_{0},x_{0},y_{0},0) =\mathop{ \mathrm{rank}}\nolimits \left [\begin{array}{ccc} Z_{2,0}^{{\prime}T}F_{\mu;y}(t_{0},x_{0},y_{0})T_{1,0}^{{\prime}}& 0 & 0 \\ 0 & 0 & - I_{a} \\ 0 &I_{d}& 0 \end{array} \right ]}$$

and \(H_{y,\alpha }(t_{0},x_{0},y_{0},0)\) is nonsingular because \(Z_{2,0}^{{\prime}T}F_{\mu;y}(t_{0},x_{0},y_{0})T_{1,0}^{{\prime}}\), which represents the restriction of \(F_{\mu;y}(t_{0},x_{0},y_{0})\) to a linear map from its cokernel onto its range, is nonsingular. Thus, the nonlinear equation (16.7) is locally solvable with respect to (y, α), i.e., there are locally defined functions \(\hat{F}_{2}\) and \(\mathcal{Y}\) such that

$$\displaystyle{ F_{\mu }(t,x,\mathcal{Y}(t,x)) - Z_{2,0}\hat{F}_{2}(t,x) \equiv 0,\quad T_{1,0}^{T}(\mathcal{Y}(t,x) - y_{ 0}) \equiv 0. }$$
(16.8)

If we then define \(\hat{F}_{1}\) by

$$\displaystyle{ \hat{F}_{1}(t,x,\dot{x}) = Z_{1,0}^{T}F(t,x,\dot{x}), }$$
(16.9)

we obtain a DAE

$$\displaystyle{ \begin{array}{ll} \hat{F}_{1}(t,x,\dot{x}) = 0,\qquad &\mbox{ ($d$ differential equations)} \\ \hat{F}_{2}(t,x) = 0, &\mbox{ ($a$ algebraic equations)} \end{array} }$$
(16.10)

whose properties we shall now investigate.

Differentiating (16.8) with respect to x gives

$$\displaystyle{F_{\mu;x} + F_{\mu;y}\mathcal{Y}_{x} - Z_{2,0}\hat{F}_{2;x} = 0.}$$

Multiplying with \(Z_{2,0}^{T}\) from the left and evaluating at (t 0, x 0) then yields

$$\displaystyle{\hat{F}_{2;x}(t_{0},x_{0}) = Z_{2,0}^{T}F_{\mu;x}(t_{0},x_{0},y_{0}).}$$

With the above splitting of x, we have \(\hat{F}_{2}(t_{0},x_{0}) = 0\) by the construction of \(\hat{F}_{2}\), and \(\hat{F}_{2;x_{2}}(t_{0},x_{0})\) is nonsingular by the choice of the splitting. Hence, we can apply the implicit function theorem once more to obtain a locally defined function \(\mathcal{R}\) satisfying

$$\displaystyle{ \hat{F}_{2}(t,x_{1},\mathcal{R}(t,x_{1})) \equiv 0. }$$
(16.11)

In particular, the set \(\mathbb{M} =\hat{ F}_{2}^{-1}(\{0\})\) forms a manifold of dimension d + 1.

Lemma 1

Let \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\) . Then there is a neighborhood of (t 0 ,x 0 ,y 0 ) such that

$$\displaystyle{ \mathcal{R}(t,x_{1}) = \mathcal{G}(t,x_{1},y_{1}) }$$
(16.12)

for all (t,x,y) in this neighborhood.

Proof

We choose the neighborhood of (t 0, x 0, y 0) to be a ball with center (t 0, x 0, y 0) and sufficiently small radius. In particular, we assume that all implicitly defined functions can be evaluated for the stated arguments.

Differentiating (16.6) with respect to y 1 gives

$$\displaystyle{F_{\mu;x_{2}}\mathcal{G}_{y_{1}} + F_{\mu;y_{1}} + F_{\mu;y_{2}}\mathcal{H}_{y_{1}} = 0,}$$

where we omitted the argument \((t,x_{1},\mathcal{G}(t,x_{1},y_{1}),y_{1},\mathcal{H}(t,x_{1},y_{1}))\). If we multiply this with \(Z_{2}(t,x_{1},\mathcal{G}(t,x_{1},y_{1}),y_{1},\mathcal{H}(t,x_{1},y_{1}))^{T}\), defined according to Hypothesis 1, we get \(Z_{2}^{T}F_{\mu;x_{2}}\mathcal{G}_{y_{1}} = 0\). Since \(Z_{2}^{T}F_{\mu;x_{2}}\) is nonsingular for a sufficiently small radius of the neighborhood, it follows that \(\mathcal{G}_{y_{1}}(t,x_{1},y_{1}) = 0\).

Inserting \(x_{2} = \mathcal{R}(t,x_{1})\) into the first relation of (16.8) and splitting \(\mathcal{Y}\) according to y, we obtain

$$\displaystyle{F_{\mu }(t,x_{1},\mathcal{R}(t,x_{1}),\mathcal{Y}_{1}(t,x_{1},\mathcal{R}(t,x_{1})),\mathcal{Y}_{2}(t,x_{1},\mathcal{R}(t,x_{1}))) = 0.}$$

Comparing with (16.6), this yields

$$\displaystyle{\mathcal{R}(t,x_{1}) = \mathcal{G}(t,x_{1},\mathcal{Y}_{1}(t,x_{1},\mathcal{R}(t,x_{1}))).}$$

With this, we further obtain, setting \(\tilde{y}_{1} = \mathcal{Y}_{1}(t,x_{1},\mathcal{R}(t,x_{1}))\) for short, that

$$\displaystyle{\begin{array}{l} \mathcal{G}(t,x_{1},y_{1}) -\mathcal{R}(t,x_{1}) = \mathcal{G}(t,x_{1},y_{1}) -\mathcal{G}(t,x_{1},\tilde{y}_{1}) \\ = \mathcal{G}(t,x_{1},\tilde{y}_{1} + s(y_{1} -\tilde{ y}_{1}))\vert _{0}^{1} \\ =\int _{ 0}^{1}\mathcal{G}_{y_{1}}(t,x_{1},\tilde{y}_{1} + s(y_{1} -\tilde{ y}_{1}))(y_{1} -\tilde{ y}_{1})\,ds = 0. \end{array} }$$

 □ 

With the help of Lemma 1, we can simplify the relation (16.6) to

$$\displaystyle{ F_{\mu }(t,x_{1},\mathcal{R}(t,x_{1}),y_{1},\mathcal{H}(t,x_{1},y_{1})) \equiv 0. }$$
(16.13)

Theorem 3

Consider a sufficiently small neighborhood of \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\) . Let  \(\hat{F}_{2}\) and  \(\mathcal{R}\) be well-defined according to the above construction and let (t,x) with x = (x 1 ,x 2 ) be given such that (t,x) is in the domain of  \(\hat{F}_{2}\) and (t,x 1 ) is in the domain of  \(\mathcal{R}\) . Then the following statements are equivalent:

  (a)

    There exists y such that F μ (t,x,y) = 0.

  (b)

    \(\hat{F}_{2}(t,x) = 0\) .

  (c)

    \(x_{2} = \mathcal{R}(t,x_{1})\) .

Proof

The statements (b) and (c) are equivalent due to the implicit function theorem defining \(\mathcal{R}\). Assuming (a), let y be such that F μ (t, x, y) = 0. Then, \(x_{2} = \mathcal{G}(t,x_{1},y_{1}) = \mathcal{R}(t,x_{1})\) due to the implicit function theorem defining \(\mathcal{G}\) and Lemma 1. Assuming (c), we set \(y = \mathcal{Y}(t,x)\). With \(\hat{F}_{2}(t,x) = 0\), the relation (16.8) yields F μ (t, x, y) = 0. □ 

Theorem 4

Let F from (16.1) satisfy Hypothesis  1 with (μ,a,d). Then, \(\hat{F} = (\hat{F}_{1},\hat{F}_{2})\) satisfies Hypothesis  1 with (0,a,d).

Proof

Let \(\hat{\mathbb{L}}_{0} =\hat{ F}^{-1}(\{0\})\) and let \(\hat{Z}_{2},\hat{T}_{2},\hat{Z}_{1}\) denote the matrix functions belonging to \(\hat{F}\) as addressed by Hypothesis 1.

For \((t_{0},x_{0},y_{0}) \in F_{\mu }^{-1}(\{0\})\), the above construction yields \(\hat{F}_{2}(t_{0},x_{0}) = 0\). If \(\dot{x}_{0}\) denotes the first n components of y 0, then \(F(t_{0},x_{0},\dot{x}_{0}) = 0\) holds as first block of \(F_{\mu }(t_{0},x_{0},y_{0}) = 0\) implying \(\hat{F}_{1}(t_{0},x_{0},\dot{x}_{0}) = 0\). Hence, \((t_{0},x_{0},\dot{x}_{0}) \in \hat{ \mathbb{L}}_{0}\) and \(\hat{\mathbb{L}}_{0}\) is not empty.

Since \(Z_{1,0}^{T}F_{\dot{x}}(t_{0},x_{0},\dot{x}_{0})\) possesses full row rank due to Hypothesis 1, we may choose \(\hat{Z}_{2}^{T} = [ 0 I_{a} ]\). Differentiating (16.8) with respect to x yields

$$\displaystyle{F_{\mu;x} + F_{\mu;y}\mathcal{Y}_{x} - Z_{2,0}\hat{F}_{2;x} = 0.}$$

Multiplying with Z 2 T from the left, we get \(Z_{2}^{T}Z_{2,0}\hat{F}_{2;x} = Z_{2}^{T}F_{\mu;x}\), where \(Z_{2}^{T}Z_{2,0}\) is nonsingular in a neighborhood of \((t_{0},x_{0},y_{0})\). Hence, we have

$$\displaystyle{\mathop{\mathrm{kernel}}\nolimits \hat{F}_{2;x} =\mathop{ \mathrm{kernel}}\nolimits Z_{2}^{T}F_{\mu;x}}$$

such that we can choose \(\hat{T}_{2} = T_{2}\). The claim then follows since \(\hat{F}_{1;\dot{x}}T_{2} = Z_{1,0}^{T}F_{\dot{x}}T_{2}\) possesses full column rank due to Hypothesis 1. □ 

Since (16.10) has vanishing strangeness index, it is called a reduced DAE belonging to the original, possibly higher-index DAE (16.1). Note that a reduced DAE is defined in a neighborhood of every \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\), but also that it is not uniquely determined by the original DAE, even for a fixed \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\). What is uniquely determined for a fixed \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\) (at least when treating it as a function germ) is the function \(\mathcal{R}\).

Every continuously differentiable solution of (16.10) will satisfy \(x_{2} = \mathcal{R}(t,x_{1})\) pointwise. Thus, it will also satisfy \(\dot{x}_{2} = \mathcal{R}_{t}(t,x_{1}) + \mathcal{R}_{x_{1}}(t,x_{1})\dot{x}_{1}\) pointwise. Using these two relations, we can reduce the relation \(\hat{F}_{1}(t,x_{1},x_{2},\dot{x}_{1},\dot{x}_{2}) = 0\) of (16.10) to

$$\displaystyle{ \hat{F}_{1}(t,x_{1},\mathcal{R}(t,x_{1}),\dot{x}_{1},\mathcal{R}_{t}(t,x_{1}) + \mathcal{R}_{x_{1}}(t,x_{1})\dot{x}_{1}) = 0. }$$
(16.14)

If we now insert \(x_{2} = \mathcal{R}(t,x_{1})\) into (16.8), we obtain

$$\displaystyle{ F_{\mu }(t,x_{1},\mathcal{R}(t,x_{1}),\mathcal{Y}(t,x_{1},\mathcal{R}(t,x_{1}))) = 0. }$$
(16.15)

Differentiating this with respect to x 1 yields

$$\displaystyle{F_{\mu;x_{1}} + F_{\mu;x_{2}}\mathcal{R}_{x_{1}} + F_{\mu;y}(\mathcal{Y}_{x_{1}} + \mathcal{Y}_{x_{2}}\mathcal{R}_{x_{1}}) = 0.}$$

Multiplying with Z 2 T from the left, we get

$$\displaystyle{Z_{2}^{T}[F_{\mu;x_{1}}F_{\mu;x_{2}}]\left [\begin{array}{c} I_{d} \\ \mathcal{R}_{x_{1}} \end{array} \right ] = 0.}$$

Comparing with Hypothesis 1, we see that we may choose

$$\displaystyle{ T_{2} = \left [\begin{array}{c} I_{d} \\ \mathcal{R}_{x_{1}} \end{array} \right ]. }$$
(16.16)

Differentiating now (16.14) with respect to \(\dot{x}_{1}\) and using the definition of \(\hat{F}_{1}\), we find

$$\displaystyle{Z_{1,0}^{T}F_{\dot{ x}_{1}} + Z_{1,0}^{T}F_{\dot{ x}_{2}}\mathcal{R}_{x_{1}} = Z_{1,0}^{T}F_{\dot{ x}}T_{2},}$$

which is nonsingular due to Hypothesis 1. In order to apply the implicit function theorem, we need to require that \((t_{0},x_{10},\dot{x}_{10})\) solves (16.14). Note that this is not a consequence of \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\). Under this additional requirement, the implicit function theorem implies the local existence of a function \(\mathcal{L}\) satisfying

$$\displaystyle{ \hat{F}_{1}(t,x_{1},\mathcal{R}(t,x_{1}),\mathcal{L}(t,x_{1}),\mathcal{R}_{t}(t,x_{1}) + \mathcal{R}_{x_{1}}(t,x_{1})\mathcal{L}(t,x_{1})) \equiv 0. }$$
(16.17)

With the help of the functions \(\mathcal{L}\) and \(\mathcal{R}\), we can formulate a further DAE of the form

$$\displaystyle{ \begin{array}{ll} \dot{x}_{1} = \mathcal{L}(t,x_{1}),\qquad &\mbox{ ($d$ differential equations)} \\ x_{2} = \mathcal{R}(t,x_{1}).&\mbox{ ($a$ algebraic equations)} \end{array} }$$
(16.18)

Note that this DAE consists of a decoupled ODE for x 1, where we can freely impose an initial condition as long as we remain in the domain of \(\mathcal{L}\). Having so fixed x 1, the part x 2 follows directly from the second relation. In this sense, (16.18) can be seen as a prototype for a regular DAE.

The further discussion is now dedicated to the relation between (16.18) and the original DAE.

We start with the assumption that the original DAE (16.1) possesses a smooth local solution \(x^{{\ast}}\) in the sense that there is a continuous path \((t,x^{{\ast}}(t),\mathcal{P}(t)) \in \mathbb{L}_{\mu }\) defined on a neighborhood of t 0, where the first block of \(\mathcal{P}\) coincides with \(\dot{x}^{{\ast}}\). Note that if \(x^{{\ast}}\) is (μ + 1)-times continuously differentiable, we can just take the path given by \(\mbox{ $\mathcal{P} = (\dot{x}^{{\ast}},\ddot{x}^{{\ast}},\ldots,(d/\mathit{dt})^{\mu +1}x^{{\ast}})$}\). Setting \((t_{0},x_{0},y_{0}) = (t_{0},x^{{\ast}}(t_{0}),\mathcal{P}(t_{0}))\), Theorem 3 yields that \(x_{2}^{{\ast}}(t) = \mathcal{R}(t,x_{1}^{{\ast}}(t))\). Hence, \(\dot{x}_{2}^{{\ast}}(t) = \mathcal{R}_{t}(t,x_{1}^{{\ast}}(t)) + \mathcal{R}_{x_{1}}(t,x_{1}^{{\ast}}(t))\dot{x}_{1}^{{\ast}}(t)\). In particular, Eq. (16.14) is solved by \((t,x_{1},\dot{x}_{1}) = (t,x_{1}^{{\ast}},\dot{x}_{1}^{{\ast}})\). Thus, it also follows that \(\dot{x}_{1}^{{\ast}}(t) = \mathcal{L}(t,x_{1}^{{\ast}}(t))\). In this way, we have proven the following theorem.

Theorem 5

Let F from (16.1) satisfy Hypothesis  1 with (μ,a,d). Then every local solution \(x^{{\ast}}\) of (16.1), in the sense that it extends to a continuous local path \((t,x^{{\ast}}(t),\mathcal{P}(t)) \in \mathbb{L}_{\mu }\) where the first block of  \(\mathcal{P}\) coincides with  \(\dot{x}^{{\ast}}\) , also solves the reduced problems (16.10) and (16.18).

3.3 The Way Back

To show a converse result to Theorem 5, we need to require the solvability of (16.14) for the local existence of the function \(\mathcal{L}\). For this, we assume that F not only satisfies Hypothesis 1 with (μ, a, d), but also with (μ + 1, a, d). Let now \((t_{0},x_{0},y_{0},z_{0}) \in \mathbb{L}_{\mu +1}\). Due to the construction of the functions \(F_{\ell}\), we have

$$\displaystyle{ F_{\mu +1} = \left [\begin{array}{c} F_{\mu } \\ ( \frac{d} {\mathit{dt}})^{\mu +1}F \end{array} \right ],\quad F_{\mu +1;y,z} = \left [\begin{array}{cc} F_{\mu;y} & 0 \\ (( \frac{d} {\mathit{dt}})^{\mu +1}F)_{ y}&(( \frac{d} {\mathit{dt}})^{\mu +1}F)_{ z} \end{array} \right ], }$$
(16.19)

where the independent variable z is a short-hand notation for \(x^{(\mu +2)}\). Since F μ; y and F μ+1; y, z are assumed to have the same rank drop, we find that Z 2 belonging to F μ satisfies

$$\displaystyle{[\,Z_{2}^{T}\;\;0\,]F_{\mu +1;y,z} = [\,Z_{2}^{T}\;\;0\,]\left [\begin{array}{cc} F_{\mu;y} & 0 \\ (( \frac{d} {\mathit{dt}})^{\mu +1}F)_{ y}&(( \frac{d} {\mathit{dt}})^{\mu +1}F)_{ z} \end{array} \right ] = [\,0\;\;0\,].}$$

Consequently, in Hypothesis 1 considered for F μ+1, we may choose \([ Z_{2}^{T} 0 ]\) describing the left nullspace of F μ+1; y, z such that the same choices are possible for T 2 and Z 1.

Observing that we may write the independent variables (t, x, y, z) also as \((t,x,\dot{x},\dot{y})\) by simply changing the partitioning of the blocks, and that the equation \(F_{\mu +1} = 0\) contains F μ  = 0 as well as \(\frac{d} {\mathit{dt}}F_{\mu } = 0\), which has the form

$$\displaystyle{ \frac{d} {\mathit{dt}}F_{\mu } = F_{\mu;t} + F_{\mu;x}\dot{x} + F_{\mu;y}\dot{y} = 0,}$$

we get

$$\displaystyle{Z_{2}^{T}F_{\mu;t} + Z_{2}^{T}F_{\mu;x}\dot{x} = 0.}$$

Using the same splitting x = (x 1, x 2) as above and \(\dot{x} = (\dot{x}_{1},\dot{x}_{2})\) accordingly, we obtain

$$\displaystyle{Z_{2}^{T}F_{\mu;t} + Z_{2}^{T}F_{\mu;x_{1}}\dot{x}_{1} + Z_{2}^{T}F_{\mu;x_{2}}\dot{x}_{2} = 0,}$$

which yields

$$\displaystyle{ \dot{x}_{2} = -(Z_{2}^{T}F_{\mu;x_{2}})^{-1}(Z_{ 2}^{T}F_{\mu;t} + Z_{2}^{T}F_{\mu;x_{1}}\dot{x}_{1}). }$$
(16.20)

On the other hand, differentiation of (16.13) with respect to t yields

$$\displaystyle{F_{\mu;t} + F_{\mu;x_{1}}\dot{x}_{1} + F_{\mu;x_{2}}(\mathcal{R}_{t} + \mathcal{R}_{x_{1}}\dot{x}_{1}) + F_{\mu;y_{1}}\dot{y}_{1} + F_{\mu;y_{2}}(\mathcal{H}_{t} + \mathcal{H}_{x_{1}}\dot{x}_{1} + \mathcal{H}_{y_{1}}\dot{y}_{1}) = 0}$$

and thus

$$\displaystyle{Z_{2}^{T}F_{\mu;t} + Z_{2}^{T}F_{\mu;x_{1}}\dot{x}_{1} = -Z_{2}^{T}F_{\mu;x_{2}}(\mathcal{R}_{t} + \mathcal{R}_{x_{1}}\dot{x}_{1}).}$$

Inserting this into (16.20) yields

$$\displaystyle{ \dot{x}_{2} = \mathcal{R}_{t} + \mathcal{R}_{x_{1}}\dot{x}_{1}. }$$
(16.21)

Hence, the given point \((t_{0},x_{0},\dot{x}_{0},\dot{y}_{0})\) satisfies

$$\displaystyle{\dot{x}_{20} = \mathcal{R}_{t}(t_{0},x_{10}) + \mathcal{R}_{x_{1}}(t_{0},x_{10})\dot{x}_{10}.}$$

It then follows that \((t_{0},x_{10},\dot{x}_{10})\) solves (16.14). In particular, this guarantees that the implicit function theorem is applicable to (16.14), leading to a locally defined \(\mathcal{L}\). Thus, the reduced system (16.18) is locally well-defined. Moreover, for every initial value for x 1 near x 10, the initial value problem for x 1 in (16.18) possesses a solution \(x_{1}^{{\ast}}\). The second equation in (16.18) then yields a locally defined \(x_{2}^{{\ast}}\) such that \(x^{{\ast}} = (x_{1}^{{\ast}},x_{2}^{{\ast}})\) forms a solution of (16.18).

For the same reasons as for \(\mathbb{L}_{\mu }\), the set \(\mathbb{L}_{\mu +1}\) can be locally parametrized by n + 1 variables. Among these variables are again t and x 1. But since x 2, \(\dot{x}_{1}\), and \(\dot{x}_{2}\) are all functions of (t, x 1), the remaining variables, say p, are now from \(\dot{y}\). In particular, there is a locally defined function \(\mathcal{Z}\) satisfying

$$\displaystyle{F_{\mu +1}(t,x_{1},\mathcal{R}(t,x_{1}),\mathcal{L}(t,x_{1}),\mathcal{R}_{t}(t,x_{1}) + \mathcal{R}_{x_{1}}(t,x_{1})\mathcal{L}(t,x_{1}),\mathcal{Z}(t,x_{1},p)) \equiv 0.}$$

Choosing now \(x_{1}^{{\ast}}(t)\) for x 1 and \(p^{{\ast}}(t)\) arbitrarily within the domain of \(\mathcal{Z}\), for example \(p^{{\ast}}(t) = p_{0}\), where p 0 is the matching part of \(\dot{y}_{0}\), yields

$$\displaystyle{F_{\mu +1}(t,x_{1}^{{\ast}}(t),x_{ 2}^{{\ast}}(t),\dot{x}_{ 1}^{{\ast}}(t),\dot{x}_{ 2}^{{\ast}}(t),\mathcal{Z}(t,x_{ 1}^{{\ast}}(t),p^{{\ast}}(t))) \equiv 0,}$$

which contains

$$\displaystyle{ F(t,x_{1}^{{\ast}}(t),x_{ 2}^{{\ast}}(t),\dot{x}_{ 1}^{{\ast}}(t),\dot{x}_{ 2}^{{\ast}}(t)) \equiv 0 }$$
(16.22)

in the first block. But this means precisely that \(x^{{\ast}} = (x_{1}^{{\ast}},x_{2}^{{\ast}})\) locally solves the original problem. Moreover, locally there is a continuous function \(\mathcal{P}\) such that its first block coincides with \(\dot{x}^{{\ast}}\) and \((t,x^{{\ast}}(t),\mathcal{P}(t)) \in \mathbb{L}_{\mu }\). Summarizing, we have proven the following statement.

Theorem 6

If F satisfies Hypothesis  1 with (μ,a,d) and (μ + 1,a,d), then every local solution \(x^{{\ast}}\) of the reduced DAE (16.18) is also a local solution of the original DAE. Moreover, it extends to a continuous local path \((t,x^{{\ast}}(t),\mathcal{P}(t)) \in \mathbb{L}_{\mu }\) , where the first block of  \(\mathcal{P}\) coincides with  \(\dot{x}^{{\ast}}\) .

The numerical treatment of DAEs is usually based on the assumption that there is a solution to be computed. In view of Theorem 5 it is therefore sufficient to work with the derivative array F μ . However, we must assume in addition that the given point \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\) provides suitable starting values for the nonlinear system solvers being part of the numerical procedure. Note that this corresponds to the assumption that we may apply the implicit function theorem for the definition of \(\mathcal{L}\).

3.4 Overdetermined Consistent DAEs

Hypothesis 1 can be generalized in various ways. For example, we may include underdetermined problems, which would cover control problems by treating states and controls as indistinguishable parts of the unknown. We may also allow overdetermined problems or problems with redundant equations. The main difficulty in formulating corresponding hypotheses is to decide at which points to require properties of the Jacobians of the derivative array equations. Note that the restriction in Hypothesis 1 to points in the solution set of the derivative array equations leads to better covariance properties of the hypothesis, see [13], but it excludes problems where this set is empty, e.g., linear least-squares problems. In the following, we want to present a generalization to overdetermined, but consistent (i.e., solvable) DAEs. Such DAEs may arise by extending a given DAE by some or all hidden constraints, i.e., relations contained in \(\hat{F}_{2}(t,x) = 0\) whose derivation requires differentiation of the original DAE, or by extending a given DAE or even an ODE by known first integrals.

Hypothesis 2

There exist (nonnegative) integers μ, a, d, and v such that the set

$$\displaystyle{ \qquad \mathbb{L}_{\mu } =\{ (t,x,y) \in \mathbb{R}^{(\mu +2)n+1}\mid F_{\mu }(t,x,y) = 0\} }$$
(16.23)

associated with F is nonempty and such that for every point \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\) , there exists a (sufficiently small) neighborhood  \(\mathbb{V}\) in which the following properties hold:

  1.

    We have \(\mathop{\mathrm{rank}}\nolimits F_{\mu;y} = (\mu +1)m - v\) on  \(\mathbb{L}_{\mu } \cap \mathbb{V}\) such that there exists a smooth matrix function Z 2 of size ((μ + 1)m,v) and pointwise maximal rank, satisfying \(Z_{2}^{T}F_{\mu;y}^{} = 0\) on \(\mathbb{L}_{\mu } \cap \mathbb{V}\) .

  2.

    We have \(\mathop{\mathrm{rank}}\nolimits Z_{2}^{T}F_{\mu;x}^{} = a\) on  \(\mathbb{V}\) such that there exists a smooth matrix function T 2 of size (n,d), \(d = n - a\) , and pointwise maximal rank, satisfying \(Z_{2}^{T}F_{\mu;x}^{}T_{2}^{} = 0\) .

  3.

    We have \(\mathop{\mathrm{rank}}\nolimits F_{\dot{x}}T_{2} = d\) on  \(\mathbb{V}\) such that there exists a smooth matrix function Z 1 of size (m,d) and pointwise maximal rank, satisfying \(\mathop{\mathrm{rank}}\nolimits Z_{1}^{T}F_{\dot{x}}^{}T_{2}^{} = d\) .

A corresponding construction as for Hypothesis 1 shows that Hypothesis 2 implies a reduced DAE of the form (16.10) with the same properties as stated there. In particular, a result similar to Theorem 5 holds. Due to the assumed consistency, the omitted relations (the reduced DAE falls m − n scalar relations short of the original) do not contradict these equations. Thus, the solutions fixed by the reduced DAE will be solutions of the original overdetermined DAE under assumptions similar to those of Theorem 6. Since the arguments are along the same lines as presented above, we omit the details here.

An example of a problem covered by Hypothesis 2 is given by Example 1 if we simply add the two equations obtained by differentiation and elimination to the original DAE, leading to a problem consisting of 9 equations in 7 unknowns. A second example, which we will also address in the numerical experiments, consists of an ODE with a known first integral.

Example 2

A simple predator/prey model is described by the so-called Lotka/Volterra system

$$\displaystyle{\dot{x}_{1} = x_{1}(1 - x_{2}),\quad \dot{x}_{2} = -c\,x_{2}(1 - x_{1}),}$$

where c > 0 is some given constant, see, e.g., [14]. It is well-known that

$$\displaystyle{H(x_{1},x_{2}) = c(x_{1} -\log x_{1}) + (x_{2} -\log x_{2})}$$

is a first integral of this system, implying that the positive solutions are periodic. The combined overdetermined system

$$\displaystyle{\begin{array}{l} \dot{x}_{1} = x_{1}(1 - x_{2}), \\ \dot{x}_{2} = -c\,x_{2}(1 - x_{1}), \\ c(x_{1} -\log x_{1}) + (x_{2} -\log x_{2}) = H_{0}, \end{array} }$$

where \(H_{0} = H(x_{10},x_{20})\) for given initial values \(x_{1}(t_{0}) = x_{10}\), \(x_{2}(t_{0}) = x_{20}\), is therefore consistent. Moreover, it can be shown to satisfy Hypothesis 2 with μ = 0, a = 1, d = 1, and v = 1. In contrast to Example 1, we cannot decide in advance which of the two differential equations should be used together with the algebraic constraint. For stability reasons, we should rather use an appropriate linear combination of the two differential equations. But this is precisely the role of \(Z_{1}\) in Hypothesis 2. ♢
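That H is indeed a first integral can be verified mechanically by checking that its derivative along the vector field vanishes; a minimal sketch (sympy, our own naming):

```python
import sympy as sp

x1, x2, c = sp.symbols('x1 x2 c', positive=True)
H = c*(x1 - sp.log(x1)) + (x2 - sp.log(x2))
f = sp.Matrix([x1*(1 - x2), -c*x2*(1 - x1)])     # Lotka/Volterra vector field

# dH/dt along solutions = grad(H) . f, which simplifies to zero
dHdt = sp.simplify(sp.Matrix([H.diff(x1), H.diff(x2)]).dot(f))
print(dHdt)   # 0
```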

4 Integration of Nonlinear DAEs

In this section, we discuss several issues that play a role when one wants to integrate DAE systems numerically in an efficient way.

4.1 Discretizations

The idea for developing methods for the numerical solution of unstructured DAEs is to discretize not the original DAE (16.1) but the reduced DAE (16.10), because the latter contains no hidden constraints, i.e., we do not need to differentiate the functions in the reduced DAE. Of course, the functions in the reduced DAE are themselves defined by relations that involve differentiation. But these are differentiations of the original function F, which may be obtained by hand or by means of automatic differentiation.

A well-known class of discretizations of DAEs is given by the BDF methods, see, e.g., [6]. We want to concentrate here on two families of one-step methods that are suitable for the integration of DAEs of the form (16.10). In the following, we denote the initial value at t 0 by x 0 and the stepsize by h. The discretization should then determine an approximate solution x 1 at the point \(t_{1} = t_{0} + h\).

The first family of methods are the Radau IIa methods, which are collocation methods based on the Radau nodes

$$\displaystyle{ 0 <\gamma _{1} < \cdots <\gamma _{s} = 1, }$$
(16.24)

where \(s \in \mathbb{N}\) denotes the number of stages, see, e.g., [10]. The discretization of (16.10) then reads

$$\displaystyle{ \begin{array}{l} \hat{F}_{1}(t_{0} +\gamma _{j}h,X_{j}, \frac{1} {h}(v_{j0}x_{0} +\sum _{ l=1}^{s}v_{ jl}X_{l})) = 0, \\ \hat{F}_{2}(t_{0} +\gamma _{j}h,X_{j}) = 0,\qquad \qquad \qquad j = 1,\ldots,s,\end{array} }$$
(16.25)

together with \(x_{1} = X_{s}\), where \(X_{j}\), \(j = 1,\ldots,s\), denote the stage values of the Runge-Kutta scheme. The coefficients \(v_{jl}\) are determined by the nodes (16.24). For details and the proof of the following convergence result, see, e.g., [13].
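For concreteness, the coefficients \(v_{jl}\) are the derivatives of the Lagrange basis polynomials on the nodes 0, γ 1, …, γ s, evaluated at the collocation points. A small sketch of one way to compute them (numpy; the helper lagrange_diff_matrix is our naming, shown for s = 2 with γ = (1∕3, 1)):

```python
import numpy as np

def lagrange_diff_matrix(nodes, eval_pts):
    """D[j, l]: derivative of the l-th Lagrange basis polynomial on `nodes`,
    evaluated at eval_pts[j]; on the reference interval [0, 1] these are
    the collocation coefficients v_{jl}."""
    n = len(nodes)
    D = np.zeros((len(eval_pts), n))
    for l in range(n):
        for j, tau in enumerate(eval_pts):
            total = 0.0
            for k in range(n):
                if k == l:
                    continue
                term = 1.0 / (nodes[l] - nodes[k])
                for m in range(n):
                    if m != l and m != k:
                        term *= (tau - nodes[m]) / (nodes[l] - nodes[m])
                total += term
            D[j, l] = total
    return D

gamma = np.array([1.0 / 3.0, 1.0])                # Radau nodes for s = 2
V = lagrange_diff_matrix(np.concatenate(([0.0], gamma)), gamma)
# V[j, 0] corresponds to v_{j0}, V[j, l] to v_{jl} for l = 1, ..., s
```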

Theorem 7

The Radau IIa methods (16.25) applied to a reduced DAE (16.10) are convergent of order \(p = 2s - 1\) .

Note that the Radau IIa methods exhibit the same convergence order as in the special case of an ODE. The produced new value x 1 satisfies all the constraints due to the included relation \(\hat{F}_{2}(t_{1},x_{1}) = 0\).

The second family of methods consists of partitioned collocation methods, which use Gauß nodes for the differential equations and Lobatto nodes for the algebraic equations given by

$$\displaystyle{ 0 <\rho _{1} < \cdots <\rho _{k} < 1,\quad 0 =\sigma _{0} < \cdots <\sigma _{k} = 1, }$$
(16.26)

with \(k \in \mathbb{N}\). Observe that we use one more Lobatto node, thus equating the orders of the corresponding collocation methods for ODEs. The discretization of (16.10) then reads

$$\displaystyle{ \begin{array}{l} \hat{F}_{1}(t_{0} +\rho _{j}h,u_{j0}x_{0} +\sum _{ l=1}^{k}u_{jl}X_{l}, \frac{1} {h}(v_{j0}x_{0} +\sum _{ l=1}^{k}v_{ jl}X_{l})) = 0, \\ \hat{F}_{2}(t_{0} +\sigma _{j}h,X_{j}) = 0,\qquad \qquad \qquad \qquad \qquad \qquad j = 1,\ldots,k,\end{array} }$$
(16.27)

together with \(x_{1} = X_{k}\). The coefficients \(u_{jl}\) and \(v_{jl}\) are determined by the nodes (16.26). For details and the proof of the following convergence result, see again [13].

Theorem 8

The Gauß-Lobatto methods (16.27) applied to a reduced DAE (16.10) are convergent of order p = 2k.

Note that in contrast to the Radau IIa methods, the Gauß-Lobatto methods are symmetric. Thus, they may be preferred when symmetry of the method is an issue, e.g., in the solution of boundary value problems. In the case of an ODE, the Gauß-Lobatto methods reduce to the corresponding Gauß collocation methods. As for the Radau IIa methods, the produced new value x 1 satisfies all the constraints due to the included relation \(\hat{F}_{2}(t_{1},x_{1}) = 0\).

For the actual computation, we lift the discretization from the reduced DAE to the original DAE by using Theorem 3. In particular, we replace every relation of the form \(\hat{F}_{2}(t,x) = 0\) by F μ (t, x, y) = 0 with the help of an additional unknown y. Note that by this process the system describing the discretization becomes underdetermined. Nevertheless, the desired value x 1 will still (at least locally) be uniquely fixed. The Radau IIa methods then read

$$\displaystyle{ \begin{array}{l} Z_{1,0}^{T}F(t_{0} +\gamma _{j}h,X_{j}, \frac{1} {h}(v_{j0}x_{0} +\sum _{ l=1}^{s}v_{ jl}X_{l})) = 0, \\ F_{\mu }(t_{0} +\gamma _{j}h,X_{j},Y _{j}) = 0,\qquad \qquad \qquad \ j = 1,\ldots,s,\end{array} }$$
(16.28)

and the Gauß-Lobatto methods then read

$$\displaystyle{ \begin{array}{l} Z_{1,0}^{T}F(t_{0} +\rho _{j}h,u_{j0}x_{0} +\sum _{ l=1}^{k}u_{jl}X_{l}, \frac{1} {h}(v_{j0}x_{0} +\sum _{ l=1}^{k}v_{ jl}X_{l})) = 0, \\ F_{\mu }(t_{0} +\sigma _{j}h,X_{j},Y _{j}) = 0,\qquad \qquad \qquad \qquad \qquad \qquad j = 1,\ldots,k.\end{array} }$$
(16.29)

In the case of overdetermined DAEs governed by Hypothesis 2, the discretizations look the same.

In order to perform a step with the above one-step methods given an initial value \((t_{0},x_{0},y_{0}) \in \mathbb{L}_{\mu }\), we can determine Z 1, 0 along the lines of the above hypotheses. We then must provide starting values for a suitable nonlinear system solver for the solution of the nonlinear systems describing the discretization, typically the Gauß-Newton method or a variant of it. Upon convergence, we obtain a final value (t 1, x 1, y 1) as part of the overall solution (which includes the internal stages), which will then be the initial value for the next step. Note that for performing a Gauß-Newton-like method for these problems, which we will write as \(\mathcal{F}(z) = 0\) for short in the following, we must be able to evaluate the function \(\mathcal{F}\) and its Jacobian \(\mathcal{F}_{z}\) at given points. Thus, we must be able to evaluate F and F μ and their Jacobians, which can be done by using automatic differentiation, see below.

4.2 Gauß-Newton-Like Processes

The design of the Gauß-Newton-like method is crucial for the efficiency of the approach. Note that we had to replace \(\hat{F}_{2}\) by F μ thus increasing the number of equations and unknowns significantly. However, there is some structure in the equations that can be utilized in order to improve the efficiency. We will sketch this approach in the following for the case of the Radau IIa discretization. Similar techniques can be applied to the case of the Gauß-Lobatto discretization.

Linearizing the equation \(\mathcal{F}(z) = 0\) around some given z yields the linear problem \(\mathcal{F}(z) + \mathcal{F}_{z}(z)\varDelta z = 0\) for the correction Δ z. The ordinary Gauß-Newton method is then characterized by solving for Δ z by means of the Moore-Penrose pseudoinverse \(\mathcal{F}_{z}(z)^{+}\) of \(\mathcal{F}_{z}(z)\), i.e.,

$$\displaystyle{ \varDelta z = -\mathcal{F}_{z}(z)^{+}\mathcal{F}(z). }$$
(16.30)

Instead of the Moore-Penrose pseudoinverse, we are allowed to use any other equation-solving generalized inverse of \(\mathcal{F}_{z}(z)\). Due to the consistency of the nonlinear problem to be solved, we are also allowed to perturb the Jacobian as long as the perturbation is sufficiently small or even tends to zero during the iteration.

In the case (16.28), linearization leads to

$$\displaystyle{ \begin{array}{l} Z_{1,0}^{T}F_{x}^{j}\varDelta X_{j} + Z_{1,0}^{T}F_{\dot{x}}^{j}\frac{1} {h}\sum _{l=1}^{s}v_{ jl}\varDelta X_{l} = -Z_{1,0}^{T}F^{j}, \\ F_{\mu;x}^{j}\varDelta X_{j} + F_{\mu;y}^{j}\varDelta Y _{j} = -F_{\mu }^{j},\qquad \qquad \qquad j = 1,\ldots,s.\end{array} }$$
(16.31)

This system is to be solved for \((\varDelta X_{j},\varDelta Y _{j})\), \(j = 1,\ldots,s\). The superscript j indicates that the corresponding function is evaluated at the argument occurring in the j-th equation, i.e., at \((t_{0} +\gamma _{j}h,X_{j}, \frac{1} {h}(v_{j0}x_{0} +\sum _{ l=1}^{s}v_{ jl}X_{l}))\) in the case of F and at \((t_{0} +\gamma _{j}h,X_{j},Y _{j})\) in the case of F μ . Since (16.28) contains \(F_{\mu }^{j} = 0\), we will have \(\mathop{\mathrm{rank}}\nolimits F_{\mu;y}^{j} = (\mu +1)n - a\) at a solution of (16.28) due to Hypothesis 1. Near the solution, the matrix \(F_{\mu;y}^{j}\) is thus a perturbation of a matrix with rank drop a. The idea therefore is to perturb \(F_{\mu;y}^{j}\) to a matrix M j with \(\mathop{\mathrm{rank}}\nolimits M_{j} = (\mu +1)n - a\). Such a perturbation can be obtained by a rank-revealing QR decomposition or by a singular value decomposition, see, e.g., [7]. The second part of (16.31) then consists of equations of the form

$$\displaystyle{ F_{\mu;x}^{j}\varDelta X_{ j} + M_{j}\varDelta Y _{j} = -F_{\mu }^{j}. }$$
(16.32)

With the help of an orthogonal matrix \([ Z_{2,j}^{{\prime}} Z_{2,j}^{} ]\), where the columns of Z 2, j form an orthonormal basis of the left nullspace of M j , we can split (16.32) into

$$\displaystyle{ Z_{2,j}^{{\prime}T}F_{\mu;x}^{j}\varDelta X_{ j} + Z_{2,j}^{{\prime}T}M_{ j}\varDelta Y _{j} = -Z_{2,j}^{{\prime}T}F_{\mu }^{j},\quad Z_{ 2,j}^{T}F_{\mu;x}^{j}\varDelta X_{ j} = -Z_{2,j}^{T}F_{\mu }^{j}. }$$
(16.33)

The first part can be solved for Δ Y j via the Moore-Penrose pseudoinverse

$$\displaystyle{ \varDelta Y _{j} = -(Z_{2,j}^{{\prime}T}M_{ j})^{+}Z_{ 2,j}^{{\prime}T}(F_{\mu }^{j} + F_{\mu;x}^{j}\varDelta X_{ j}) }$$
(16.34)

in terms of Δ X j , thus fixing a special equation-solving pseudoinverse of the Jacobian under consideration. In order to determine the corrections Δ X j , we take an orthogonal matrix \([ T_{2,j}^{{\prime}} T_{2,j}^{} ]\), where the columns of T 2, j form an orthonormal basis of the right nullspace of \(Z_{2,j}^{T}F_{\mu;x}^{j}\), which is of full row rank near the solution due to Hypothesis 1. Defining the transformed corrections

$$\displaystyle{ \varDelta V _{j}^{{\prime}} = T_{ 2,j}^{{\prime}T}\varDelta X_{ j},\quad \varDelta V _{j}^{} = T_{2,j}^{T}\varDelta X_{ j}, }$$
(16.35)

we have \(\varDelta X_{j} = T_{2,j}^{{\prime}}\varDelta V _{j}^{{\prime}} + T_{2,j}^{}\varDelta V _{j}^{}\) and the second part of (16.33) becomes

$$\displaystyle{ Z_{2,j}^{T}F_{\mu;x}^{j}T_{ 2,j}^{{\prime}}\varDelta V _{ j}^{{\prime}} = -Z_{ 2,j}^{T}F_{\mu }^{j}. }$$
(16.36)

Due to Hypothesis 1, the square matrix \(Z_{2,j}^{T}F_{\mu;x}^{j}T_{2,j}^{{\prime}}\) is nonsingular near a solution such that we can solve for \(\varDelta V _{j}^{{\prime}}\) to get

$$\displaystyle{ \varDelta V _{j}^{{\prime}} = -(Z_{ 2,j}^{T}F_{\mu;x}^{j}T_{ 2,j}^{{\prime}})^{-1}Z_{ 2,j}^{T}F_{\mu }^{j}. }$$
(16.37)

Finally, transforming the equation in the first part of (16.31) to the variables \((\varDelta V _{j}^{{\prime}},\varDelta V _{j}^{})\) and eliminating the terms \(\varDelta V _{j}^{{\prime}}\) leaves a system in the unknowns Δ V j which is of the same size and form as if we had discretized an ODE of d equations by means of the Radau IIa method. This means that we have actually reduced the complexity to that of solving an ODE of the size of the differential part. Solving this system for the quantities Δ V j and combining these with the already obtained values \(\varDelta V _{j}^{{\prime}}\) then yields the corrections Δ X j .
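A condensed sketch of this elimination for a single stage j might look as follows (numpy; the truncated SVD stands in for any rank-revealing factorization, and the function name stage_corrections is ours, not from a library):

```python
import numpy as np

def stage_corrections(Fmux, Fmuy, Fmu, a):
    """Sketch of the elimination (16.32)-(16.37) for one stage j: F_{mu;y}
    is perturbed to M of rank (mu+1)n - a by a truncated SVD, the splittings
    [Z2' Z2] and [T2' T2] are built from SVDs, and the routine returns
    Delta V' together with a map Delta X -> Delta Y."""
    U, s, Vt = np.linalg.svd(Fmuy)
    r = Fmuy.shape[0] - a                  # prescribed rank (mu+1)n - a
    M = (U[:, :r] * s[:r]) @ Vt[:r]        # nearest matrix of rank r
    Z2p, Z2 = U[:, :r], U[:, r:]           # Z2^T M = 0

    A = Z2.T @ Fmux                        # a x n, full row rank (Hyp. 1)
    _, _, VtA = np.linalg.svd(A)
    T2p, T2 = VtA[:a].T, VtA[a:].T         # A @ T2 = 0

    dVp = -np.linalg.solve(A @ T2p, Z2.T @ Fmu)                   # (16.37)
    dY = lambda dX: -np.linalg.pinv(Z2p.T @ M) @ (Z2p.T @ (Fmu + Fmux @ dX))  # (16.34)
    return T2p, T2, dVp, dY
```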

The overall Gauß-Newton-like process, which can be written as

$$\displaystyle{ \varDelta z = -\mathcal{J} (z)^{+}\mathcal{F}(z) }$$
(16.38)

with \(\mathcal{J} (z) \rightarrow \mathcal{F}_{z}(z)\) when z converges to a solution, can be shown to be locally and quadratically convergent, see again [13]. Using such a process is indispensable for the efficient numerical solution of unstructured DAEs.

4.3 Minimal-Norm-Corrected Gauß-Newton Method

We have implemented the approach of the previous section both for the Radau IIa methods and for the Gauß-Lobatto methods. Experiments show that one can successfully solve nonlinear DAEs even for larger values of μ without having to assume a special structure. Applying it to the problem of Example 1, however, reveals a drawback of the approach described so far. In particular, we observe the following. Trying to solve the problem of Example 1 on a larger time interval starting at t = 0, one realizes that the integration terminates at about t = 14.5 because the nonlinear system solver fails, cp. Fig. 16.1. A closer look shows that the reason for this is that the undetermined components y, which are not relevant for the solution one is interested in, run out of scale. Scaling techniques cannot avoid this effect. They can only help to make use of the whole range provided by the floating point arithmetic. Using diagonal scaling, the iteration then terminates at about t = 71.4, cp. again Fig. 16.1.

Fig. 16.1 Decadic logarithm of the Euclidean norm of the generated numerical solution \((x_{i},y_{i})\)

Actually, proceeding from numerical approximations (x i , y i ) at t i to numerical approximations \((x_{i+1},y_{i+1})\) at t i+1 consists of two mechanisms. First, we must provide a starting value z for the nonlinear system solver. We call this predictor and write

$$\displaystyle{ z = \mathfrak{P}(x_{i},y_{i}). }$$
(16.39)

Then, the nonlinear system solver, called corrector in this context, yields the new approximation according to

$$\displaystyle{ (x_{i+1},y_{i+1}) = \mathfrak{C}(z). }$$
(16.40)

Thus the numerical flow Φ of our method effectively has the form

$$\displaystyle{ (x_{i+1},y_{i+1}) =\varPhi (x_{i},y_{i}),\quad \varPhi = \mathfrak{C} \circ \mathfrak{P}. }$$
(16.41)

The problem can then be described as follows. Even if the actual solution and the numerical approximations x i are bounded, there is no guarantee that the overall numerical solutions (x i , y i ) stay bounded.

In [3], it was examined how different predictors \(\mathfrak{P}\), in particular extrapolation of some order, influence the overall behavior of the process. The result was that linear extrapolation should be preferred to higher-order extrapolation. However, even linear extrapolation cannot avoid the blow-up.

The idea here is to modify the corrector \(\mathfrak{C}\), in particular to introduce damping into the nonlinear system solver. Recall that the nonlinear system to be solved does not, in general, have a unique solution, but that the part one is interested in, namely x i+1, is unique. Consider the iteration given by

$$\displaystyle{ \varDelta z = -\alpha z -\mathcal{F}_{z}(z)^{+}(\mathcal{F}(z) -\alpha \mathcal{F}_{ z}(z)z) }$$
(16.42)

with α ∈ [0, 1] replacing (16.30). For α = 0, we recover (16.30). For α = 1, we have

$$\displaystyle{z +\varDelta z = \mathcal{F}_{z}(z)^{+}(\mathcal{F}_{ z}(z)z -\mathcal{F}(z)),}$$

which in the linear case \(\mathcal{F}(z) = \mathbf{A}z -\mathbf{b}\) leads to \(z +\varDelta z = \mathbf{A}^{+}\mathbf{b}\) and thus to the minimum-norm solution with respect to the Euclidean norm. In this sense, the process defined by (16.42) contains some damping. Moreover, if α → 0 quadratically during the iteration, we maintain the quadratic convergence of the Gauß-Newton process. The following result is due to [2].

Theorem 9

Consider the problem \(\mathcal{F}(z) = 0\) and assume that the Jacobians \(\mathcal{F}_{z}(z)\) have full row rank. Furthermore, consider the iteration defined by (16.42) and assume that α → 0 quadratically during the iteration. Then the process so defined yields iterates that converge locally and quadratically to a solution of the given problem.

Observe that replacing (16.30) by (16.42) only consists of a slight modification of the original process. The main computational effort, namely the representation of \(\mathcal{F}_{z}(z)^{+}\), stays the same. Moreover, using a perturbed Jacobian \(\mathcal{J} (z)\) instead of \(\mathcal{F}_{z}(z)\) is still possible and does not influence the convergence behavior. Figure 16.1 shows that with this modified nonlinear system solver we are now able to produce bounded overall solutions in the case of Example 1.
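A minimal sketch of the resulting iteration, assuming α is reduced by successive squaring so that α → 0 quadratically as required by Theorem 9 (Python; mnc_gauss_newton is our naming, and the Moore-Penrose pseudoinverse stands in for a structured generalized inverse):

```python
import numpy as np

def mnc_gauss_newton(F, J, z0, alpha=0.1, tol=1e-10, maxit=25):
    """Minimal-norm-corrected Gauss-Newton iteration, cf. (16.42); alpha is
    squared in every step, which retains quadratic convergence (Theorem 9).
    J(z) must have full row rank near the solution."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(maxit):
        Fz, Jz = F(z), J(z)
        Jplus = np.linalg.pinv(Jz)               # Moore-Penrose pseudoinverse
        dz = -alpha * z - Jplus @ (Fz - alpha * (Jz @ z))
        z += dz
        if np.linalg.norm(dz) < tol:
            break
        alpha *= alpha                           # successive squaring
    return z

# underdetermined consistent example: one equation, two unknowns
F = lambda z: np.array([z[0]**2 + z[1]**2 - 1.0])
J = lambda z: np.array([[2.0 * z[0], 2.0 * z[1]]])
print(mnc_gauss_newton(F, J, np.array([2.0, 0.5])))  # a point on the unit circle
```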

4.4 Automatic Differentiation

In order to integrate (unstructured) DAEs, we must provide procedures for the evaluation of F and F μ together with their Jacobians. As already mentioned, this can be done by exploiting techniques from automatic differentiation, see, e.g., [9].

The simplest approach is to evaluate the functions on the fly: using special classes and overloaded operators, a call of a template function that implements F can produce the needed evaluations just by changing the class of the variables. The drawback of this approach is that there may be a lot of trivial computations where the derivatives are actually zero. Moreover, no code optimization is possible.
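A stripped-down illustration of this on-the-fly idea (Python stands in here for the overloading mechanism; the Dual class is a generic forward-mode sketch, not the implementation used for the experiments below):

```python
class Dual:
    """Forward-mode AD on the fly: overloaded operators propagate a value
    together with a directional derivative through any template function."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)
    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __rsub__(self, other):
        return self._lift(other).__sub__(self)
    def __mul__(self, other):
        o = self._lift(other)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

# d/dx1 of F(x) = x3 - x1^2 - x2^2 at (1, 2, 5): seed x1 with dot = 1
def F(x1, x2, x3):
    return x3 - x1 * x1 - x2 * x2

out = F(Dual(1.0, 1.0), Dual(2.0), Dual(5.0))
print(out.val, out.dot)    # 0.0 -2.0
```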

An alternative approach consists of two phases. First, one uses automatic differentiation to produce code for the evaluation of the needed functions. This code can then easily be compiled with optimization. The drawback here is that one has to adapt the automatic differentiation process, or the produced code, to the form needed for the subsequent integration of the DAE. Nevertheless, one can expect this approach to be more efficient for the actual integration of the DAE, especially for larger values of μ. In practice, one would prefer the first approach while a model is being developed and the second once the model is finalized.

As an example, we have run the problem from Example 1 with both approaches on the interval [0, 100] using the Gauß-Lobatto method for k = 3 and the minimal-norm-corrected Gauß-Newton-like method starting with α = 0.1 and using successive squaring. The computing time in the first case, exploiting automatic differentiation on the fly, was 2.8 s. The computing time in the second case, exploiting optimized code produced by automatic differentiation, was 0.6 s.

4.5 Exploiting First Integrals

If first integrals are known for a given ODE or DAE model, they should be included in the model, thus forcing the produced numerical approximations to obey these first integrals. The enlarged system is of course overdetermined but consistent. In general, it is not clear how to deduce a square system from the overdetermined one in order to apply standard integration procedures, cp. Example 2.

In Example 1, there are two hidden constraints, which were found by differentiation. As already mentioned there, it is in this case possible to reduce the problem consisting of the original equations and the two additional constraints to a square system by simply omitting two equations of the original system. Sticking to automatic differentiation and using the same setting as above, we can solve the overdetermined system in 0.9 s and the reduced square system in 0.7 s.

For Example 2, such an a priori reduction is not as obvious, but still possible due to the simple structure of this specific problem. We solved the overdetermined problem by means of the implicit Euler method (which is the Radau IIa method for s = 1), as well as the original ODE by means of the explicit and the implicit Euler method, performing 1,000 steps with stepsize h = 0.02. The results are shown in Fig. 16.2. As one would expect, the numerical solution of the ODE produced by the explicit Euler method spirals outwards, thus increasing the energy, while the numerical solution produced by the implicit Euler method spirals inwards, thus decreasing the energy. The numerical solution obtained from the overdetermined system, of course, conserves the energy by construction.

Fig. 16.2 Numerical solutions for the Lotka/Volterra model
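For illustration, a sketch of the overdetermined implicit Euler step for Example 2 (numpy; per step, three equations in two unknowns are solved in the least-squares sense by a plain Gauss-Newton iteration; all names are ours):

```python
import numpy as np

def lotka_overdetermined(x0, c=1.0, h=0.02, steps=1000):
    """Implicit Euler (Radau IIa with s = 1) for the overdetermined
    consistent Lotka/Volterra system of Example 2; each step solves three
    equations in two unknowns by Gauss-Newton (via lstsq)."""
    H = lambda x: c * (x[0] - np.log(x[0])) + (x[1] - np.log(x[1]))
    x0 = np.asarray(x0, dtype=float)
    H0 = H(x0)
    xs = [x0]
    for _ in range(steps):
        xp = xs[-1]
        x = xp.copy()                            # constant predictor
        for _ in range(20):                      # Gauss-Newton corrector
            r = np.array([(x[0] - xp[0]) / h - x[0] * (1 - x[1]),
                          (x[1] - xp[1]) / h + c * x[1] * (1 - x[0]),
                          H(x) - H0])
            J = np.array([[1/h - (1 - x[1]), x[0]],
                          [-c * x[1], 1/h + c * (1 - x[0])],
                          [c * (1 - 1/x[0]), 1 - 1/x[1]]])
            dx = np.linalg.lstsq(J, -r, rcond=None)[0]
            x = x + dx
            if np.linalg.norm(dx) < 1e-12:
                break
        xs.append(x)
    return np.array(xs)

trajectory = lotka_overdetermined([2.0, 1.0])   # stays on H = H_0 by construction
```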

4.6 Path Following by Arclength Parametrization

There are two extreme cases of DAEs, ODEs \(\dot{x} = f(t,x)\) on the one hand and nonlinear equations f(x) = 0 on the other. For \(F(t,x,\dot{x}) =\dot{ x} - f(t,x)\), Hypothesis 1 is trivially satisfied with μ = 0, a = 0, and d = n. For \(F(t,x,\dot{x}) = f(x)\), Hypothesis 1 is satisfied with μ = 0, a = n, and d = 0, provided f x (x) is nonsingular for all \(x \in \mathbb{L}_{0}\). Since t occurs neither as an argument nor via differentiated variables, the solutions are constant in time and thus, as solutions of a DAE, not particularly interesting. This changes if one considers parameter-dependent nonlinear equations f(x, τ) = 0, where τ is a scalar parameter. The problem is now underdetermined. Thus, it cannot satisfy either of the above hypotheses. Under the assumption that \([ f_{x} f_{\tau } ]\) has full row rank for all \((x,\tau ) \in \mathbb{M} = f^{-1}(\{0\})\neq \varnothing \), the solution set forms a one-dimensional manifold. If one is interested in tracing this manifold, one can use path following techniques, see, e.g., [4, 17]. However, it is also possible to treat such problems with solution techniques for DAEs. A first choice would be to interpret the parameter τ as the time t of the DAE. This would, however, require the parameter τ to be strictly monotone along the one-dimensional manifold. But there are applications where this is not the case. It may even happen that the points where the parameter τ is extremal are of special interest. In order to treat such problems, we need to define a special kind of time that is monotone in any case. Such a quantity is given by the arclength of the one-dimensional manifold, measured, say, from the initial point we start from. Since the arclength parametrization of a path is characterized by the property that the derivative with respect to the parametrization has Euclidean length one, we consider the DAE

$$\displaystyle{ f(x,\tau ) = 0,\quad \|\dot{x}\|_{2}^{2} + \vert \dot{\tau }\vert ^{2} = 1 }$$
(16.43)

for the unknown (x, τ). If \((x_{0},\tau _{0}) \in \mathbb{M}\) and \([ f_{x} f_{\tau } ]\) is of full row rank on \(\mathbb{M}\), the implicit function theorem yields that there is a local solution path \((\hat{x}(t),\hat{\tau }(t))\) passing through \((x_{0},\tau _{0})\). Moreover, \(\|\dot{\hat{x}}(t)\|_{2}^{2} + \vert \dot{\hat{\tau }}(t)\vert ^{2} = 1\) when we parametrize by arclength. Hence, the DAE (16.43) possesses a solution. Furthermore, writing (16.43) as \(F(z,\dot{z}) = 0\) with z = (x, τ), we have

$$\displaystyle{\mathbb{L}_{0} =\{ (z,\dot{z})\mid z = (\hat{x}(t),\hat{\tau }(t)),\ \dot{z} = (\dot{\hat{x}}(t),\dot{\hat{\tau }}(t))\}}$$

in Hypothesis 1. Because of

$$\displaystyle{F_{0;\dot{z}} = \left [\begin{array}{cc} 0 &0\\ 2\dot{x}^{T } &2\dot{\tau } \end{array} \right ],\quad F_{0;z} = \left [\begin{array}{cc} f_{x}&f_{\tau }\\ 0 & 0 \end{array} \right ],}$$

we may choose

$$\displaystyle{Z_{2} = \left [\begin{array}{c} I_{n}\\ 0 \end{array} \right ].}$$

By assumption, \(Z_{2}^{T}F_{0;z} = [ f_{x} f_{\tau } ]\) has full row rank, and we may choose \(T_{2}\) as a normalized vector in \(\mathop{\mathrm{kernel}}\nolimits [ f_{x} f_{\tau } ]\), which is one-dimensional. In particular, we may choose

$$\displaystyle{T_{2} = \left [\begin{array}{c} \dot{\hat{x}}\\ \dot{\hat{\tau }} \end{array} \right ]}$$

on \(\mathbb{L}_{0}\). Finally, we observe that

$$\displaystyle{F_{\dot{z}}T_{2} = \left [\begin{array}{cc} 0 &0\\ 2\dot{\hat{x}}^{T } &2\dot{\hat{\tau }} \end{array} \right ]\left [\begin{array}{c} \dot{\hat{x}}\\ \dot{\hat{\tau }} \end{array} \right ] = \left [\begin{array}{c} 0\\ 2 \end{array} \right ]}$$

has full column rank at the solution and thus in a neighborhood of it. Hence, the DAE (16.43) satisfies Hypothesis 1 with μ = 0, a = n, and d = 1, where n denotes the size of x. We can then use DAE solution techniques to solve (16.43) thus tracing the solution path of the original parametrized system of nonlinear equations.
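The following minimal sketch (our illustration, not the paper's implementation) discretizes (16.43) by the implicit Euler method: the arclength condition turns into the chord condition \(\|(z_{k+1} - z_{k})/h\|_{2}^{2} = 1\), so that each step requires the solution of a square system of n + 1 equations for \(z_{k+1} = (x_{k+1},\tau _{k+1})\).

```python
import numpy as np
from scipy.optimize import fsolve

def arclength_step(f, z_prev, h, guess):
    """One implicit Euler step for (16.43) with z = (x, tau):
       f(x, tau) = 0 together with ||(z - z_prev)/h||_2^2 = 1."""
    def residual(z):
        d = (z - z_prev) / h
        return np.append(f(z[:-1], z[-1]), d @ d - 1.0)
    return fsolve(residual, guess)

# Toy usage: trace the circle f(x, tau) = x^2 + tau^2 - 1 (our example).
f = lambda x, tau: np.array([x[0] ** 2 + tau ** 2 - 1.0])
z = np.array([1.0, 0.0])
guess = z + np.array([0.0, 0.1])  # crude predictor fixing the direction
for _ in range(20):
    z_new = arclength_step(f, z, 0.1, guess)
    guess = 2 * z_new - z          # secant predictor for the next step
    z = z_new
```

The predictor matters here: the chord condition has two solutions (forward and backward along the path), and the initial guess selects the forward one.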

In order to determine points along the path where the parameter τ is extremal, we may combine the DAE (16.43) with a root finding procedure, e.g., along the lines of [18] or the references therein. The points of interest are characterized by the condition \(\dot{\tau }= 0\). We therefore augment the DAE (16.43) according to

$$\displaystyle{ f(x,\tau ) = 0,\quad \|\dot{x}\|_{2}^{2} + \vert \dot{\tau }\vert ^{2} = 1,\quad w-\dot{\tau } = 0, }$$
(16.44)

and try to locate points along the solution satisfying w = 0. Writing the DAE (16.44) again as \(F(z,\dot{z}) = 0\), where now z = (x, τ, w), we have

$$\displaystyle{\mathbb{L}_{0} =\{ (z,\dot{z})\mid z = (\hat{x}(t),\hat{\tau }(t),\dot{\hat{\tau }}(t)),\ \dot{z} = (\dot{\hat{x}}(t),\dot{\hat{\tau }}(t),\ddot{\hat{\tau }}(t))\}}$$

in Hypothesis 1. Because of

$$\displaystyle{F_{0;\dot{z}} = \left [\begin{array}{ccc} 0 & 0 &0 \\ 2\dot{x}^{T}& 2\dot{\tau } &0\\ 0 & - 1 &0 \end{array} \right ],\quad F_{0;z} = \left [\begin{array}{ccc} f_{x}&f_{\tau }&0 \\ 0 & 0 &0\\ 0 & 0 &1 \end{array} \right ],}$$

we may choose

$$\displaystyle{Z_{2} = \left [\begin{array}{c} I_{n} \\ 0\\ 0 \end{array} \right ].}$$

Along the same lines as above, we may now choose

$$\displaystyle{T_{2} = \left [\begin{array}{cc} \dot{\hat{x}}&0\\ \dot{\hat{\tau }} &0 \\ 0 &1 \end{array} \right ]}$$

on \(\mathbb{L}_{0}\). We then observe that

$$\displaystyle{F_{\dot{z}}T_{2} = \left [\begin{array}{ccc} 0 & 0 &0 \\ 2\dot{\hat{x}}^{T}& 2\dot{\hat{\tau }} &0\\ 0 & - 1 &0 \end{array} \right ]\left [\begin{array}{cc} \dot{\hat{x}}&0\\ \dot{\hat{\tau }} &0 \\ 0 &1 \end{array} \right ] = \left [\begin{array}{cc} 0 &0\\ 2 &0 \\ -\dot{\hat{\tau }}&0 \end{array} \right ]}$$

fails to have full column rank at the solution. Thus, Hypothesis 1 cannot hold with μ = 0. We therefore consider Hypothesis 1 for μ = 1. Starting from

$$\displaystyle{F_{1;\dot{z},\ddot{z}} = \left [\begin{array}{ccc|ccc} 0 & 0 &0& 0 & 0 &0 \\ 2\dot{x}^{T} & 2\dot{\tau } &0& 0 & 0 &0\\ 0 & - 1 &0& 0 & 0 &0 \\ \hline f_{x}& f_{\tau } &0& 0 & 0 &0 \\ {\ast} & {\ast} &0&2\dot{x}^{T}& 2\dot{\tau } &0\\ 0 & 0 &1& 0 & - 1 &0 \end{array} \right ],\quad F_{1;z} = \left [\begin{array}{ccc} f_{x} &f_{\tau }&0 \\ 0 & 0 &0\\ 0 & 0 &1 \\ \hline {\ast}&{\ast}&0\\ 0 & 0 &0 \\ 0 & 0 &0 \end{array} \right ],}$$

we use the fact that \(0\neq (\dot{x}^{T},\dot{\tau })^{T} \in \mathop{\mathrm{kernel}}\nolimits [ f_{x} f_{\tau } ]\) at a solution and therefore

$$\displaystyle{\left [\begin{array}{cc} f_{x} &f_{\tau } \\ \dot{x}^{T}& \dot{\tau } \end{array} \right ]\mbox{ nonsingular}}$$

near the solution to deduce that \(\mathop{\mathrm{rank}}\nolimits F_{1;\dot{z},\ddot{z}} = n + 3\). Choosing

$$\displaystyle{Z_{2} = \left [\begin{array}{cc} I_{n} &0 \\ 0 &{\ast}\\ 0 &1 \\ \hline 0&{\ast}\\ 0 &0 \\ 0 &0 \end{array} \right ]}$$

gives

$$\displaystyle{Z_{2}^{T}F_{ 1;z} = \left [\begin{array}{ccc} f_{x}&f_{\tau }&0 \\ {\ast} &{\ast}&1 \end{array} \right ],}$$

which has full row rank by assumption. Choosing

$$\displaystyle{T_{2} = \left [\begin{array}{c} \dot{\hat{x}}\\ \dot{\hat{\tau }}\\ {\ast} \end{array} \right ]}$$

at the solution then yields

$$\displaystyle{F_{\dot{z}}T_{2} = \left [\begin{array}{ccc} 0 & 0 &0 \\ 2\dot{\hat{x}}^{T}& 2\dot{\hat{\tau }} &0\\ 0 & - 1 &0 \end{array} \right ]\left [\begin{array}{c} \dot{\hat{x}}\\ \dot{\hat{\tau }}\\ {\ast} \end{array} \right ] = \left [\begin{array}{c} 0\\ 2 \\ -\dot{\hat{\tau }}\end{array} \right ].}$$

Hence, the DAE (16.44) satisfies Hypothesis 1 with μ = 1, \(a = n + 1\), and d = 1, and we can treat (16.44) by the usual techniques. The location of points \(\hat{t}\) with \(\dot{\tau }(\hat{t}) = 0\) can now be seen as a root finding problem along solutions of (16.44) for the function g defined by

$$\displaystyle{ g(x,\tau,w) = w. }$$
(16.45)

In particular, it can be treated by standard root finding techniques.
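As an illustration (again on a toy problem of our own, the unit circle, where τ is extremal at (0, ±1)), the sketch below recovers w from the implicit Euler step as the difference quotient \((\tau _{k+1} -\tau _{k})/h\) and, upon a sign change of w between accepted steps, locates the root of g by a standard scalar root finder applied to the steplength.

```python
import numpy as np
from scipy.optimize import fsolve, brentq

f = lambda x, tau: np.array([x[0] ** 2 + tau ** 2 - 1.0])  # toy path

def step(z_prev, s, guess):
    """Implicit Euler step for (16.44); w = (tau_new - tau_prev)/s."""
    def residual(z):
        d = (z - z_prev) / s
        return np.append(f(z[:-1], z[-1]), d @ d - 1.0)
    z_new = fsolve(residual, guess)
    return z_new, (z_new[-1] - z_prev[-1]) / s

h = 0.1
z = np.array([1.0, 0.0])
guess, w_old = z + np.array([-0.01, h]), 1.0
for _ in range(30):
    z_new, w = step(z, h, guess)
    if w_old * w < 0.0:  # sign change of g = w: turning point in this step
        g = lambda s: step(z, s, z + s * (z_new - z) / h)[1]
        s_hat = brentq(g, 1e-3 * h, h)  # simple root assumed, cf. (16.46)
        print("turning point near", step(z, s_hat, z + s_hat * (z_new - z) / h)[0])
        break
    guess, z, w_old = 2 * z_new - z, z_new, w
```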

In order to be able to determine a root \(\hat{t}\) of g, we need this root to be simple, i.e., that

$$\displaystyle{ \frac{d} {\mathit{dt}}g(\hat{x}(t),\hat{\tau }(t),\hat{w}(t))\vert _{t=\hat{t}}\neq 0,\quad \hat{w}(t) =\dot{\hat{\tau }} (t). }$$
(16.46)

In the case of (16.45), this condition simply reads

$$\displaystyle{ \ddot{\hat{\tau }}(\hat{t})\neq 0. }$$
(16.47)

In order to determine \(\ddot{\hat{\tau }}(\hat{t})\), we start with \(f(\hat{x}(t),\hat{\tau }(t)) = 0\) along the solution. Differentiating twice yields (omitting arguments)

$$\displaystyle{ f_{x}\dot{\hat{x}} + f_{\tau }\dot{\hat{\tau }} = 0 }$$
(16.48)

and

$$\displaystyle{ f_{\mathit{xx}}(\dot{\hat{x}},\dot{\hat{x}}) + f_{x\tau }(\dot{\hat{x}})(\dot{\hat{\tau }}) + f_{x}\ddot{\hat{x}} + f_{x\tau }(\dot{\hat{x}})(\dot{\hat{\tau }}) + f_{\tau \tau }(\dot{\hat{\tau }},\dot{\hat{\tau }}) + f_{\tau }\ddot{\hat{\tau }} = 0. }$$
(16.49)

Since \(\dot{\hat{\tau }}(\hat{t}) = 0\), the relation (16.48) gives

$$\displaystyle{ f_{x}(x^{{\ast}},\tau ^{{\ast}})v = 0,\quad v =\dot{\hat{ x}}(\hat{t})\neq 0, }$$
(16.50)

with \(x^{{\ast}} =\hat{ x}(\hat{t})\) and \(\tau ^{{\ast}} =\hat{\tau } (\hat{t})\) for short. Thus, the square matrix \(f_{x}(x^{{\ast}},\tau ^{{\ast}})\) is rank-deficient, so that there is a vector u ≠ 0 with

$$\displaystyle{ u^{T}f_{ x}(x^{{\ast}},\tau ^{{\ast}}) = 0. }$$
(16.51)

Multiplying (16.49) by \(u^{T}\) from the left and evaluating at \(\hat{t}\) yields

$$\displaystyle{ u^{T}f_{\mathit{ xx}}(x^{{\ast}},\tau ^{{\ast}})(v,v) + u^{T}f_{\tau }(x^{{\ast}},\tau ^{{\ast}})\ddot{\hat{\tau }}(\hat{t}) = 0. }$$
(16.52)

If we now assume that

$$\displaystyle{ u^{T}f_{\mathit{ xx}}(x^{{\ast}},\tau ^{{\ast}})(v,v)\neq 0,\quad u^{T}f_{\tau }(x^{{\ast}},\tau ^{{\ast}})\neq 0 }$$
(16.53)

then we obtain

$$\displaystyle{ \ddot{\hat{\tau }}(\hat{t}) = -(u^{T}f_{\tau }(x^{{\ast}},\tau ^{{\ast}}))^{-1}(u^{T}f_{ xx}(x^{{\ast}},\tau ^{{\ast}})(v,v))\neq 0. }$$
(16.54)

Note that the assumptions on \((x^{{\ast}},\tau ^{{\ast}})\) we have required here are just those that characterize a so-called simple turning point, see, e.g., [8, 15].
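The quantities entering (16.50)-(16.54) are easy to evaluate numerically. In the following sketch (our own illustration), v and u are taken as the right and left singular vectors of \(f_{x}(x^{{\ast}},\tau ^{{\ast}})\) belonging to its smallest singular value, and the bilinear term \(u^{T}f_{xx}(x^{{\ast}},\tau ^{{\ast}})(v,v)\) is approximated by a central finite difference of \(u^{T}f\) along v.

```python
import numpy as np

def tau_ddot(f, fx, ftau, x_s, tau_s, eps=1e-5):
    """Evaluate (16.54) at a candidate turning point (x_s, tau_s);
    fx and ftau are user-supplied Jacobians of f w.r.t. x and tau."""
    U, S, Vt = np.linalg.svd(fx(x_s, tau_s))
    u, v = U[:, -1], Vt[-1, :]  # left/right null vectors, cf. (16.50), (16.51)
    # u^T f_xx (v, v) = d^2/ds^2 [u^T f(x_s + s*v, tau_s)] at s = 0
    ufxxvv = (u @ f(x_s + eps * v, tau_s) - 2 * u @ f(x_s, tau_s)
              + u @ f(x_s - eps * v, tau_s)) / eps ** 2
    uftau = u @ ftau(x_s, tau_s)
    return -ufxxvv / uftau  # nonzero at a simple turning point, cf. (16.53)

# Toy usage: f(x, tau) = x^2 + tau - 1 with turning point at x = 0, tau = 1;
# parametrized by arclength, tau'' = -2 there.
f = lambda x, tau: np.array([x[0] ** 2 + tau - 1.0])
fx = lambda x, tau: np.array([[2.0 * x[0]]])
ftau = lambda x, tau: np.array([1.0])
print(tau_ddot(f, fx, ftau, np.array([0.0]), 1.0))  # approx -2
```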

Example 3

Consider the example

$$\displaystyle{\begin{array}{l} \tau (1 - x_{3})\exp (10x_{1})/(1 + 0.01x_{1}) - x_{3} = 0, \\ 22\tau (1 - x_{3})\exp (10x_{1})/(1 + 0.01x_{1}) - 30x_{1} = 0, \\ x_{3} - x_{4} +\tau (1 - x_{3})\exp (10x_{2})/(1 + 0.01x_{2}) = 0, \\ 10x_{1} - 30x_{2} + 22\tau (1 - x_{4})\exp (10x_{2})/(1 + 0.01x_{2}) = 0, \end{array} }$$

from [11]. Starting from the trivial solution and moving into the positive cone, the solution path exhibits six turning points before the solution becomes nearly independent of τ, see Fig. 16.3, which has been produced by solving the corresponding DAE (16.44) by the implicit Euler method combined with standard root finding techniques. ♢

Fig. 16.3 Solution path for Example 3 projected into the \((\tau,x_{2})\)-plane
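For reference, here is a direct transcription of the residual of Example 3 (as printed above; we have not cross-checked the coefficients against [11]) that can be fed to the path-following sketches given earlier.

```python
import numpy as np

def f(x, tau):
    """Residual of the Example 3 system, transcribed from the text."""
    g1 = np.exp(10 * x[0]) / (1 + 0.01 * x[0])
    g2 = np.exp(10 * x[1]) / (1 + 0.01 * x[1])
    return np.array([
        tau * (1 - x[2]) * g1 - x[2],
        22 * tau * (1 - x[2]) * g1 - 30 * x[0],
        x[2] - x[3] + tau * (1 - x[2]) * g2,
        10 * x[0] - 30 * x[1] + 22 * tau * (1 - x[3]) * g2,
    ])

# The trivial solution x = 0, tau = 0 is a consistent starting point.
print(f(np.zeros(4), 0.0))  # -> [0. 0. 0. 0.]
```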

5 Conclusions

We have reviewed the theory of regular nonlinear DAEs of arbitrary index and given some extensions to overdetermined but consistent DAEs. We have also discussed several computational issues in the numerical treatment of such DAEs, namely suitable discretizations, efficient nonlinear system solvers and their stabilization, as well as automatic differentiation. Finally, we have presented a DAE approach to numerical path following for parametrized systems of nonlinear equations, including the detection and determination of (simple) turning points.