1 Introduction

It is well-known that numerical computations may sometimes lead to wrong results because of the accumulation of roundoff errors [8]. Recently, much work has been done to detect these accuracy errors in finite precision computations [1], by static [6, 9, 18] or dynamic [7] analysis, to find the least data formats needed to ensure a certain accuracy (precision tuning) [11, 12, 17] and to optimize the accuracy by program transformation [5, 14]. All these techniques are used late in the software development cycle, once the programs are entirely written.

In this article, we aim at exploring a different direction. We aim at detecting and correcting numerical accuracy errors at software development time, i.e. during the programming phase. From a software engineering point of view, the advantages of our approach are many since it is well-known that late bug detection is time and money consuming. We also aim at using intensively used techniques recognized for their ability to discard run-time errors. This choice is motivated by efficiency reasons as well as for end-user adoption reasons.

We propose an ML-like type system (strong, implicit, polymorphic [15]) for numerical computations in which the type of an arithmetic expression carries information on its accuracy. We use dependent types [16] and a type inference which, from the user point of view, acts like ML [13] type inference [15] even if it slightly differs in its implementation. While type systems have been widely used to prevent a large variety of software bugs, to our knowledge, no type system has been targeted to address numerical accuracy issues in finite precision computations. Basically, our type system accepts expressions for which it may ensure a certain accuracy on the result of the evaluation and it rejects expressions for which a minimal accuracy on the result of the evaluation cannot be inferred.

Let us insist on the fact that we use a dependent type system. Consequently, the type corresponding to a function of some argument x depends on the type of x itself. The soundness of our type system relies on a subject reduction theorem introduced in Sect. 4. Based on an instrumented operational semantics computing both the finite precision and exact results of a numerical computation, this theorem shows that the error on the result of the evaluation of some expression e is less than the error predicted by the type of e. Obviously, as any non-trivial type system, our type system is not complete and rejects certain programs that would not produce unbounded numerical errors. Our type system has been implemented in a prototype language Numl and we show that, in practice, our type system is expressive enough to type implementations of usual simple numerical algorithms [2] such as the ones of Sect. 5. Let us also mention that our type system represents a new application of dependent type theory motivated by applicative needs. Indeed, dependent types arise naturally in our context since accuracy depends on values.

This article is organized as follows. Section 2 introduces informally our type system and shows how it is used in our implementation of a ML-like programming language, Numl. The formal definition of the types and of the inference rules are given in Sect. 3. A soundness theorem is given in Sect. 4. Section 5 presents experimental results and Sect. 6 concludes.

2 Programming with Types for Numerical Accuracy

In this section, we present informally how our type system works throughout a programming sequence in our language, Numl. First of all, we use real numbers \(\mathsf {r}\{s,u,p\}\) where \(\mathsf {r}\) is the value itself, and \(\{s,u,p\}\) the format of \(\mathsf {r}\). The format of a real number is made of a sign \(s\in \mathsf {Sign}\) and integers \(u,p\in \mathsf {Int}\) such that u is the unit in the first place of \(\mathsf {r}\), written \(\mathsf {ufp}(\mathsf {r})\) and p the precision (i.e. the number of digits of the number). For inputs, p is either explicitly specified by the user or set by default by the system. For outputs, p is inferred by the type system. We have \(\mathsf {Sign}=\{0,\oplus ,\ominus ,\top \}\) and \(\mathsf {sign(r)}=0\) if \(\mathsf {r}=0\), \(\mathsf {sign(r)}=\oplus \) if \(\mathsf {r}>0\) and \(\mathsf {sign(r)}=\ominus \) if \(\mathsf {r}<0\). The set \(\mathsf {Sign}\) is equipped with the partial order relation \(\prec \subseteq \mathsf {Sign} \times \mathsf {Sign}\) defined by \(0\prec \oplus \), \(0\prec \ominus \), \(\oplus \prec \top \) and \(\ominus \prec \top \). The \(\mathsf {ufp}\) of a number x is

$$\begin{aligned} \mathsf {ufp}(x) = \min \big \{i\in \mathbb {N}\ :\ 2^{i+1} > x\big \} = \lfloor \log _2(x)\rfloor . \end{aligned}$$
(1)

The term p defines the precision of \(\mathsf {r}\). Let \(\varepsilon (\mathsf {r})\) be the absolute error on \(\mathsf {r}\), we assume that \( \varepsilon (\mathsf {r})< 2^{u-p+1}. \) The errors on the numerical constants arising in programs are specified by the user or determined by default by the system. The errors on the computed values can be inferred by propagation of the initial errors. Similarly to Eq. (1), we also define the unit in the last place (\(\mathsf {ulp}\)) used later in this article. The \(\mathsf {ulp}\) of a number of precision p is defined by

$$\begin{aligned} \mathsf {ulp}(x) = \mathsf {ufp}(x) - p+1. \end{aligned}$$
(2)
Fig. 1.
figure 1

Basic binary IEEE754 formats.

For example, the type of 1.234 is \(\mathtt {real\{+,0,53\}}\) since \(\mathsf {ufp}(1.234)=0\) and since we assume that, by default, the real numbers have the same precision as in the IEEE754 double precision floating-point format [1] (see Fig. 1). Other formats may be specified by the programmer, as in the example below. Let us also mention that our type system is independent of a given computer arithmetic. The interpreter only needs to implement the formats given by the type system, using floating-point numbers, fixed-point numbers [10], multiple precision numbersFootnote 1, etc. in order to ensure that the finite precision operations are computed exactly. The special case of IEEE754 floating-point arithmetic, which introduces additional errors due to the roundoff on results of operations can also be treated by modifying slightly the equations of Sect. 3.

figure a

Notice that, in Numl, the type information is used by the pretty printer to display only the correct digits of a number and a bound on the roundoff error.

Note that accuracy is not a property of a number but a number that states how closely a particular finite-precision number matches some ideal true value. For example, using the basis \(\beta =10\) for the sake of simplicity, the floating-point value 3.149 represents \(\pi \) with an accuracy of 3. It itself has a precision of 4. It represents the real number 3.14903 with an accuracy of 4. As in ML, our type system admits parameterized types [15].

figure b

In the example above, the type of f is a function of an argument whose parameterized type is \(\mathtt {real\{\texttt {'a},\texttt {'b},\texttt {'c}\}}\), where ’a, ’b and ’c are three type variables. The return type of the function f is Real{\(e_0\),\(e_1\),\(e_2\)} where \(e_0\), \(e_1\) and \(e_2\) are arithmetic expressions containing the variables ’a, ’b and ’c. By default these expressions are not displayed by the system (just like higher order values are not explicitly displayed in ML implementations) but we may enforce the system to print them. In Numl, we write +, -, * and / for the operators over real numbers. Integer expressions have type int and we write +_, -_, *_ and /_ for the elementary operators over integers. The expressions arising in the type of f are explained in Sect. 3. As shown below, various applications of f yield results of various types, depending on the type of the argument.

figure c

If the interpreter detects that the result of some computation has no significant digit, then an error is raised. For example, it is well-known that in IEEE754 double precision \((10^{16} + 1)-10^{16}=0\). Our type system rejects this computation.

figure d

Last but not least, our type system accepts recursive functions. For example, we have:

figure e

In the above session, the type system unifies the return type of the function with the type of the conditional. The types of the then and else branches also need to be unified. Then the return type is real{+,0,53} which corresponds to the type of the value 1.0 used in the then branch. The type system also unifies the return type with the type of the argument since the function is recursive. Finally, we obtain that the type of g is real{+,0,53} -> real{+,0,53}. As a consequence, we cannot call g with an argument whose ufp is greater than \(\mathsf {ufp}(1.0)=0\). To overcome this limitation, we introduce new comparison operations for real numbers. While the standard comparison operator \(\texttt {<}\) has type ’a -> ’a -> bool, the operator \(\texttt {<\{s,u,p\}}\) has type real{s,u,p} -> real{s,u,p} -> bool. In other words, the compared value are cast in the format \(\{s,u,p\}\) before performing the comparison. Now we can write the code:

figure f

Interestingly, unstable functions (for which the initial errors grow with the number of iterations) are not typable. This is a desirable property of our system.

figure g

Stable computations should be always accepted by our type system. Obviously, this is not the case and, as any non-trivial type system, our type system rejects some correct programs. The challenge is then to accept enough programs to be useful from an end-user point of view. We end this section by showing another example representative of what our type system accepts. More examples are given later in this article, in Sect. 5. The example below deals with the implementation of the Taylor series The implementation gives rise to a simple recursion, as shown in the programming session below.

figure h

Obviously, our type system computes the propagation of the errors due to finite precision but does not take care of the method error intrinsic to the implemented algorithm (the Taylor series instead of the exact formula \(\frac{1}{1-x}\) in our case.) All the programming sessions introduced above as well as the additional examples of Sect. 5 are fully interactive in our system, Numl, i.e. the type judgments are obtained instantaneously (about 0.01 s in average following our measurements) including the most complicated ones.

3 The Type System

In this section, we introduce the formal definition of our type system for numerical accuracy. First, in Sect. 3.1, we define the syntax of expressions and types and we introduce a set of inference rules. Then we define in Sect. 3.2 the types of the primitives for the operators among real numbers (addition, product, etc.) These types are crucial in our system since they encode the propagation of the numerical accuracy information.

3.1 Expressions, Types and Inference Rules

In this section, we introduce the expressions, types and typing rules for our language. For the sake of simplicity, the syntax introduced hereafter uses notations à la lambda calculus instead of the ML-like syntax employed in Sect. 2. In our system, expressions and types are mutually dependent. They are defined inductively using the grammar of Eq. (3).

$$\begin{aligned} \begin{array}{rcl} \mathsf {Expr}\ni e&{}{:}\,\!{:}\!\!=&{}\ \mathtt {r}\{s,u,p\}\in \mathsf {Real}_{u,p}\ |\ \mathtt {i}\in \mathsf {Int} \ |\ \mathtt {b}\in \mathsf {Bool}\ |\ \mathtt {id}\in \mathsf {Id} \\ &{}&{}|\ \mathtt {if}\ e_0\ \mathtt {then}\ e_1\ \mathtt {else}\ e_2 \ |\ \lambda x.e \ |\ e_0\ e_1\ |\ \mathtt {rec}\ f\ x.e\ |\ t\\ &{}&{}\\ \mathsf {Typ}\ni t&{}{:}\,\!{:}\!\!=&{}|\ \mathtt {int}\ |\ \mathtt {bool}\ |\ \mathtt {real\{}i_0,i_1,i_2\mathtt {\}}\ |\ \alpha \ |\ \varPi x:e_0.e_1\\ &{}&{}\\ \mathsf {IExp}\ni i&{}{:}\,\!{:}\!\!=&{}|\ \mathtt {int}\ |\ \mathtt {op}\in \mathsf {Id}_\mathsf {I}\ |\ \alpha \ |\ i_0\ i_1 \end{array} \end{aligned}$$
(3)

In Eq. (3), the e terms correspond to expressions. Constants are integers \(\mathtt {i}\in \mathsf {Int}\), booleans \(\mathtt {b}\in \mathsf {Bool}\) and real numbers \(\mathtt {r}\{s,u,p\}\) where \(\mathsf {r}\) is the value itself, \(s\in \mathsf {Sign}\) is the sign as defined in Sect. 2 and \(u,p\in \mathsf {Int}\) the \(\mathsf {ufp}\) (see Eq. (1)) and precision of \(\mathtt {r}\). For inputs, the precision p is given by the user by means of annotations or chosen by default by the system. Then p is inferred for the outputs of programs. The term p defines the precision of \(\mathsf {r}\). Let \(\varepsilon (\mathsf {r})\) be the absolute error on \(\mathsf {r}\), we assume that

$$\begin{aligned} \varepsilon (\mathsf {r})< 2^{u-p+1}. \end{aligned}$$
(4)

The errors on the numerical constants arising in programs are specified by the user or determined by default by the system. The errors on the computed values can be inferred by propagation of the initial errors.

Fig. 2.
figure 2

Typing rules for our language.

In Eq. (3), identifiers belong to the set \(\mathsf {Id}\) and we assume a set of pre-defined identifiers \(+\), −, \(\times \), \(\le \), \(=\), \(\ldots \) related to primitives for the logical and arithmetic operations. We write \(+\), −, \(\times \) and \(\div \) the operations on real numbers and \(+\_\), \(-\_\), \(\times \_\) and \(\div \_\) the operations among integers. The language also admits conditionals, functions \(\lambda x.e\), applications \(e_0\ e_1\) and recursive functions \(\texttt {rec}\ f\ x.e\) where f is the name of the function, x the parameter and e the body. The language of expressions also includes type expressions t defined by the second production of the grammar of Eq. (3).

The definition of expressions and type is mutually recursive. Type variables are denoted \(\alpha \), \(\beta \), \(\ldots \) and \(\varPi x:e_0.e_1\) is used to introduce dependent types [16]. Let us notice that our language does not explicitly contain function types \(t_0 \rightarrow t_1\) since they are encoded by means of dependent types. Let \(\equiv \) denote the syntactic equivalence, we have

$$\begin{aligned} t_0\rightarrow t_1 \equiv \varPi x:t_0.t_1\quad \text {with}\ x\ \text {not free in}\ t_1. \end{aligned}$$
(5)

For convenience, we also write \(\lambda x_0.x_1\ldots x_n.e\) instead of \(\lambda x_0.\lambda x_1\ldots \lambda x_n.e\) and \(\varPi x_0:t_0.x_1:t_1\ldots x_n:t_n.e\) instead of \(\varPi x_0:t_0.\varPi x_1:t_1\ldots \varPi x_n:t_n.e\).

The types of constants are \(\mathtt {int}\), \(\mathtt {bool}\) and \(\mathtt {real\{}i_0,i_1,i_2\mathtt {\}}\) where \(i_0\), \(i_1\) and \(i_2\) are integer expressions denoting the format of the real number. Integer expressions of \(\mathsf {IExpr}\subseteq \mathsf {Expr}\) are a subset of expressions made of integer numbers, integer primitives of \(\mathsf {Id}_\mathsf {I}\subseteq \mathsf {Id}\) (such as \(+\_\), \(\times \_\), etc.), type variables and applications. Note that this definition restricts significantly the set of expressions which may be written inside real types.

Fig. 3.
figure 3

The sub-typing relation \(\sqsubseteq \) of Eq. (6).

The typing rules for our system are given in Fig. 2. These rules are mostly classical. The type judgment \(\varGamma \vdash e : t\) means that in the type environment \(\varGamma \), the expression e has type t. A type environment \(\varGamma : \mathsf {Id}\rightarrow \mathsf {Typ}\) maps identifiers to types. We write \(\varGamma \ x:t\) the environment \(\varGamma \) in which the variable x has type t. The typing rules (Int) and (Bool) are trivial. Rule (Real) states that the type of a real number \(\texttt {r\{s,u,p\}}\) is \(\mathtt {real\{s,u,p\}}\) assuming that the actual sign of r is less than s and that the ufp of r is less than u. Following Rule (Id), an identifier id has type t if \(\varGamma (\mathtt {id})=t\). Rules (Cond), (Abs) and (Rec) are standard rules for conditionals and abstractions respectively. The rule for application, (App), requires that the first expression \(e_1\) has type \(\varPi x: t_0. t_1\) (which is equivalent to \(t_0\rightarrow t_1\) if x is not free in \(t_1\)) and that the argument \(e_2\) has some type \(t_2\sqsubseteq t_0\). The sub-typing relation \(\sqsubset \) is introduced for real numbers. Intuitively, we want to allow the argument of some function to have a smaller ulp than what we would require if we used \(t_0=t_2\) in Rule (App), provided that the precision p remains as good with \(t_2\) as with \(t_0\). This relaxation allows to type more terms without invalidating the type judgments. Formally, the relation \(\sqsubseteq \) is defined by

$$\begin{aligned} \mathtt {real\{s_1,u_1,p_1\}} \sqsubseteq \mathtt {real\{s_2,u_2,p_2\}} \iff s_1\sqsubseteq s_2\ \wedge \ u_2 \ge u_1\ \wedge \ p_2 \le u_2 - u_1 + p_1. \end{aligned}$$
(6)

In other words, the sub-typing relation of Eq. (6) states that it is always correct to add zeros before the first significant digit of a number, as illustrated in Fig. 3.

3.2 Types of Primitives

In this section, we introduce the types of the primitives of our language. As mentioned earlier, the arithmetic and logic operators are viewed as functional constants of the language. The type of a primitive for an arithmetic operation among integers \(*\_ \in \{+\_, -\_,\times \_,\div \_\}\) is

$$\begin{aligned} t_{*\_}=\varPi x:\texttt {int}.y:\texttt {int}.\texttt {int}. \end{aligned}$$
(7)

The type of comparison operators \(\bowtie \in \{=,\not =,<,>,\le ,\ge \}\) are polymorphic with the restriction that they reject the type \(\mathtt {real\{s,u,p\}}\) which necessitates special comparison operators:

$$\begin{aligned} t_{\bowtie }=\varPi x:\alpha .y:\alpha .\texttt {bool}\quad \alpha \not =\mathtt {real\{s,u,p\}}. \end{aligned}$$
(8)

For real numbers, we use comparisons at a given accuracy defined by the operators \(\bowtie _{\{u,p\}}\in \{<_{\{u,p\}},>_{\{u,p\}}\}\). We have

$$\begin{aligned} t_{\bowtie _{\{u,p\}}} = \varPi \mathtt {s}:\texttt {int},\mathtt {u}:\texttt {int}, \mathtt {p}:\texttt {int}. \mathtt {real\{s,u,p+1\}}\rightarrow \mathtt {real\{s,u,p+1\}} \rightarrow \mathtt {bool} . \end{aligned}$$

Notice that the operands of a comparison \(\bowtie _{\{u,p\}}\) must have \(p+1\) bits of accuracy. This is to avoid unstable tests, as detailed in the proof of Lemma 3 in Sect. 4. An unstable test is a comparison between two approximate values such that the result of the comparison is altered by the approximation error. For instance, if we reuse an example of Sect. 2, in IEEE754 double precision, the condition \(10^{16}+1=10^{16}\) evaluates to true. We need to avoid such situations in our language in order to preserve our subject reduction theorem (we need the control-flow be the same in the finite precision and exact semantics). Let us also note that our language does not provide an equality relation \(=_{\{u,p\}}\) for real values. Again this is to avoid unstable tests. Given values x and y of type \(\mathtt {real\{s,u,p\}}\), the programmer is invited to use \(|x-y|<2^{u-p+1}\) instead of \(x=y\) in order to get rid of the perturbations of the finite precision arithmetic.

Fig. 4.
figure 4

Types of the primitives corresponding to the elementary arithmetic operations \(*\in \{+,-,\times ,\div \}\). The functions \(\mathcal {S}_*\) and \(\sigma _*\) are defined in Fig. 5.

The types of primitives for real arithmetic operators are fundamental in our system since they encode the propagation of the numerical accuracy information. They are defined in Figs. 4 and 5. The type \(t_*\) of some operation \(*\in \{+,-,\times ,\div \}\) is a pi-type with takes six arguments \(\mathsf {s_1,u_1,p_1,s_2,u_2}\) and \(\mathsf {p_2}\) of type int corresponding to the sign, ufp and precision of the two operands of \(*\) and which produces a type \(\mathtt {real\{s_1,u_1,p_1\}}\rightarrow \mathtt {real\{s_2,u_2,p_2\}} \rightarrow \mathtt {real}\{ {\mathcal {S}_*(\mathtt {s_1},\mathtt {s_2})},\) \( {\mathcal {U}_*(\mathtt {s_1},\mathtt {u_1},\mathtt {s_2},\mathtt {u_2})}, {\mathcal {P}_*(\mathtt {u_1},\mathtt {p_1},\mathtt {u_2},\mathtt {p_2})}\} \) where \(\mathcal {S}_*\), \(\mathcal {U}_*\) and \(\mathcal {P}_*\) are functions which compute the sign, ufp and precision of the result of the operation \(*\) in function of \(\mathsf {s_1,u_1,p_1,s_2,u_2}\) and \(\mathsf {p_2}\). These functions extend the functions used in [12].

The functions \(\mathcal {S}_*\) determine the sign of the result of an operation in function of the signs of the operands and, for additions and subtractions, in function of the ufp of the operands. The functions \(\mathcal {U}_*\) compute the ufp of the result. Notice that \(\mathcal {U}_+\) and \(\mathcal {U}_-\) use the functions \(\sigma _+\) and \(\sigma _-\), respectively. These functions are defined in the bottom right corner of Fig. 5 to increment the ufp of the result of some addition or subtraction in the relevant cases only. For example if a and b are two positive real numbers then \(\mathsf {ufp}(a+b)\) is possibly \(\max \big (\mathsf {ufp}(a),\mathsf {ufp}(b)\big )+1\) but if \(a>0\) and \(b<0\) then \(\mathsf {ufp}(a+b)\) is not greater than \(\max \big (\mathsf {ufp}(a),\mathsf {ufp}(b)\big )\). The functions \(\mathcal {P}_*\) compute the precision of the result. Basically, they compute the number of bits between the ufp and the ulp of the result.

Fig. 5.
figure 5

Operators used in the types of the primitives of Fig. 4.

We end this section by exhibiting some properties of the functions \(\mathcal {P}_*\). Let \(\varepsilon (x)\) denote the error on \(x\in \mathsf {Real}_{u,p}\). We have \(\varepsilon (x)< 2^{u-p+1}=\mathsf {ulp}(x)\). Let us start with addition. Lemma 1 relates the accuracy of the operands to the accuracy of the result of an addition between two values x and y. Lemma 2 is similar to Lemma 1 for product.

Lemma 1

Let x and y be two values such that \(\varepsilon (x)< 2^{u_1-p_1+1}\) and \(\varepsilon (y)< 2^{u_2-p_2+1}\). Let \(z=x+y\), \(u=\mathcal {U}_+(s_1,u_1,s_2,u_2)\) and \( p=\mathcal {P}_+(s_1,u_1,p_1,s_2,u_2,p_2)\). Then \(\varepsilon (z)<2^{u-p+1}\).

Proof

The errors on addition may be bounded by \(e_+=\varepsilon (x)+\varepsilon (y)\). Then the most significant bit of the error has weight \(\mathsf {ufp}(e_+)\) and the accuracy of the result is \(p=\mathsf {ufp}(x + y)-\mathsf {ufp}(e_+)\). Let \( u=\mathsf {ufp}(x+y)=\max (u_1,u_2)+\sigma _+(s_1,s_2)=\mathcal {U}_+(s_1,u_1,s_2,u_2). \) We need to over-approximate \(e_+\) in order to ensure p. We have \( \varepsilon (x)< 2^{u_1-p_1+1}\) and \(\varepsilon (y)< 2^{u_2-p_2+1} \) and, consequently, \( e_+ < 2^{u_1-p_1+1}+ 2^{u_2-p_2+1}. \) We introduce the function \(\iota (x,y)\) also defined in Fig. 4 and which is equal to 1 if \(x=y\) and 0 otherwise. We have

$$ \begin{array}{rcl} \mathsf {ufp}(e_+) &{}<&{} \max (u_1-p_1+1,u_2-p_2+1)+\iota (u_1-p_1,u_2-p_2) \\ &{}\le &{} \max (u_1-p_1,u_2-p_2)+\iota (u_1-p_1,u_2-p_2) \end{array}$$

Let us write \(p = \max (u_1-p_1,u_2-p_2)-\iota (u_1-p_1,u_2-p_2)=\mathcal {P}_+(s_1,u_1,p_1s_2,u_2,p_2)\). We conclude that \(u=\mathcal {U}_+(s_1,u_1,s_2,u_2)\), \(p=\mathcal {P}_+(s_1,u_1,p_1s_2,u_2,p_2)\) and \(\varepsilon (z) < 2^{u-p+1}\).    \(\Box \)

Lemma 2

Let x and y be two values such that \(\varepsilon (x)< 2^{u_1-p_1+1}\) and \(\varepsilon (y)< 2^{u_2-p_2+1}\). Let \(z=x\times y\), \(u=\mathcal {U}_\times (s_1,u_1,s_2,u_2)\) and \( p=\mathcal {P}_\times (s_1,u_1,p_1,s_2,u_2,p_2)\). Then \(\varepsilon (z)<2^{u-p+1}\).

Proof

For product, we have \(p = \mathsf {ufp}(x\times y)-\mathsf {ufp}(e_\times )\) with \(e_\times =x\cdot \varepsilon (y)+y\cdot \varepsilon (x)+\varepsilon (x)\cdot \varepsilon (y)\). Let \( u=u_1+u_2+1=\mathcal {U}_\times (s_1,u_1,s_2,u_2) \). We have, by definition of \(\mathsf {ufp}\), \( 2^{u_1} \le x< 2^{u_1+1}\quad \text {and}\quad 2^{u_2} \le y< 2^{u_2+1}.\) Then \(e_\times \) may be bounded by

$$\begin{aligned} \begin{array}{rcl} e_\times &{}<&{} 2^{u_1+1}\cdot 2^{u_2-p_2+1} + 2^{p_2+1}\cdot 2^{u_1-p_1+1}+ 2^{u_1-p_1+1}\cdot 2^{u_2-p_2+1}\\ &{}=&{} 2^{u_1+u_2-p_2+2}+ 2^{u_1+u_2-p_1+2}+ 2^{u_1+u_2-p_1-p_2+2}. \end{array} \end{aligned}$$
(9)

Since \(u_1+u_2-p_1-p_2+2< u_1+u_2-p_1+2\) and \(u_1+u_2-p_1-p_2+2< u_1+u_2-p_2+2\), we may get rid of the last term of Eq. (9) and we obtain that

$$\begin{aligned} \begin{array}{rcl} \mathsf {ufp}(e_\times ) &{}<&{} \max (u_1+u_2-p_1+2,u_1+u_2-p_2+2)+\iota (p_1,p_2)\\ &{}\le &{} \max (u_1+u_2-p_1+1,u_1+u_2-p_2+1)+\iota (p_1,p_2). \end{array} \end{aligned}$$

Let us write \( p = \max (u_1+u_2-p_1+1,u_1+u_2-p_2+1)-\iota (p_1,p_2)=\mathcal {P}_\times (s_1,u_1,p_1s_2,u_2,p_2)\). Then \(u=\mathcal {U}_\times (s_1,u_1,s_2,u_2)\), \(p=\mathcal {P}_\times (s_1,u_1,p_1s_2,u_2,p_2)\) and \(\varepsilon (z) < 2^{u-p+1}\).    \(\Box \)

Note that, by reasoning on the exponents of the values, the constraints resulting from a product become linear. The equations for subtraction and division are almost identical to the equations for addition and product, respectively. We conclude this section with the following theorem which summarize the properties of the types of the result of the four elementary operations.

Theorem 1

Let x and y be two values such that \(\varepsilon (x)< 2^{u_1-p_1+1}\) and \(\varepsilon (y)< 2^{u_2-p_2+1}\) and let \(*\in \{ +,-,\times ,\div \}\) be an elementary operation. Let \(z=x*y\), \( u=\mathcal {U}_*(s_1,u_1,s_2,u_2)\) and \( p=\mathcal {P}_*(s_1,u_1,p_1,s_2,u_2,p_2)\). Then \(\varepsilon (z)<2^{u+p-1}\).

Proof

The cases of addition and product correspond to Lemmas 1 and 2, respectively. The cases of subtraction and division are similar.    \(\Box \)

Numl uses a modified Hindley-Milner type inference algorithm. Linear constraints among integers are generated (even for non linear expressions). They are solved space limitation reasons, the details of this algorithm are out of the scope of this article.

Fig. 6.
figure 6

Operational semantics for our language.

4 Soundness of the Type System

In this section, we introduce a subject reduction theorem proving the consistency of our type system. We use two operational semantics \(\rightarrow _\mathbb {F}\) and \(\rightarrow _\mathbb {R}\) for the finite precision and exact arithmetics, respectively. The exact semantics is used for proofs. Obviously, in practice, only the finite precision semantics is implemented. We write \(\rightarrow \) whenever a reduction rule holds for both \(\rightarrow _\mathbb {F}\) and \(\rightarrow _\mathbb {R}\) (in this case, we assume that the same semantics \(\rightarrow _\mathbb {F}\) or \(\rightarrow _\mathbb {R}\) is used in the lower and upper parts of the same sequent). Both semantics are displayed in Fig. 6. They concern the subset of the language of Eq. (3) which do not deal with types.

$$\begin{aligned} \begin{array}{rcl} \mathsf {EvalExpr}\ni e&{}{:}{:=}&{}\ \mathtt {r}\{s,u,p\}\in \mathsf {Real}_{u,p}\ |\ \mathtt {i}\in \mathsf {Int} \ |\ \mathtt {b}\in \mathsf {Bool}\ |\ \mathtt {id}\in \mathsf {Id} \\ &{}&{}|\ \mathtt {if}\ e_0\ \mathtt {then}\ e_1\ \mathtt {else}\ e_2 \ |\ \lambda x.e \ |\ e_0\ e_1\ |\ \mathtt {rec}\ f\ x.e |\ e_0\ *\ e_1 \end{array}. \end{aligned}$$
(10)

In Eq. (10), \(*\) denotes an arithmetic operator \(*\in \{+,-,\times ,\div ,+\_,-\_,\times \_,\) \(\div \_\}\). In Fig. 6, Rule (FVal) of \(\rightarrow _\mathbb {F}\) transforms a syntactic element describing a real number \(\mathtt {r\{s,u,p\} }\) in a certain format into a value \(v_\mathbb {F}\). The finite precision value \(v_\mathbb {F}\) is an approximation of \(\mathtt {r}\) with an error less than the ulp of \(\mathtt {r\{s,u,p\}}\). In the semantics \(\rightarrow _\mathbb {R}\), the real number \(\mathtt {r\{s,u,p\} }\) simply produces the value r without any approximation by Rule (RVal). Rules (Op1) and (Op2) evaluate the operands of some binary operation and Rule (Op) performs an operation \(*\in \{+,-,\times ,\div ,+\_,-\_,\times \_,\div _\_\}\) between two values \(v_0\) and \(v_1\).

Rules (Cmp1), (Cmp2) and (ACmp) deal with comparisons. They are similar to Rules (Op1), (Op2) and (Op) described earlier. Note that the operators \(<,\ >,\ =,\ \not =\) concerned by Rule (ACmp) are polymorphic except that they do not accept arguments of type real. Rules (FCmp) and (RCmp) are for the comparison of real values. Rule (FCmp) is designed to avoid unstable tests by requiring that the distance between the two compared values is greater than the ulp of the format in which the comparison is done. With this requirement, a condition cannot be invalidated by the roundoff errors. Let us also note that, with this definition, \(x<_{u,p} y \not \Rightarrow y>_{u,p} x\) or \(x>_{u,p} y \not \Rightarrow y<_{u,p} x\). For the semantics \(\rightarrow _\mathbb {R}\), Rule (RCmp) simply compares the exact values.

The other rules are standard and are identical in \(\rightarrow _\mathbb {F}\) and \(\rightarrow _\mathbb {R}\). Rules (App1), (App2) and (Red) are for applications and Rule (Rec) is for recursive functions. We write \(e\langle v/x\rangle \) the term e in which v has been substituted to the free occurrences of x. Rules (Cond), (CondTrue) and (CondFalse) are for conditionals.

Fig. 7.
figure 7

Simulation relation \(\models \) used in our subject reduction theorem.

The rest of this section is dedicated to our subject reduction theorem. First of all, we need to relate the traces of \(\rightarrow _\mathbb {F}\) and \(\rightarrow _\mathbb {R}\). We introduce new judgments

$$\begin{aligned} \varGamma \models (e_\mathbb {F},e_\mathbb {R})\ :\ t. \end{aligned}$$
(11)

Intuitively, Eq. (11) means that expression \(e_\mathbb {F}\) simulates \(e_\mathbb {R}\) up to accuracy t. In this case, \(e_\mathbb {F}\) is syntactically equivalent to \(e_\mathbb {R}\) up to the values which, in \(e_\mathbb {F}\), are approximations of the values of \(e_\mathbb {R}\). The value of the approximation is given by type t.

Formally, \(\models \) is defined in Fig. 7. These rules are similar to the typing rules of Fig. 2 excepted that they operate on pairs \((e_\mathbb {F},e_\mathbb {R})\). They are also designed for the language of Eq. (10) and, consequently, deal with the elementary arithmetic operations \(+,\ -,\ \times \) and \(\div \) as well as the comparison operators. The difference between the rules of Figs. 2 and  7 is in Rule (VReal) which states that a real value \(v_\mathbb {R}\) is correctly simulated by a value \(v_\mathbb {F}\) up to accuracy \(\mathtt {real\{s,u,p\}}\) if \(|v_\mathbb {R}-v_\mathbb {F}|<2^{u-p+1}\). It is easy to show, by examination of the rules of Figs. 2 and 7 that

$$\begin{aligned} \varGamma \models (e_\mathbb {F},e_\mathbb {R})\ :\ t\ \Longrightarrow \varGamma \vdash e_\mathbb {F}\ :\ t. \end{aligned}$$
(12)

We introduce now Lemma 3 which asserts the soundness of the type system for one reduction step. Basically, this lemma states that types are preserved by reduction and that concerning the values of type real, the distance between the finite precision value and the exact value is less than the ulp given by the type.

Lemma 3

(Weak subject reduction). If \(\varGamma \models (e_\mathbb {F},e_\mathbb {R})\ : t\) and if \(e_\mathbb {F}\rightarrow _\mathbb {F}e_\mathbb {F}'\) and \(e_\mathbb {R}\rightarrow _\mathbb {R}e_\mathbb {R}'\) then \(\varGamma \models (e_\mathbb {F}',e_\mathbb {R}')\ : t\).

Proof

By induction on the structure of expressions and case examination on the possible transition rules of Fig. 6.

  • If \(e_\mathbb {F}\equiv e_\mathbb {R}\equiv \mathtt {r\{s,u,p\}}\) then \(\varGamma \models (\texttt {r\{s,u,p\}},\texttt {r\{s,u,p\}})\ :\ \mathtt {real\{s,u,p\}} \) and, from the reduction rules (FVal) and (RVal) of Fig. 6, \(\mathtt {r\{s,u,p\}} \rightarrow _\mathbb {F}v_\mathbb {F}\) and \(\mathtt {r\{s,u,p\}} \rightarrow _\mathbb {R}v_\mathbb {R}\) with \(|v_\mathbb {F}-v_\mathbb {R}|< 2^{u-p+1}\). So \(\varGamma \models (v_\mathbb {F},v_\mathbb {R})\ :\ \mathtt {real\{s,u,p\}} \).

  • If \(e_\mathbb {F}\equiv e_{0\mathbb {F}}*e_{1\mathbb {F}}\) and \(e_\mathbb {R}\equiv e_{0\mathbb {R}}*e_{1\mathbb {R}}\) then several cases must be distinguished.

    • If \(e_\mathbb {F}\equiv v_{0\mathbb {F}}\ *\ v_{1\mathbb {F}}\) and \(e_\mathbb {R}\equiv v_{0\mathbb {R}}\ *\ v_{1\mathbb {R}}\) then, by induction hypothesis, \(\varGamma \models (v_{0\mathbb {F}},v_{0\mathbb {R}})\ :\mathtt {real\{s_0,u_0,p_0\}}\), \(\varGamma \models (v_{1\mathbb {F}},v_{1\mathbb {R}})\ :\mathtt {real\{s_1,u_1,p_1\}}\) and, consequently, from Rule (VReal),

      $$\begin{aligned} |v_{0\mathbb {R}}-v_{0\mathbb {F}}|<2^{u_0-p_0+1}\quad \text {and}\quad |v_{1\mathbb {R}}-v_{1\mathbb {F}}|<2^{u_1-p_1+1}. \end{aligned}$$
      (13)

      Following Fig. 4, the type t of e is

      $$ \begin{array}{rcl} t&{}=&{} \big (\varPi \mathtt {s_1}:\texttt {int},\mathtt {u_1}:\texttt {int}, \mathtt {p_1}:\texttt {int}, \mathtt {s_2}:\texttt {int},\mathtt {u_2}:\texttt {int}, \mathtt {p_2}:\texttt {int}.\\ &{}&{}\quad \mathtt {real\{s_1,u_1,p_1\}}\rightarrow \mathtt {real\{s_2,u_2,p_2\}}\rightarrow \\ &{}&{}\quad \rightarrow \mathtt {real\{\mathcal {S}_*(\mathtt {s_1},\mathtt {u_1},\mathtt {s_2},\mathtt {u_2}),\mathcal {U}_*(\mathtt {s_1},\mathtt {u_1},\mathtt {s_2},\mathtt {u_2}),\mathcal {P}_*(\mathtt {s_1},\mathtt {u_1},\mathtt {p_1},\mathtt {s_2},\mathtt {u_2},\mathtt {p_2})\}}\\ &{}&{}\big )\ \mathtt {s_1\ u_1\ p_1\ s_2\ u_2\ p_2},\\ &{}=&{}\mathtt {real\{\mathcal {S}_*(\mathtt {s_1},\mathtt {u_1},\mathtt {s_2},\mathtt {u_2}),\mathcal {U}_*(\mathtt {s_1},\mathtt {u_1},\mathtt {s_2},\mathtt {u_2}),\mathcal {P}_*(\mathtt {s_1},\mathtt {u_1},\mathtt {p_1},\mathtt {s_2},\mathtt {u_2},\mathtt {p_2})\}}\\ &{}=&{}\mathtt {real\{s,u,p\}} \end{array} $$

      By Rule (Op), \(e\rightarrow _\mathbb {F}v_\mathbb {F}\) and \(e\rightarrow _\mathbb {R}v_\mathbb {R}\) and, by Theorem 1, with the assumptions of Eq. (13), we know that \(|v_\mathbb {R}-v_\mathbb {F}|<2^{u-p+1}\). Consequently, \(\varGamma \models (v_\mathbb {F},v_\mathbb {R})\ :\ \mathtt {real\{s,u,p\}}\).

    • If \(e_\mathbb {F}\equiv v_{0\mathbb {F}}\ *\ v_{1\mathbb {F}}\) and \(e_\mathbb {R}\equiv v_{0\mathbb {R}}\ *\ v_{1\mathbb {R}}\) with \(\varGamma \models (v_0, v_1)\,\ \mathtt {int}\) then, by Rule (Op), \(e\rightarrow (v,v)\) and, by Eq. (7), \(\varGamma \vdash v\,\ \mathtt {int}\). If \(e\equiv e_0*e_1\) then, by Rule \(\textsf {(Op1)}, e\rightarrow e_0*e_1'\) and we conclude by induction hypothesis. The case \(e\equiv e_0*\ v_1\) is similar to the former one.

  • If \(e_\mathbb {F}\equiv e_{0\mathbb {F}}\bowtie _{u,p} e_{1\mathbb {F}}\) and \(e_\mathbb {R}\equiv e_{0\mathbb {R}}\bowtie _{u,p} e_{1\mathbb {R}}\) then several cases have to be examined.

    • If \(e_\mathbb {F}\equiv v_{0\mathbb {F}}\bowtie _{u,p} v_{1\mathbb {F}}\) and \(e_\mathbb {R}\equiv v_{0\mathbb {R}}\bowtie _{u,p} v_{1\mathbb {R}}\) then by rules (FCmp) and (RCmp) \(e_\mathbb {F}\rightarrow _\mathbb {F}b_\mathbb {F}\), \(e_\mathbb {R}\rightarrow _\mathbb {R}b_\mathbb {R}\) with \(b_\mathbb {F}= v_{0\mathbb {F}}- v_{1\mathbb {F}} \bowtie _{\{u,p\}} 2^{u-p+1}\) and \(b_\mathbb {R}= v_{0\mathbb {R}} -v_{1\mathbb {R}} \bowtie _{\{u,p\}} 0\). By rule (RCmp) of Fig. 7, \(\varGamma \models (v_{0\mathbb {F}},v_{1\mathbb {F}})\,\ \mathtt {real\{s,u,p\}}\) and \(\varGamma \models (v_{0\mathbb {R}},v_{1\mathbb {R}})\,\ \mathtt {real\{s,u,p\}}\). Consequently, \( |v_{0\mathbb {R}}-v_{0\mathbb {F}}| < 2^{u-p+1}\) and \(|v_{1\mathbb {R}}-v_{1\mathbb {F}}| < 2^{u-p+1}\). By combining the former equations, we obtain that \( |(v_{0\mathbb {R}}-v_{1\mathbb {R}})-(v_{0\mathbb {F}}-v_{1\mathbb {F}})| < 2^{u-p}.\) Consequently, \(b_\mathbb {F}=b_\mathbb {R}\) and we conclude that \(\varGamma \models (b_\mathbb {F},b_\mathbb {R})\, \mathtt {bool}\).

    • The other cases for \(e_\mathbb {F}\equiv e_{0\mathbb {F}}\bowtie _{u,p} e_{1\mathbb {F}}\) are similar to the cases \(e_\mathbb {F}\equiv v_{0\mathbb {F}}\,*\, v_{1\mathbb {F}}\) examined previously.

  • The other cases simply follow the structure of the terms, by application of the induction hypothesis.\(\Box \)

Let \(\rightarrow _\mathbb {F}^*\) (resp. \(\rightarrow _\mathbb {R}^*\)) denote the reflexive transitive closure of \(\rightarrow _\mathbb {F}\) (resp. \(\rightarrow _\mathbb {R}\)). Theorem 2 expresses the soundness of our type system for sequences of reduction of arbitrary length.

Theorem 2

(Subject reduction). If \(\varGamma \models (e_\mathbb {F},e_\mathbb {R}): t\) and if \(e_\mathbb {F}\rightarrow ^*_\mathbb {F}e_\mathbb {F}'\) and \(e_\mathbb {R}\rightarrow ^*_\mathbb {R}e_\mathbb {R}'\) then \(\varGamma \models (e_\mathbb {F}',e_\mathbb {R}'): t\).

Proof

By induction on the length of the reduction sequence, using Lemma 3.\(\Box \)

Theorem 2 asserts the soundness of our type system. It states that the evaluation of an expression of type \(\mathtt {real\{s,u,p\}}\) yields a result of accuracy \(2^{u-p+1}\).

5 Experiments

In this section, we report some experiments showing how our type system behaves in practice. Section 5.1 presents Numl implementations of usual mathematical formulas while Sect. 5.2 introduce a larger example demonstrating the expressive power of our type system.

5.1 Usual Mathematical Formulas

Our first examples concern usual mathematical formulas, to compute the volume of geometrical objects or formulas related to polynomials. These examples aim at showing that usual mathematical formulas are typable in our system. We start with the volume of the sphere and of the cone.

figure i

We repeatedly define the function sphere with more precision in order to show the impact on the accuracy of the results. Note that the results now have 15 digits instead of the former 5 digits.

figure j

The next examples concern polynomials. We start with the computation of the discriminant of a second degree polynomial.

figure k

Our last example concerning usual formulas is the Taylor series development of the sine function. In the code below, observe that the accuracy of the result is correlated to the accuracy of the argument. As mentioned in Sect. 2, error methods are neglected, only the errors due to the finite precision are calculated (indeed, \(\sin \frac{\pi }{8}=0.382683432\ldots \)).

figure l

5.2 Newton-Raphson Method

In this section, we introduce a larger example to compute the zero of a function using the Newton-Raphson method. This example, which involves several higher order functions, shows the expressiveness of our type system. In the programming session below, we first define a higher order function deriv which takes as argument a function and computes its numerical derivative at a given point. Then we define a function g and compute the value of its derivative at point 2.0. Next, by partial application, we build a function computing the derivative of g at any point. Finally, we define a function newton which searches the zero of a function. The newton function is also an higher order function taking as argument the function for which a zero has to be found and its derivative.

figure m

We call the newton function with our function g and its derivative computed by partial application of the deriv function. We obtain a root of our polynomial g with a guaranteed accuracy. Note that while Newton-Raphson method converges quadratically in the reals, numerical errors may perturb the process [4].

6 Conclusion

In this article, we have introduced a dependent type system able to infer the accuracy of numerical computations. Our type system allows one to type non-trivial programs corresponding to implementations of classical numerical analysis methods. Unstable computations are rejected by the type system. The consistency of typed programs is ensured by a subject reduction theorem. To our knowledge, this is the first type system dedicated to numerical accuracy. We believe that this approach has many advantages going from early debugging to compiler optimizations. Indeed, we believe that the usual type float proposed by usual ML implementations, and which is a simple clone of the type int, is too poor for numerical computations. We also believe that this approach is a credible alternative to static analysis techniques for numerical precision [6, 9, 18]. For the developer, our type system introduces few changes in the programming style, limited to giving the accuracy of the inputs of the accuracy of comparisons to allow the typing of certain recursive functions.

A first perspective to the present work is the implementation of a compiler for Numl. We aim at using the type information to select the most appropriate formats (the IEEE754 formats of Fig. 1, multiple precisions numbers of the GMP library when needed or requested by the user or fixed-point numbers.) In the longer term, we also aim at introducing safe compile-time optimizations based on type preservation: an expression may be safely (from the accuracy point of view) substituted to another expression as long as both expressions are mathematically equivalent and that the new expression has a greater type than the older one in the sense of Eq. (6). Finally, a second perspective is to integrate our type system into other applicative languages. In particular, it would be of great interest to have such a type system inside a language used to build critical embedded systems such as the synchronous language Lustre [3]. In this context numerical accuracy requirements are strong and difficult to obtain. Our type system could be integrated naturally inside Lustre or similar languages.