
1 Introduction

Computing worst-case bounds on the potential loss of accuracy in numerical programs due to the use of floating-point arithmetic is of utmost importance in many fields of application, such as embedded systems or numerical simulation. Several analyses for the computation of sound error bounds have been proposed in the last 15 years, and generally implemented in academic prototypes. Most of them rely on abstractions of the values and absolute errors of program variables. Such analyses sometimes also output bounds on the relative errors, but these are mostly computed a posteriori, from the values and absolute errors. Still, the natural bound on the elementary error committed by each floating-point arithmetic operation is a bound on the relative error. This strongly suggests that relative error bounds can also play a role in the process of computing tight error estimates, which is what this work proposes to explore. We indeed note that on some patterns, abstractions relying only on absolute errors yield unreasonably conservative error bounds, and that a simple product with relative error bounds can bring a drastic improvement. One such pattern is a conditional statement that tests a quantity subject to a rounding error. We consider the very simple piece of code introduced in Example 1.

[Code listing of Example 1 not reproduced here.]

Example 1

Variables i, x, y are double precision floating-point numbers, and input i is given without error in the range [1, 100]. The multiplication x:=i*i results in variable x in [1,10000] with an elementary absolute rounding error \({\mathcal {E}}_{a}(x)\) bounded by \([-9.09e^{-13},9.09e^{-13}]\). If evaluated directly, using the fact that the elementary error in floating-point arithmetic is bounded in relative terms, we obtain a relative error \({\mathcal {E}}_{r}(x)\) bounded by \( [-1.11e^{-16},1.11e^{-16}]\). It is clear that some information is lost if the error is abstracted by bounds on the absolute error only, especially with a non-relational abstraction like intervals (and the wider the intervals, the more so). Take for instance the constraint \(\texttt {x <= 2.0}\) on our variable x. Using the relative error bound allows us to compute a much tighter absolute error bound in the true branch of the conditional. Indeed, the value of x knowing that the constraint is satisfied can be reduced to [1,2]. Thus, a new absolute error bound for x in this branch can be computed by \({\mathcal {E}}_{a}(x)= {\mathcal {E}}_{a}(x) \cap x{\mathcal {E}}_{r}(x) = [-9.09e^{-13},9.09e^{-13}] \cap [1,2][-1.11e^{-16},1.11e^{-16}] = [-2.22e^{-16},2.22e^{-16}]\). Therefore, the absolute error on variable y will be bounded by \([-2.22e^{-16},2.22e^{-16}]\). Without this reduced product, the error on x simply propagates as an error on y, and the absolute error bound on y is \([-9.09e^{-13},9.09e^{-13}]\).
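To make the reduction concrete, the computation above can be replayed with a few lines of Python, using our own toy interval helpers (not the Numerors implementation, and with no outward rounding of the bounds):

    # Toy interval helpers (illustration only, no outward rounding).
    def imul(a, b):
        vals = [x * y for x in a for y in b]
        return (min(vals), max(vals))

    def meet(a, b):
        return (max(a[0], b[0]), min(a[1], b[1]))

    abs_err_x = (-9.09e-13, 9.09e-13)   # E_a(x) after x := i*i
    rel_err_x = (-1.11e-16, 1.11e-16)   # E_r(x), elementary relative error bound
    x_true    = (1.0, 2.0)              # range of x filtered by the test x <= 2.0

    print(meet(abs_err_x, imul(x_true, rel_err_x)))   # (-2.22e-16, 2.22e-16)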

Another simple example, which focuses on arithmetic operations, is taken from the introduction of [16]:

Example 2

We consider expression \(t / (t+1)\), where t is a double precision floating-point value in [0,999]. An error is committed when computing \(t+1\): the absolute error of \(t+1\) is bounded by \({\mathcal {E}}_{a}(t+1)=[-5.68 e^{-14},5.68 e^{-14}]\), the relative error by \({\mathcal {E}}_{r}(t+1)=[-1.1e^{-16},1.1e^{-16}]\). For comparison, the a posteriori evaluation of the relative error bounds from the absolute error bounds is

$$\begin{aligned} \frac{{\mathcal {E}}_{a}(t+1)}{t+1} = \frac{[-5.68 e^{-14},5.68 e^{-14}]}{[1,1000]}=[-5.68 e^{-14},5.68 e^{-14}] \end{aligned}$$

that is, 500 times larger than the direct estimate \({\mathcal {E}}_{r}(t+1)\). Thus, if the relative error is not explicitly propagated, some information is lost.

And indeed, as we will develop in Sect. 3.2, it is natural to express the absolute error on the division \(x \div y\) using the relative error on y. Actually, the bounds on the absolute error of \(t / (t+1)\) using this product are \([-1.67e^{-13},1.67e^{-13}]\), 340 times tighter than the bounds \([-5.68 e^{-11},5.68 e^{-11}]\) that would be obtained by classical error propagation relying only on absolute errors. On this example, the method of [16], which relies on a global optimisation of the error on subexpressions, is more accurate than our improved bounds, because we still suffer here from the conservativeness of the interval abstraction in the evaluation of our expressions. But their results come at the expense of much more expensive computations.

In both cases, we note that this combination of the propagation of the relative and absolute errors can, in the end, improve the absolute error bound, sometimes dramatically, while maintaining a very cheap analysis. The central idea of this work is indeed to observe that the information contained in the absolute and relative error bounds is complementary, and to propose an interval-based analysis computing an inexpensive reduced product that combines this information for the best final error bound estimates. The idea is experimented here on a reduced set of operators, ignoring the possibility of control flow divergences between the floating-point and the corresponding real computations, as a proof of concept. But the approach can naturally be extended to more operators, as well as to relational abstractions of values and errors, in order to enhance many existing error analyses. Additionally, using relational abstractions is necessary to handle with reasonable accuracy the errors due to control flow divergences.

Contents. After some background on floating-point arithmetic in Sect. 2, we introduce our abstraction in Sect. 3. In Sect. 4, we demonstrate that our analysis, implemented in the Frama-C platform, while being very efficient in time, also often compares favorably in accuracy to the generally much more expensive existing approaches of the state of the art [11, 16]. We use for this a set of benchmarks classically used to compare error analyses, extracted from FPBench.

Related Work. Abstract interpretation [2] is widely used for the analysis of floating-point computations. Most analyses dedicated to the propagation of error bounds in floating-point computations focus on absolute round-off error bounds. Existing abstractions for rounding errors are often based on intervals [8, 13] or affine forms [1, 4, 9, 10], as implemented in the analyzer Fluctuat [7]. The tool Gappa [6] relies on interval arithmetic and expression rewriting. It additionally generates a proof of the bounds it computes, which can be automatically checked with a proof assistant such as Coq. Some approaches combine these abstractions with optimization techniques to enhance the bounds on values and errors. The tool PRECiSA [17] relies on intervals, combined with branch-and-bound optimization and symbolic error computations using the Bernstein basis. It also generates proof certificates on the error bounds. Rosa [5] combines affine arithmetic with SMT solving. Real2float [12] also bounds absolute rounding errors using optimization techniques relying on semidefinite programming and sparse sums of squares certificates.

Some of the tools based on these methods provide the user with relative error bounds, but they are often a posteriori bounds, computed from the bounds on the absolute error. Direct relative error bounds are computed by FPTaylor [16], which formulates the problem of bounding errors as an optimization problem, using first-order Taylor approximations of arithmetic expressions. The optimization-based approach of FPTaylor has been extended in Daisy [11], which also relies on some of the techniques already present in Rosa [5] for value and absolute error estimates. In the present work, we propose a much less costly alternative to the direct estimate of relative errors, which we show still behaves very well on a number of classical benchmarks, and we demonstrate how this error can also be used to improve absolute error bounds. Compared to the related work, which uses optimization somewhat blindly, we make the interplay between the two types of errors explicit.

2 Floating-Point Arithmetic and Rounding Errors

2.1 Floating-Point Numbers and Rounding Errors

The floating-point representation of a real number x is defined by the IEEE 754 standard as the triple (sgn, sig, exp). In this triple, sgn corresponds to the sign of x, the significand sig has fixed size p and, for normalized numbers, is such that \(1 \le sig < 2\), and exp is the exponent. This representation is evaluated as \({(-1)}^{sgn} \times sig \times {2}^{exp}\). Denormalized numbers allow gradual underflow to zero. Their exponent is fixed, equal to \(e_{\min }\), and the significand is such that \(sig <1\).

Because of the finite size of the significand, a real value is represented by a rounded value. This rounding can be represented through the operator \(\mathrm {rnd} : \mathbb {R}\rightarrow \mathbb {F}\) that returns the closest floating-point number with respect to the rounding mode. Common rounding modes defined by the standard are rounding to nearest (ties to even), toward zero and toward \(\pm \infty \). In this work, we consider the classical case of rounding to nearest. The rounding operator is often modeled as:

$$\begin{aligned} \mathrm {rnd}(x) = x(1 + e_x) + d_x \end{aligned}$$
(1)

where \(|e_x| \le \epsilon _M\), \(|d_x| \le \delta _M\), \(e_x \times d_x = 0\), and \((\epsilon _M, \delta _M)\) are parameters fixed by the format (single, double or quad precision). Constant \(\epsilon _M\) is often called the machine epsilon and depends on the precision p of the floating-point numbers used. It is equal to the distance \(2^{1-p}\) between 1 and its floating-point successor, with \(p=24\) for float and \(p=53\) for double numbers. Constant \(\delta _M\) is the smallest denormalized number, equal to \(2^{e_{\min }+1-p}\), with \(e_{\min }=-127\) for float and \(e_{\min }=-1023\) for double numbers. In this model, \(d_x\) represents the absolute error committed when rounding to a denormalized floating-point number, while \(e_x\) is the relative error committed when rounding to a normalized floating-point number. They cannot both be non-zero at the same time, which is expressed by the condition \(e_x \times d_x = 0\).
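As a quick sanity check of model (1) in double precision, the following small Python sketch (ours, not part of any analyzer) verifies the combined bound \(|\mathrm {rnd}(x) - x| \le \epsilon _M |x| + \delta _M\) implied by (1) on a few exact rationals; \(\delta _M\) is taken here as the smallest positive denormalized double, which is a safe value for this check:

    from fractions import Fraction

    p = 53
    eps_M = Fraction(2) ** (1 - p)      # distance between 1 and its successor
    delta_M = Fraction(2) ** (-1074)    # smallest positive denormalized double

    def model_holds(x: Fraction) -> bool:
        # float() rounds the exact rational to the nearest double,
        # Fraction() then recovers that double exactly.
        rnd_x = Fraction(float(x))
        return abs(rnd_x - x) <= eps_M * abs(x) + delta_M

    print(all(model_holds(Fraction(1, q)) for q in range(1, 1000)))   # True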

This model can be refined. The normalized floating-point rounding error \(xe_x\) in (1) is actually bounded by the distance between two consecutive floating-point numbers around x. This distance can be expressed as \(\text{ ufp }\!\left( x\right) \epsilon _M\), using the notion of unit in the first place \(\text{ ufp }\!\left( \right) \) introduced in [15] and defined by:

$$\begin{aligned} \text{ ufp }\!\left( x\right) = {\left\{ \begin{array}{ll} 0 &{}\text {if } x = 0 \\ 2^{{\lfloor } {log_2 |x|} {\rfloor }} &{}\text {if } x \ne 0 \end{array}\right. } \end{aligned}$$
(2)

Function \(\text{ ufp }\!\left( \right) \) is piecewise constant: the result of \(\text{ ufp }\!\left( x\right) \) for \(|x| \in [2^n, 2^{n+1})\) is the constant \(2^n\). Using this definition, the gap \(xe_x\) between the real and its floating-point representation can be rewritten as \(\text{ ufp }\!\left( x\right) e_x\) and the rounding operator is now:

$$\begin{aligned} \mathrm {rnd}(x) = x + \text{ ufp }\!\left( x\right) e_x + d_x \end{aligned}$$
(3)
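A direct implementation of \(\text{ ufp }\!\left( \right) \) is straightforward; the Python sketch below (our own helper, assuming double precision) extracts the binade exactly with math.frexp and illustrates that the refined bound \(\text{ ufp }\!\left( x\right) \epsilon _M\) is tighter than \(|x| \epsilon _M\) whenever |x| is not a power of two:

    import math

    def ufp(x: float) -> float:
        # Unit in the first place: 2**floor(log2|x|), and 0 for x == 0.
        if x == 0.0:
            return 0.0
        _, e = math.frexp(x)       # x = m * 2**e with 0.5 <= |m| < 1
        return 2.0 ** (e - 1)

    eps_M = 2.0 ** -52
    x = 10.0
    # ufp(10) = 8, so the refined bound is tighter than |x| * eps_M.
    print(ufp(x) * eps_M, abs(x) * eps_M)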

Absolute and Relative Elementary Rounding Errors. We now define \(\varGamma _a(x)\) and \(\varGamma _r(x)\) the elementary absolute and relative rounding errors which occur when a real number x is rounded to its floating-point approximation \(\widetilde{x} = \mathrm {rnd}(x)\):

$$\begin{aligned} \varGamma _a(x) = \mathrm {rnd}(x) - x = \text{ ufp }\!\left( x\right) e_x + d_x \end{aligned}$$
(4)

The relative error is defined only when \(x \ne 0\):

$$\begin{aligned} \varGamma _r(x) = \dfrac{\mathrm {rnd}(x) - x}{x} = \dfrac{\text{ ufp }\!\left( x\right) e_x + d_x}{x} \end{aligned}$$
(5)

2.2 Arithmetic Operations

The IEEE-754 norm standardises some operations that are required to be exactly rounded (addition, subtraction, multiplication, division and square root): the result of the floating-point operation on real operands is the same as if the operation was performed in real numbers on the given inputs, then rounded. For every operation \(op : \mathbb {R}^k \rightarrow \mathbb {R}\) defined as exactly rounded, the corresponding floating-point operation \(\widetilde{op}\) can be expressed as:

$$\begin{aligned} \widetilde{op}(x_1, \ldots , x_k) = \mathrm {rnd}(op(x_1, \ldots , x_k)) \end{aligned}$$
(6)

The IEEE754-2008 revision additionally recommends that fifty additional operators are correctly rounded. We do not handle these operations in this work, but the approach developed here can be extended.

We now consider the propagation of errors through successive operations. We denote by \(\widetilde{x}\) the approximation of an idealized computation x. We thus define the absolute error due to the approximation by:

$$\begin{aligned} {\mathcal {E}}_{a}(x) = \widetilde{x} - x \end{aligned}$$

and the relative error, for \(x\ne 0\), by:

$$\begin{aligned} {\mathcal {E}}_{r}(x) = \dfrac{\widetilde{x} - x}{x} \end{aligned}$$

The absolute error on the result of an operation op on values \(\widetilde{x_1}, \ldots , \widetilde{x_k}\) which are already the approximations of some idealized values \(x_1, \ldots , x_k\) is defined by:

$$\begin{aligned} {\mathcal {E}}_{a}(op(x_1,\ldots , x_k)) = \widetilde{op} (\widetilde{x_1},\ldots , \widetilde{x_k}) - op(x_1,\ldots ,x_k) \end{aligned}$$

where for all \(i=1,\ldots ,k\), the approximated value \(\widetilde{x_i}\) is such that \( \widetilde{x_i} = x_i + {\mathcal {E}}_{a}(x_i) = x_i(1 + {\mathcal {E}}_{r}(x_i))\).

Using Eqs. (6) and (4), this can be rewritten:

$$\begin{aligned} {\mathcal {E}}_{a}(op(x_1,\ldots , x_k)) = op(\widetilde{x_1},\ldots , \widetilde{x_k}) + \varGamma _a(op(\widetilde{x_1},\ldots , \widetilde{x_k})) - op(x_1,\ldots ,x_k) \end{aligned}$$
(7)

The relative error is derived, when \(op(x_1,\ldots , x_k) \ne 0\):

$$\begin{aligned} {\mathcal {E}}_{r}(op(x_1,\ldots , x_k)) = \dfrac{{\mathcal {E}}_{a}(op(x_1,\ldots , x_k))}{op(x_1,\ldots , x_k)} \end{aligned}$$
(8)

2.3 Concrete Semantics

The concrete model is that of traditional numerical error analyses, and in particular of the static analysis of [9], which describes the difference in behavior between the execution of a program in real numbers and in floating-point numbers, along the floating-point execution flow. We consider in this work the analysis of a language with the operations \( \{+, -, \times , \div , \sqrt{} \}\), which are required to be exactly rounded by the IEEE 754 standard, conditional statements and loops. The concrete value that we compute for each program variable and control point of a program in this language is \((x,\widetilde{x},{\mathcal {E}}_{a}(x),{\mathcal {E}}_{r}(x))\), where:

  • \(\widetilde{x}\) is the result of the execution of the program in a floating-point semantics, until the control point of interest,

  • x is the result of the execution of the same sequence of arithmetic operations in a real semantics, ignoring the possibility of a control flow divergence due to rounding errors,

  • the errors \({\mathcal {E}}_{a}(x)\) and \({\mathcal {E}}_{r}(x)\) between the real and the floating-point executions are defined by Eqs. (7) and (8).

Conditional Statements and Unstable Tests. In this work, the path conditions are those of the floating-point executions. We thus ignore the possibility of unstable tests, where, for the same input values, the floating-point and the real-valued executions take different branches of a conditional statement. We simply issue a warning when this possibility is detected, as in, for instance, early versions of Fluctuat [7, 9]. In case an unstable test actually occurs, the analysis is possibly unsound, as the discontinuity error between the computations performed in the two branches should be considered as an additional error. Relational analyses are needed to estimate such discontinuity errors in a way that is not overly conservative, and this has been studied and implemented for instance in [5, 10]. But the problem is somewhat orthogonal to the interplay between relative and absolute errors considered here, and is also not considered in the most closely related work [11, 16], to which we compare our analysis in the section dedicated to experiments. We intend to handle unstable tests in the future, in a relational version of the present analysis.

3 Interval-Based Abstraction

Intervals [13] are used in many situations to compute rigorously with interval bounds instead of real numbers. Throughout the paper, intervals are typeset in boldface letters. Let \(\varvec{x} = \left[ \underline{x},\overline{x}\right] \) be such an interval, with bounds \(\underline{x} \le \overline{x}\), where \(\underline{x} \in \mathbb {R} \cup \{-\infty \}\) and \(\overline{x} \in \mathbb {R} \cup \{+\infty \}\). Interval arithmetic computes a bounding interval for each elementary operation by \(\varvec{x} \circ \varvec{y} = [\min _{x \in \varvec{x}, y \in \varvec{y}} \{x \circ y\}, \max _{x \in \varvec{x}, y \in \varvec{y}} \{x \circ y\}],\) where \(\circ \in \{+,-,\times ,\div \}\), and analogously for the square root. Intervals are the basis of one of the first and most widely used numerical abstract domains, the lattice of intervals [3].
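For illustration, a bare-bones version of such an interval arithmetic can be sketched in Python as follows (our own code, not the analyzer's; the outward rounding of bounds required for soundness is omitted for readability):

    from itertools import product

    class Interval:
        def __init__(self, lo: float, hi: float):
            assert lo <= hi
            self.lo, self.hi = lo, hi

        def _pointwise(self, other, op):
            vals = [op(a, b) for a, b in product((self.lo, self.hi),
                                                 (other.lo, other.hi))]
            return Interval(min(vals), max(vals))

        def __add__(self, o): return Interval(self.lo + o.lo, self.hi + o.hi)
        def __sub__(self, o): return Interval(self.lo - o.hi, self.hi - o.lo)
        def __mul__(self, o): return self._pointwise(o, lambda a, b: a * b)

        def __truediv__(self, o):
            assert not o.lo <= 0.0 <= o.hi, "division by an interval containing 0"
            return self._pointwise(o, lambda a, b: a / b)

        def meet(self, o):   # intersection, used later for the reduced product
            return Interval(max(self.lo, o.lo), min(self.hi, o.hi))

        def __repr__(self): return f"[{self.lo:.3g}, {self.hi:.3g}]"

    print(Interval(1.0, 2.0) * Interval(-3.0, 4.0))   # [-6, 8]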

In what follows, we propose an abstraction which relies on the lattice of intervals: we abstract by intervals \((\varvec{\widetilde{x}},\varvec{{\mathcal {E}}_{a}(x)},\varvec{{\mathcal {E}}_{r}(x)})\) the floating-point range and the absolute and relative errors. The errors are computed, on each control-flow path, under the assumption that the real and floating-point executions follow the same path. We deduce bounds for the value in the real-valued semantics by \(\varvec{x}=\varvec{\widetilde{x}}-\varvec{{\mathcal {E}}_{a}(x)}\). The abstract domain forms a complete lattice, fully relying on the lattice of intervals, with a join operator performed componentwise on the value and errors using the classical join operator on intervals.

The extremities of the intervals bounding the floating-point range are computed with the rounding mode of the computation we analyse (rounding to nearest). The other terms, which bound the errors and the real-valued range, are computed with outward rounding, in order to ensure a sound implementation.

3.1 Abstraction of the Elementary Rounding Errors

In this section, we define the abstractions \(\varGamma _a(\varvec{x})\) and \(\varGamma _r(\varvec{x})\) of the elementary rounding errors defined by (4) and (5). They will be used for the abstraction of the transfer functions in Sect. 3.2.

The terms \(e_x\) and \(d_x\) that appear in the elementary rounding errors are bounded respectively by \(\left[ -\epsilon _M,\epsilon _M\right] \) and \(\left[ -\delta _M,\delta _M\right] \). Additionally, we know that \(e_x\) and \(d_x\) cannot both be non-zero for the same x: if x is rounded to a normalized number, then \(d_x = 0\), and if x is rounded to a denormalized number, then \(e_x = 0\). We can thus compute the abstraction of the elementary absolute rounding error over an interval of real numbers as the union of the two cases:

$$\begin{aligned} \varGamma _a(\varvec{x}) = \text{ ufp }\!\left( \varvec{x}\right) \epsilon (\varvec{x}) \cup \delta (\varvec{x}) \end{aligned}$$
(9)

where \(\epsilon \) (resp. \(\delta \)) returns the interval \(\left[ -\epsilon _M,\epsilon _M\right] \) (resp. \(\left[ -\delta _M,\delta _M\right] \)) if its parameter contains at least a normalized (resp. denormalized) number and \(\left[ 0,0\right] \) otherwise. Moreover, as \(\text{ ufp }\!\left( \right) \) is increasing in the absolute value of its argument, we can abstract the rounding error on normalized numbers by

$$\begin{aligned} \text{ ufp }\!\left( \varvec{x}\right) \epsilon (\varvec{x}) \subseteq \text{ ufp }\!\left( \max (|\underline{x}|, |\overline{x}|)\right) \epsilon (\varvec{x}). \end{aligned}$$

Let us define \({{\mathrm{norm}}}(\varvec{x})\) and \({{\mathrm{denorm}}}(\varvec{x})\) that return respectively the subsets of normalized and denormalized numbers from interval \(\varvec{x}\). We can define the abstraction \(\varGamma _r(\varvec{x})\) of the elementary relative error, for any interval \(\varvec{x}\), as:

$$\begin{aligned} \varGamma _r(\varvec{x}) = \max _{x \in {{\mathrm{norm}}}(\varvec{x}), x \ne 0}\left| \frac{\text{ ufp }\!\left( x\right) }{x}\right| \left[ -\epsilon _M,\epsilon _M\right] \cup \max _{x \in {{\mathrm{denorm}}}(\varvec{x})} \frac{\left[ -\delta _M,\delta _M\right] }{|x|} \end{aligned}$$
(10)

Equation (10) will be used in Sect. 3.2 to derive relative error bounds even when interval \(\varvec{x}\) possibly contains zero. These error bounds are valid whenever the relative error is defined, that is, for every non-zero value in \(\varvec{x}\).

Let us first evaluate in (10) the error due to the rounding of normalized numbers. Consider x strictly positive (the negative case is symmetric); then we can write:

$$\begin{aligned} \left| \frac{\text{ ufp }\!\left( x\right) }{x}\right| = \dfrac{2^{exp}}{sig \times 2^{exp}} = \dfrac{1}{sig} \end{aligned}$$
(11)

Given \(sig \in [1,2)\), a simple abstraction of \(\left| \frac{\text{ ufp }\!\left( x\right) }{x}\right| \) is the interval \(\left( \frac{1}{2},1\right] \), and its maximum is always bounded by 1. However, we can slightly refine this estimate when there exists n such that \(|\varvec{x}| \subseteq [2^n,2^{n+1})\). This gives, when \(\varvec{x}\) does not contain 0:

$$\begin{aligned} \max _{x \in \varvec{x}}\left| \frac{\text{ ufp }\!\left( x\right) }{x}\right| = {\left\{ \begin{array}{ll} 1 / sig_{\min (|\underline{x}|,|\overline{x}|)} &{}\text {if } \exists n \in \mathbb {Z}, |\varvec{x}| \subseteq [2^n,2^{n+1}) \\ 1 &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(12)
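A possible implementation of (12), assuming double precision and with helper names of our own, again uses math.frexp to detect whether the whole range stays within one binade:

    import math

    def max_ufp_over_x(lo: float, hi: float) -> float:
        # Upper bound on |ufp(x)/x| over [lo, hi], assuming 0 is not in [lo, hi].
        assert lo <= hi and not lo <= 0.0 <= hi
        small, large = min(abs(lo), abs(hi)), max(abs(lo), abs(hi))
        m_s, e_s = math.frexp(small)    # small = m_s * 2**e_s, 0.5 <= m_s < 1
        _, e_l = math.frexp(large)
        if e_s == e_l:                  # whole range in one binade [2^n, 2^(n+1))
            return 1.0 / (2.0 * m_s)    # 1 / significand of the smallest |x|
        return 1.0

    print(max_ufp_over_x(5.0, 7.0))     # 0.8, since 5 = 1.25 * 2^2
    print(max_ufp_over_x(3.0, 6.0))     # 1.0, the range crosses a power of two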

Let us now consider the error due to denormalized numbers, if \(\varvec{x}\) contains any, again for x strictly positive. A positive denormalized number can be expressed as a multiple of \(\delta _M\), i.e. \(x = n\delta _M\) with \(n \in \mathbb {Z}\), and an absolute error of magnitude at most \(\delta _M\) can be committed. We thus abstract the relative error on denormalized numbers by:

$$\begin{aligned} \max _{x \in {{\mathrm{denorm}}}(\varvec{x})} \frac{\left[ -\delta _M,\delta _M\right] }{|x|} \subseteq [-1,1] \end{aligned}$$
(13)

3.2 Transfer Functions for Arithmetic Operations

Let us now study the transfer functions for each operation in \( \{+, -, \times , \div , \sqrt{} \}\).

First, the floating-point range \(\varvec{\widetilde{x}}\) is abstracted classically in interval arithmetic. Then the absolute and relative error bounds are computed as described in this section, by an interval abstraction of (7) and (8). Finally, bounds for the value in real-valued semantics are deduced by \(\varvec{x}=\varvec{\widetilde{x}}-\varvec{{\mathcal {E}}_{a}(x)}\).

Let us first state that after each operation yielding a result \(\varvec{z}=op(\varvec{x},\varvec{y})\), we perform a reduced product of the absolute and relative errors (Reduction):

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {E}}_{a}(\varvec{z}) = {\mathcal {E}}_{a}(\varvec{z}) \cap {\mathcal {E}}_{r}(\varvec{z}) \varvec{z} \\ {\mathcal {E}}_{r}(\varvec{z}) = {\mathcal {E}}_{r}(\varvec{z}) \cap \dfrac{{\mathcal {E}}_{a}(\varvec{z})}{\varvec{z}} \text{ whenever } 0 \notin \varvec{z} \end{array}\right. } \end{aligned}$$
(14)

We will see that, in particular for the division and the square root, the two types of errors are more tightly coupled than through this reduction alone. Indeed, some formulations which are equivalent on real numbers yield different levels of conservativeness once abstracted. It is thus important to carefully state the precise expression of the propagation of errors through arithmetic operations. We detail below, for each arithmetic operation, propagation rules that provide a sound abstraction while reducing the wrapping effect due to the use of intervals:

Lemma 1

(Addition and Subtraction)

$$\begin{aligned} {\mathcal {E}}_{a}(\varvec{x} \pm \varvec{y}) = \left( {\mathcal {E}}_{a}(\varvec{x}) \pm {\mathcal {E}}_{a}(\varvec{y})\right) + \varGamma _a(\varvec{\widetilde{x}} \pm \varvec{\widetilde{y}}) \end{aligned}$$
(15)

The relative error is defined only when \(0 \notin \varvec{x} \pm \varvec{y}\). In this case, we have:

$$\begin{aligned} {\mathcal {E}}_{r}(\varvec{x} \pm \varvec{y}) = \left\{ \begin{array}{lr} \left( \dfrac{{\mathcal {E}}_{r}(\varvec{x}) - {\mathcal {E}}_{r}(\varvec{y})}{1 \pm \varvec{y} / \varvec{x}} + {\mathcal {E}}_{r}(\varvec{y}) \right) \left( 1 + \varGamma _r(\varvec{\widetilde{x}} \pm \varvec{\widetilde{y}}) \right) + \varGamma _r(\varvec{\widetilde{x}} \pm \varvec{\widetilde{y}}) &{}\text{ if } 0 \notin \varvec{x} \\ \left( \dfrac{{\mathcal {E}}_{r}(\varvec{y}) - {\mathcal {E}}_{r}(\varvec{x})}{1 \pm \varvec{x} / \varvec{y}} + {\mathcal {E}}_{r}(\varvec{x}) \right) \left( 1 + \varGamma _r(\varvec{\widetilde{x}} \pm \varvec{\widetilde{y}})\right) + \varGamma _r(\varvec{\widetilde{x}} \pm \varvec{\widetilde{y}}) &{}\text{ if } 0 \notin \varvec{y} \end{array} \right. \end{aligned}$$
(16)

When neither \(\varvec{x}\) nor \(\varvec{y}\) includes zero, the relative error can be computed as the intersection of the two estimates in (16).

Proof

The propagation of the absolute error corresponds to a classical absolute rounding error analysis, starting from Eq. (7) instantiated for the addition and subtraction: for all \(x \in \varvec{x}\), \(y \in \varvec{y}\), \(\widetilde{x} = x + {\mathcal {E}}_{a}(x) \in \varvec{\widetilde{x}}\), \(\widetilde{y} = y + {\mathcal {E}}_{a}(y) \in \varvec{\widetilde{y}}\),

$$\begin{aligned} {\mathcal {E}}_{a}(x \pm y)&= (\tilde{x} \pm \tilde{y}) + \varGamma _a(\tilde{x} \pm \tilde{y}) - (x \pm y) \\&= \left( (x + {\mathcal {E}}_{a}(x)) \pm (y + {\mathcal {E}}_{a}(y)) \right) + \varGamma _a(\tilde{x} \pm \tilde{y}) - (x \pm y) \\&= ({\mathcal {E}}_{a}(x) \pm {\mathcal {E}}_{a}(y)) + \varGamma _a(\tilde{x} \pm \tilde{y}) \end{aligned}$$

Abstracting this result in intervals, we get Eq. (15), which defines \({\mathcal {E}}_{a}(\varvec{x} \pm \varvec{y})\) as an interval over-approximation of \(\{ {\mathcal {E}}_{a}(x \pm y), x \in \varvec{x}, y \in \varvec{y} \}.\)

Note that we would naturally also get a sound over-approximation of \({\mathcal {E}}_{a}(\varvec{x} \pm \varvec{y})\) by directly computing in intervals \((\varvec{\widetilde{x}} \pm \varvec{\widetilde{y}}) + \varGamma _a(\varvec{\widetilde{x}} \pm \varvec{\widetilde{y}}) - (\varvec{x} \pm \varvec{y})\). However, the result would be very conservative in general, because interval arithmetic does not handle correlations. We thus derive, for each arithmetic operation, error formulas in real numbers by reorganizing terms into an equivalent expression with fewer variable repetitions. We then abstract the final expression in intervals.

For any binary operation \(\circ \), for all \(x \in \varvec{x}\), \(y \in \varvec{y}\) such that \(x \circ y \ne 0\), for all \(\widetilde{x} = x + {\mathcal {E}}_{a}(x) \), \(\widetilde{y} = y + {\mathcal {E}}_{a}(y)\), we can compute the relative error as

$$\begin{aligned} {\mathcal {E}}_{r}(x \circ y)&= \dfrac{(\tilde{x} \circ \tilde{y}) + \varGamma _a(\tilde{x} \circ \tilde{y}) - (x \circ y)}{x \circ y} \\&= \dfrac{(\tilde{x} \circ \tilde{y}) + (\tilde{x} \circ \tilde{y})\varGamma _r(\tilde{x} \circ \tilde{y}) - (x \circ y)}{x \circ y} \end{aligned}$$

from which we have:

$$\begin{aligned} {\mathcal {E}}_{r}(x \circ y) = \dfrac{\tilde{x} \circ \tilde{y}}{x \circ y}(1 + \varGamma _r(\tilde{x} \circ \tilde{y})) - 1 \end{aligned}$$
(17)

We can deduce:

$$\begin{aligned} {\mathcal {E}}_{r}(x \pm y)&= \dfrac{x({\mathcal {E}}_{r}(x) + 1) \pm y(1 + {\mathcal {E}}_{r}(y))}{x \pm y} (1 + \varGamma _r(\tilde{x} \pm \tilde{y})) - 1 \\&= \dfrac{x{\mathcal {E}}_{r}(x) \pm y{\mathcal {E}}_{r}(y)}{x \pm y} (1 + \varGamma _r(\tilde{x} \pm \tilde{y})) + \varGamma _r(\tilde{x} \pm \tilde{y}) \end{aligned}$$

It is interesting to reformulate this expression in order to remove variable repetitions as much as possible. This reduces the loss of correlation when the expression is evaluated in interval arithmetic. For \(x\ne 0\), we can write:

$$\begin{aligned} {\mathcal {E}}_{r}(x \pm y)&= \left( \frac{x {\mathcal {E}}_{r}(x) - x {\mathcal {E}}_{r}(y)}{x \pm y} + \frac{x {\mathcal {E}}_{r}(y) \pm y {\mathcal {E}}_{r}(y)}{x \pm y} \right) (1 + \varGamma _r(\tilde{x} \pm \tilde{y})) + \varGamma _r(\tilde{x} \pm \tilde{y}) \\&= \left( \frac{{\mathcal {E}}_{r}(x) - {\mathcal {E}}_{r}(y)}{1 \pm \frac{y}{x}} + {\mathcal {E}}_{r}(y) \right) (1 + \varGamma _r(\tilde{x} \pm \tilde{y})) + \varGamma _r(\tilde{x} \pm \tilde{y}) \end{aligned}$$

A symmetric transformation can be done, exchanging x and y, which allows us to conclude with Eq. (16), after abstraction in intervals.

We note that the addition is the operation for which propagating the relative error is the least natural, and thus some accuracy loss can be expected.

Lemma 2

(Multiplication)

$$\begin{aligned} {\mathcal {E}}_{a}(\varvec{x} \times \varvec{y})&= \varvec{x}{\mathcal {E}}_{a}(\varvec{y}) + \varvec{y}{\mathcal {E}}_{a}(\varvec{x}) + {\mathcal {E}}_{a}(\varvec{x}){\mathcal {E}}_{a}(\varvec{y}) + \varGamma _a(\varvec{\widetilde{x}} \times \varvec{\widetilde{y}}) \end{aligned}$$
(18)
$$\begin{aligned} {\mathcal {E}}_{r}(\varvec{x} \times \varvec{y})&= ({\mathcal {E}}_{r}(\varvec{x}) + 1)({\mathcal {E}}_{r}(\varvec{y}) + 1)(1 + \varGamma _r(\varvec{\widetilde{x}} \times \varvec{\widetilde{y}}) ) - 1 \end{aligned}$$
(19)

Proof

As for the addition, the expression of the absolute error propagated by the multiplication is quite natural, and corresponds to a classical absolute rounding error analysis: for all \(x \in \varvec{x}\), \(y \in \varvec{y}\), \(\tilde{x} = x + {\mathcal {E}}_{a}(x)\), \(\tilde{y} = y + {\mathcal {E}}_{a}(y)\),

$$\begin{aligned} {\mathcal {E}}_{a}(x \times y)&= (\tilde{x} \times \tilde{y}) + \varGamma _a(\tilde{x} \times \tilde{y}) - (x \times y) \\&= (x + {\mathcal {E}}_{a}(x))(y + {\mathcal {E}}_{a}(y)) + \varGamma _a(\tilde{x} \times \tilde{y}) - (x \times y) \\&= x {\mathcal {E}}_{a}(y) + y {\mathcal {E}}_{a}(x) + {\mathcal {E}}_{a}(x){\mathcal {E}}_{a}(y) + \varGamma _a(\tilde{x} \times \tilde{y}) \end{aligned}$$

Starting from Eq. (17), we obtain an expression of the propagated relative error that naturally involves the relative errors on the operands:

$$\begin{aligned} {\mathcal {E}}_{r}(x \times y)&= \frac{\tilde{x} \times \tilde{y}}{x \times y} (1 + \varGamma _r(\tilde{x} \times \tilde{y})) - 1 \\&= \frac{x({\mathcal {E}}_{r}(x) + 1) \times y ({\mathcal {E}}_{r}(y) + 1)}{x \times y} (1 + \varGamma _r(\tilde{x} \times \tilde{y})) - 1 \\&= ({\mathcal {E}}_{r}(x) + 1)({\mathcal {E}}_{r}(y) + 1)(1 + \varGamma _r(\tilde{x} \times \tilde{y}) ) - 1 \end{aligned}$$

This propagation of relative errors should be quite accurate, as we could remove correlations to values of x and y.

$$\begin{aligned}&\texttt {x = a + 3 * b ;} \\&\texttt {y = c + 3 * d ;} \\&\texttt {z = x * y ;} \end{aligned}$$

Example 3

We consider a very simple example to exemplify our analysis. We have 4 input floating-point variables a, b, c, d, given respectively in the ranges [0,1], [1,2], [0,1], and [1,2]. All these inputs are supposed to be known exactly, with no rounding error, i.e. \(\forall \varvec{v} \in \{\varvec{a}, \varvec{b}, \varvec{c}, \varvec{d}\}, {\mathcal {E}}_{a}(\varvec{v}) = {\mathcal {E}}_{r}(\varvec{v}) = \left[ 0,0\right] \). We now consider the errors committed on the computations that define the floating-point variables x, y, z. For the sake of demonstration, we spell out here the computation of errors in terms of the machine epsilon \(\epsilon _M\). In order to evaluate the error on variable x, we first evaluate 3b: its range of values is \(3\varvec{b}=[3,6]\), and the errors are \({\mathcal {E}}_{r}(3\varvec{b}) = \varGamma _r(3\varvec{b}) = \left[ -\epsilon _M,+\epsilon _M\right] \) and \({\mathcal {E}}_{a}(3\varvec{b}) = \varGamma _a(3\varvec{b}) = \left[ -4\epsilon _M,+4\epsilon _M\right] \). The errors on x that result from the addition are obtained using (15) and (16):

$$\begin{aligned} {\mathcal {E}}_{a}(\varvec{x}) = {\mathcal {E}}_{a}(3\varvec{b}) + \varGamma _a(\left[ 3,7\right] ) = \left[ -8\epsilon _M,+8\epsilon _M\right] \end{aligned}$$
$$\begin{aligned} {\mathcal {E}}_{r}(\varvec{x})&= \left( \frac{{\mathcal {E}}_{r}(3\varvec{b}) - {\mathcal {E}}_{r}(\varvec{a})}{1 + a / 3\varvec{b}} + {\mathcal {E}}_{r}(\varvec{a})\right) (\varGamma _r(\left[ 3,7\right] ) + 1) + \varGamma _r(\left[ 3,7\right] ) \\&= \frac{\left[ -\epsilon _M,+\epsilon _M\right] }{\left[ 1,4/3\right] } (\left[ -\epsilon _M,+\epsilon _M\right] + 1) +\left[ -\epsilon _M,+\epsilon _M\right] = \left[ -2\epsilon _M,+2\epsilon _M\right] + O(\epsilon _M^2) \end{aligned}$$

Note that the bounds for the relative errors are \(\frac{4}{3}\) times better than the a posteriori estimate \({\mathcal {E}}_{a}(\varvec{x}) / \varvec{x} = \left[ -\frac{8}{3}\epsilon _M,+\frac{8}{3}\epsilon _M\right] \). The errors on y are computed similarly. Finally, we can deduce absolute error bounds for z using (18):

$$\begin{aligned} {\mathcal {E}}_{a}(\varvec{z})&= \varvec{x}{\mathcal {E}}_{a}(\varvec{y}) + \varvec{y}{\mathcal {E}}_{a}(\varvec{x}) + {\mathcal {E}}_{a}(\varvec{x}){\mathcal {E}}_{a}(\varvec{y}) + \varGamma _a(\varvec{x} \times \varvec{y}) \\&= 2 \times \left[ 3,7\right] \left[ -8\epsilon _M,+8\epsilon _M\right] + \left[ -8\epsilon _M,+8\epsilon _M\right] \left[ -8\epsilon _M,+8\epsilon _M\right] + \varGamma _a(\left[ 9,49\right] ) \\&= \left[ -144\epsilon _M,+144\epsilon _M\right] + O(\epsilon _M^2) \end{aligned}$$

and relative error bounds using (19):

$$\begin{aligned} {\mathcal {E}}_{r}(\varvec{z})&= ({\mathcal {E}}_{r}(\varvec{x}) + 1)({\mathcal {E}}_{r}(\varvec{y}) + 1)(\varGamma _r(\varvec{x} \times \varvec{y}) + 1) - 1 \\&= (\left[ -2\epsilon _M,+2\epsilon _M\right] +1) (\left[ -2\epsilon _M,+2\epsilon _M\right] +1) (\varGamma _r(\left[ 9,49\right] ) + 1) - 1 + O(\epsilon _M^2) \\&= \left[ -5\epsilon _M,+5\epsilon _M\right] + O(\epsilon _M^2) \end{aligned}$$

Neglecting here for simplicity the second order errors, the reduced product yields an estimate for the relative error approximately 3.2 times better than the a posteriori estimate \( {\mathcal {E}}_{a}(\varvec{z}) / [9,49]\).
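The bounds on z above can be replayed mechanically; the following self-contained Python sketch (again with our own toy interval helpers, neglecting outward rounding) applies (18) and (19) to the intermediate error bounds derived in the text:

    eps = 2.0 ** -52                     # machine epsilon used in the example

    def imul(a, b):
        vals = [x * y for x in a for y in b]
        return (min(vals), max(vals))

    def iadd(a, b):
        return (a[0] + b[0], a[1] + b[1])

    x_rng = y_rng = (3.0, 7.0)           # value ranges of x and y
    Ea_x = Ea_y = (-8 * eps, 8 * eps)    # absolute errors derived above
    Er_x = Er_y = (-2 * eps, 2 * eps)    # relative errors derived above
    Ga_z = (-32 * eps, 32 * eps)         # Gamma_a([9,49]) = ufp(49) * eps_M
    Gr_z = (-eps, eps)                   # Gamma_r([9,49])
    one  = (1.0, 1.0)

    # (18): E_a(z) = x E_a(y) + y E_a(x) + E_a(x) E_a(y) + Gamma_a
    Ea_z = iadd(iadd(iadd(imul(x_rng, Ea_y), imul(y_rng, Ea_x)),
                     imul(Ea_x, Ea_y)), Ga_z)
    # (19): E_r(z) = (E_r(x)+1)(E_r(y)+1)(1+Gamma_r) - 1
    prod = imul(imul(iadd(Er_x, one), iadd(Er_y, one)), iadd(Gr_z, one))
    Er_z = (prod[0] - 1.0, prod[1] - 1.0)

    print(Ea_z[1] / eps, Er_z[1] / eps)  # about 144 and 5, up to O(eps^2) terms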

Lemma 3

(Division). The division is defined whenever \(0 \notin \varvec{y}\), by:

$$\begin{aligned} {\mathcal {E}}_{a}(\varvec{x} \div \varvec{y})&= \frac{{\mathcal {E}}_{a}(\varvec{x}) - \varvec{x}{\mathcal {E}}_{r}(\varvec{y})}{\varvec{\widetilde{y}}} + \varGamma _a(\varvec{\widetilde{x}} \div \varvec{\widetilde{y}}) \end{aligned}$$
(20)
$$\begin{aligned} {\mathcal {E}}_{r}(\varvec{x} \div \varvec{y})&= \frac{{\mathcal {E}}_{r}(\varvec{x}) + 1}{{\mathcal {E}}_{r}(\varvec{y}) + 1} (1 + \varGamma _r(\varvec{\widetilde{x}} \div \varvec{\widetilde{y}})) - 1 \end{aligned}$$
(21)

Proof

The division is a case where the coupling between the computation of the two errors is built in: the absolute error on \(x \div y\) naturally involves the absolute error on x and the relative error on y:

$$\begin{aligned} {\mathcal {E}}_{a}(x \div y)&= (\tilde{x} \div \tilde{y}) + \varGamma _a(\tilde{x} \div \tilde{y}) - (x \div y) \\&= \frac{y\tilde{x} - \tilde{y}x}{y\tilde{y}} + \varGamma _a(\tilde{x} \div \tilde{y}) \\&= \frac{y(x + {\mathcal {E}}_{a}(x)) - (y + {\mathcal {E}}_{a}(y))x}{y \tilde{y}} + \varGamma _a(\tilde{x} \div \tilde{y}) \\&= \frac{{\mathcal {E}}_{a}(x) - x{\mathcal {E}}_{r}(y)}{\tilde{y}} + \varGamma _a(\tilde{x} \div \tilde{y}) \end{aligned}$$

This is thus an operation where propagating tight relative error bounds on the operands proves useful to tighten the absolute error bounds on the result.

$$\begin{aligned} {\mathcal {E}}_{r}(x \div y)&= \frac{\tilde{x} \div \tilde{y}}{x \div y} (1 + \varGamma _r(\tilde{x} \div \tilde{y})) - 1 \\&= \frac{x({\mathcal {E}}_{r}(x) + 1)}{y({\mathcal {E}}_{r}(y) + 1)} \times \frac{y}{x} \times (1+ \varGamma _r(\tilde{x} \div \tilde{y})) - 1 \\&= \frac{{\mathcal {E}}_{r}(x) + 1}{{\mathcal {E}}_{r}(y) + 1} (1 + \varGamma _r(\tilde{x} \div \tilde{y})) - 1 \end{aligned}$$

As for the multiplication, we note that the propagation of relative errors should generally be quite accurate. Note also that, as for the multiplication, the relative error bounds are defined even when \(0 \in \varvec{x} \div \varvec{y}\), as long as they are defined for \(\varvec{x}\) and \(\varvec{y}\). The bound will be valid for all nonzero values in \(\varvec{x} \div \varvec{y}\). This is exemplified in Example 4.

Example 4

Let us come back to Example 2 of the introduction. An error is committed when computing \(t+1\): the absolute error of \(t+1\) is bounded by \({\mathcal {E}}_{a}(t+1)=[-5.68 e^{-14},5.68 e^{-14}]\), the relative error by \({\mathcal {E}}_{r}(t+1)=[-1.1e^{-16},1.1e^{-16}]\). Using Eq. (20) to bound the absolute error of the division, we obtain:

$$\begin{aligned} {\mathcal {E}}_{a}\left( \frac{\varvec{t}}{\varvec{t}+1} \right)&= - \frac{\varvec{t}{\mathcal {E}}_{r}(\varvec{t}+1)}{\varvec{\widetilde{t}}+1} + \varGamma _a\left( \frac{\varvec{\widetilde{t}}}{\varvec{\widetilde{t}}+1} \right) \\&=\frac{[0,999][-1.1e^{-16},1.1e^{-16}]}{[1,1000]} + [-5.68 e^{-14},5.68 e^{-14}] \\&=[-1.67e^{-13},1.67e^{-13}] \end{aligned}$$

If only absolute error bounds were available, we would replace \(\varvec{x}{\mathcal {E}}_{r}(\varvec{y})\) by \(\varvec{x}{\mathcal {E}}_{a}(\varvec{y})/ \varvec{y}\) in Eq. (20) and obtain the absolute error analysis used classically. We would then obtain as absolute error bound on \(\varvec{t} / (\varvec{t+1})\):

$$\begin{aligned} \frac{[0,999]\times [-5.68 e^{-14},5.68 e^{-14}]}{ {[1,1000]}^2} + [-5.68 e^{-14},5.68 e^{-14}] = [-5.68 e^{-11},5.68 e^{-11}] \end{aligned}$$

which is 340 times larger than the absolute error bound computed using our reduced product.
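For completeness, the worst case of this bound is recovered by a two-line computation (our own sketch, using \({\mathcal {E}}_{a}(t)=0\) and the bounds quoted above):

    t_hi, er_t1, ga_div = 999.0, 1.1e-16, 5.68e-14
    # Worst case of (20): |t * E_r(t+1)| maximal, denominator t~+1 minimal.
    bound = t_hi * er_t1 / 1.0 + ga_div
    print((-bound, bound))               # about (-1.67e-13, 1.67e-13)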

The relative error on \(\varvec{t} / (\varvec{t+1})\) is bounded where it is defined, that is, for all \(t \ne 0\), by:

$$\begin{aligned} {\mathcal {E}}_{r}\left( \frac{\varvec{t}}{\varvec{t}+1} \right)&= \frac{{\mathcal {E}}_{r}(\varvec{t}) + 1}{{\mathcal {E}}_{r}(\varvec{t+1}) + 1} \left( 1 + \varGamma _r\left( \frac{\tilde{t}}{\tilde{t}+1}\right) \right) - 1 \\&= \frac{1}{[-1.1e^{-16},1.1e^{-16}]+1}\left( 1+ \varGamma _r\left( \frac{\tilde{t}}{\tilde{t}+1}\right) \right) -1 \end{aligned}$$

For \(\varvec{t} = [0,999]\), both \(\varvec{t}\) and \(\varvec{t} / (\varvec{t+1})\) contain zero. As t is an input, its relative error \({\mathcal {E}}_{r}(\varvec{t})\) is zero when it is defined, that is, for all \(t \ne 0\). The elementary relative error \(\varGamma _r\) is also defined for all \(t \ne 0\), and is bounded by the maximum relative error when using denormalized floating-point numbers around 0, given by (13): for all \(t \ne 0, \; \varGamma _r(\frac{t}{t+1}) \subseteq [-1,1]\). Thus

$$\begin{aligned} {\mathcal {E}}_{r}\left( \frac{\varvec{t}}{\varvec{t}+1}\right)&= \frac{1}{[-1.1e^{-16},1.1e^{-16}]+1}\left( 1+ [-1,1]\right) -1&\subseteq 1+ [-1,1] -1 =[-1,1] \end{aligned}$$

The square root is an operation where the relative error on the operand naturally appears in the propagation:

Lemma 4

(Square root)

$$\begin{aligned} {\mathcal {E}}_{a}(\sqrt{\varvec{x}})&= \sqrt{\varvec{x}}(\sqrt{1+{\mathcal {E}}_{r}(\varvec{x})} - 1) + \varGamma _a(\sqrt{\varvec{\widetilde{x}}}) \end{aligned}$$
(22)
$$\begin{aligned} {\mathcal {E}}_{r}(\sqrt{\varvec{x}})&= \sqrt{1+{\mathcal {E}}_{r}(\varvec{x})} (\varGamma _r(\sqrt{\varvec{\widetilde{x}}}) + 1) - 1 \end{aligned}$$
(23)

Proof

In order to avoid the loss of correlation, it is natural to factorize \(\sqrt{x}\):

$$\begin{aligned} {\mathcal {E}}_{a}(\sqrt{x})&= \sqrt{\widetilde{x}} + \varGamma _a(\sqrt{\widetilde{x}}) - \sqrt{x} \\&= \sqrt{x} (\sqrt{1+{\mathcal {E}}_{r}(x)} -1) + \varGamma _a(\sqrt{\widetilde{x}}) \end{aligned}$$

The expression of the relative error is deduced immediately.

$$\begin{aligned} {\mathcal {E}}_{r}(\sqrt{x})&= \frac{\sqrt{\tilde{x}}}{\sqrt{x}} (\varGamma _r(\sqrt{\tilde{x}}) + 1) - 1 \\&= \sqrt{\frac{x({\mathcal {E}}_{r}(x) + 1)}{x}} (\varGamma _r(\sqrt{\tilde{x}}) + 1) - 1 \\&= \sqrt{{\mathcal {E}}_{r}(x) + 1} (\varGamma _r(\sqrt{\tilde{x}}) + 1) - 1 \end{aligned}$$
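A tiny numeric illustration of why (23) tends to be accurate (our own sketch): for a small incoming relative error e, \(\sqrt{1+e}-1\) is close to e/2, so the square root roughly halves the relative error of its operand before adding its own elementary rounding error.

    import math

    e = 1e-10                            # a small relative error on the operand
    print(math.sqrt(1 + e) - 1, e / 2)   # both about 5e-11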

Using similar developments, it is possible to handle more functions. We chose in this work to focus on the main arithmetic operations, whose error bounds have long been specified by the IEEE 754 standard.

3.3 Handling Conditional Statements

Interpretation of Conditional Expressions. Let \(\gamma (x_1, \ldots , x_n)\) be a conditional expression defined by \(f(x_1, \ldots , x_n) \diamond b\), with \(\diamond \in \{ <, >, =, \le , \ge \}\). Let us denote by \(\varvec{\widetilde{x}}_\gamma \) the interval abstraction of the floating-point value of variable x, after transformation by the interpretation of conditional \(\gamma \). It is computed by the classical backward constraint propagation on intervals, which filters out the values of \(\varvec{\widetilde{x}}\) that do not satisfy the constraint. Note that, as already discussed in the section devoted to the concrete semantics, we consider the path condition only on the floating-point value. The bound on the relative error is left unchanged by the interpretation of constraints:

$$\begin{aligned} {\mathcal {E}}_{r}(\varvec{x}_{\gamma }) = {\mathcal {E}}_{r}(\varvec{x}) \end{aligned}$$
(24)

In a classical error analysis, that is, with no information about the relative error, the absolute error bounds are also left unchanged by the interpretation of this conditional: \({\mathcal {E}}_{a}(\varvec{x}_{\gamma }) = {\mathcal {E}}_{a}(\varvec{x})\). When available, the relative error bounds can be used to reduce the absolute error whenever the range of values \(\varvec{x}_{\gamma }\) has been reduced compared to \(\varvec{x}\) by constraint propagation:

$$\begin{aligned} {\mathcal {E}}_{a}(\varvec{x}_{\gamma }) = {\mathcal {E}}_{a}(\varvec{x}) \cap \varvec{x}_{\gamma } {\mathcal {E}}_{r}(\varvec{x}). \end{aligned}$$
(25)

Example 5

In Example 1 of the introduction, the multiplication x:=i*i results in \(\varvec{x}=[1,10000]\) with an elementary absolute rounding error \({\mathcal {E}}_{a}(x)=[-9.09e^{-13},9.09e^{-13}]\), and a relative error bound \( {\mathcal {E}}_{r}(x)=[-1.11e^{-16},1.11e^{-16}]\). The constraint \(\texttt {x <=2}\) yields \(\varvec{x_{(x \le 2)}}=[1,2]\) in the true branch. We can then reduce the absolute error bound on x in this branch (and thus on y) to \({\mathcal {E}}_{a}(\varvec{y}) = [1,2][-1.11e^{-16},1.11e^{-16}]\), as already stated in this introductory example.

Join and Widening. Joining values coming from different branches of the analyzed program requires a join, or upper bound, operator on abstract values. The join can be performed componentwise on the value, relative and absolute errors, relying on the classical join on intervals. Naturally, this relies on the hypothesis that there is no unstable test. Similarly, a widening can be defined componentwise, relying on any widening operator on intervals.

4 Implementation and Experimental Evaluation

We have implemented this approach as a new abstract domain called Numerors in the Abstract Interpretation plug-in Eva of the verification platform Frama-C. Frama-C provides a collection of plug-ins that perform static analysis, deductive verification, and testing, for safety- and security-critical software. These plug-ins can cooperate thanks to their integration on top of a shared kernel and shared data structures, along with their compliance with a common specification language.

In what follows, we evaluate our approach by comparing the error bounds obtained by our tool against the state of the art tools Fluctuat [9], Daisy [11] and FPTaylor [16], on a set of representative benchmark examples.

Selection and Description of the Benchmarks. The examples are mostly extracted from the FPBench suite for comparing verification tools on floating-point programs, to which we have added 4 examples of our own. Our selection has been guided by the will to keep a reasonably small set of examples, while including most classes of examples previously studied with the tools of the related work, Daisy and FPTaylor. We excluded the examples containing calls to mathematical functions, like transcendentals, that we do not handle, variations of the same examples that did not show a different behavior, and examples for which the inputs were not fully specified. Finally, we modified the inputs of some examples so that all tools compute non-trivial relative error bounds.

Table 1. Tools comparison on the absolute errors

The first four examples were written to highlight some features that were not well represented in FPBench, in particular programs that include conditional statements (the first three examples), and a program with square roots (the fourth). Benchmark log_approx computes an approximation of the logarithm of the square of its input, using a loop and a Taylor expansion. Benchmark conditional_ex is Example 1, and conditional_1 is similar but with more computation. Finally, sqrt_1 computes the function \(\sqrt{2x+3}/(2\sqrt{x}+3)\). The remaining examples come from the FPBench suite. Example complex_sqrt belongs to the Herbie [14] suite. Examples from intro_example to test05_nonlin1_test2 come from the FPTaylor test suite, and intro_example and sec4_example are specifically used in [16] to introduce their technique. The other examples come either from the Rosa [5] or the FPTaylor test suites, and have already been used as benchmarks for both absolute and relative errors [11, 16].

Table 2. Tools comparison on the relative errors

Methodology of Comparison. For each example, absolute error bounds on the output of interest are presented in Table 1, relative error bounds in Table 2. Table 3 presents the running times for each tool, on the complete set of benchmarks.

In Tables 1 and 2, underestimates of the errors are given in the first column, denoted Under-Approx. They are obtained with Daisy, by taking the maximum of the errors observed over runs on 100000 random input values. The second column, denoted Numerors in all result tables, presents the results of the approach presented in this work. Fluctuat in its Intervals mode is used in the third column of Table 1 as a witness of the results obtained with a classical interval absolute error analysis. The results of the corresponding a posteriori computation of the relative error are presented in the third column of Table 2. This also corresponds to the results of our tool Numerors on absolute errors, without the reduced product. In the fourth column of Table 1, we give the results of Fluctuat in its affine arithmetic based relational mode. We do not report the corresponding results in Table 2, as the relative error is only computed a posteriori and does not bring much new information. Finally, Daisy and FPTaylor are state-of-the-art references for the computation of relative error bounds, relying on optimisation techniques. Daisy provides many possible options, with different evaluation modes both for values and errors, that combine interval and affine arithmetic based estimates with SMT. We included a representative set, providing different trade-offs between efficiency and accuracy. Mode Daisy 1 is:

[Daisy options for this mode not reproduced here.]

Mode Daisy 2 is:

[Daisy options for this mode not reproduced here.]

Finally, mode Daisy 3, which is dedicated to the relative error (and provides no absolute error bound, hence is not included in Table 1), is:

[Daisy options for this mode not reproduced here.]

Our understanding is that Daisy 3 corresponds to [11]. Note that Daisy does not currently handle conditional statements, so we give no estimate for these examples. We can specify conditionals as constraints to FPTaylor; however, this is inconvenient for loops, so we do not give any estimate for log_approx.

Results. Let us first comment on the timing results presented in Table 3, which correspond, for each tool, to the sum of the times spent on each of the examples it could analyze. Fluctuat, which is dedicated to error analysis, is the fastest, even in its affine mode: the examples here being all quite simple, affine forms scale well. While our abstract domain should theoretically be comparable to Fluctuat Intervals, it is here 10 times slower, partially because the Frama-C core on which we rely is less specialized. Finally and most importantly, we want to stress that Numerors is drastically more efficient than Daisy in any of its modes, or than FPTaylor, which rely on SMT or optimization. We compare ourselves to the results given by these tools, not because we aim at being more accurate, but in order to demonstrate that we often manage to come close to, or even beat in precision, these more costly approaches, while keeping very low analysis costs.

Table 3. Times comparison

On each line, the result in boldface corresponds to the best estimate, and the one in italics to the second best.

Numerors does not manage to bound the relative error on four examples: log_approx, test01_sum3, test03_nonlin2 and test04_dqmom9. The reason is that it finds that the range of the variable of interest includes zero for all these examples, and some additions or subtractions were involved whose result range includes zero. Daisy and FPTaylor also cannot bound the relative error on three of these examples, for the same reason. On the fourth (test01_sum3), Daisy and FPTaylor get finite bounds. This is because the actual range does not include zero, but a relational analysis is needed to infer this. Note that the reduced product would allow us to compute relative error bounds again on further computations (if their result ranges do not include zero). Finally, on example intro_example, which corresponds to Example 2, the range of the result also contains zero, and Numerors produces an error bound of 1, while Daisy and FPTaylor do not produce relative error bounds. This bound of 1 is valid for all non-zero values of the range, and corresponds to the maximum error bound on denormalized numbers, as detailed in Example 4.

One obvious strength of our analysis concerns the interpretation of conditional statements. This is highlighted by the first three examples, where we are much better on absolute errors than FPTaylor, and comparable in relative error.

On the remaining benchmarks, our experiments often show that FPTaylor obtains the most accurate results, both in absolute and relative errors. However, for the absolute error, Numerors is still the best in 2 cases out of 25, and second best in 10 more cases. For the relative error, considering only the 23 examples for which at least one tool finds a finite bound, it is best on 4 examples, sometimes even spectacularly so, and second best on 12 examples. Naturally, it is also always at least as good as, and often considerably better than, Fluctuat in its interval mode. It is also often better than Fluctuat in its affine mode, when the loss of correlation due to intervals is not too important.

5 Conclusion and Future Work

We have demonstrated how a simple interval-based reduced product of absolute and relative error bounds can greatly improve the absolute error bounds, at a very low additional cost. A further interest of such an analysis is to make better use of these relative error bounds: indeed, some undesired phenomena of floating-point arithmetic, such as catastrophic cancellation, are best detected using relative error bounds.

Another possibility offered by this analysis is to refine error bounds by local subdivisions of the range of variables. Partitioning the range of some well-chosen program variables is an approach used in most tools, which often allows them to considerably narrow value and error estimates. But this approach is costly, and it usually has to be performed globally on the program. We believe that relative error bounds can be used to enhance the quality of results using much cheaper local subdivisions, following an idea similar to the interpretation of conditionals: the relative error bounds can be used to reduce the absolute error bounds, given some additional constraint on the value that corresponds to a local subdivision of the range. This refinement can be used to improve an error estimate locally, on demand. An additional advantage of local compared to global subdivision is that it can be performed on any quantity, and not only on input variables.

Finally, a natural extension of this work, which we intend to investigate in the future, is its combination with relational abstractions such as affine forms. A simple way to do this is to adapt the analysis presented here, using a relational abstraction to bound the values and errors. But we expect that more refined interactions would further improve the strength of the analysis. Using a relational abstraction will also allow us to accurately compute discontinuity errors between branches, and thus handle possible control flow divergences.