
1 Introduction

Issues related to numerical accuracy are almost as old as computer science. An important step towards the design of more reliable numerical software was the definition, in the 1980’s, of the IEEE754 Standard for floating-point arithmetic [2]. Since then, work has been carried out to determine the accuracy of floating-point computations by dynamic [3, 17, 29] or static [11, 13, 14] methods. This work has also been motivated by a few disasters due to numerical bugs [1, 15].

While existing approaches may differ strongly from each other in their way of determining accuracy, they share a common objective: to compute approximations of the errors on the outputs of a program, depending on the initial errors on the data and on the roundoff of the arithmetic operations performed during the execution. The present work focuses on a slightly different problem concerning the relation between precision and accuracy. Here, the term precision refers to the number of bits used to represent a value, i.e. its format, while the term accuracy denotes a bound on the absolute error \(|x-\hat{x}|\) between the represented value \(\hat{x}\) and the exact value x that we would have in exact arithmetic.

We address the problem of determining the minimal precision on the inputs and on the intermediary results of a program performing floating-point computations in order to ensure a desired accuracy on the outputs. This allows compilers to select the most appropriate formats (for example IEEE754 half, single, double or quad formats [2, 23]) for each variable. It is then possible to save memory, reduce CPU usage and use less bandwidth for communications whenever distributed applications are concerned. So, the choice of the best floating-point formats is an important compile-time optimization in many contexts. Our approach is also easily generalizable to fixed-point arithmetic, for which it is important to determine data formats, for example in FPGAs [12, 19].

The first originality of our approach is to combine a forward and a backward static analysis, done by abstract interpretation [8, 9]. The forward analysis is classical. It safely propagates the errors on the inputs and on the results of the intermediary operations in order to determine the accuracy of the results. Next, based on the results of the forward analysis and on assertions indicating which accuracy the user wants for the outputs at some control points, the backward analysis computes the minimal precision needed for the inputs and intermediary results in order to satisfy the assertions. Not surprisingly, the forward and backward analyses can be applied repeatedly and alternately in order to refine the results until a fixed-point is reached.

The second originality of our approach is to express the forward and backward transfer functions as a set of constraints made of propositional logic formulas and relations between affine expressions over integers (and only integers). Indeed, these relations remain linear even if the analyzed program contains non-linear computations. As a consequence, these constraints can be easily checked by an SMT solver (we use Z3 in practice [4, 21]). The advantage of the solver appears in the backward analysis, when one wants to determine the precision of the operands a and b of some binary operation in order to obtain a certain accuracy on the result. In general, it is possible to use a more precise a with a less precise b or, conversely, a more precise b with a less precise a. Because this choice arises at almost every operation, there is a huge number of combinations of admissible formats for all the data ensuring a given accuracy on the results. Instead of using an ad-hoc heuristic, we encode our problem as a set of constraints and we let a well-known, optimized solver generate a solution.

This article is organized as follows. We briefly introduce some elements of floating-point arithmetic, a motivating example and related work in Sect. 2. Our abstract domain as well as the forward and backward transfer functions are introduced in Sect. 3. The constraint generation is presented in Sect. 4 and experimental results are given in Sect. 5. Finally, Sect. 6 concludes.

2 Preliminary Elements

In this section we introduce some preliminary notions helpful to understand the rest of the article. Elements of floating-point arithmetic are introduced in Sect. 2.1. Further, an illustration of what our method does is given in Sect. 2.2. Related work is discussed in Sect. 2.3.

2.1 Elements of Floating-Point Arithmetic

We introduce here some elements of floating-point arithmetic [2, 23]. First of all, a floating-point number x in base \(\beta \) is defined by

$$\begin{aligned} \small x = s \cdot (d_0.d_1\ldots d_{p-1}) \cdot \beta ^e = s\cdot m \cdot \beta ^{e-p+1} \end{aligned}$$
(1)

where \(s\in \{-1,1\}\) is the sign, \(m=d_0d_1\ldots d_{p-1}\) is the significand with \(0\le d_i< \beta \) for \(0\le i\le p-1\), p is the precision and e is the exponent, \(e_{min}\le e \le e_{max}\).

A floating-point number x is normalized whenever \(d_0\not = 0\). Normalization avoids multiple representations of the same number. The IEEE754 Standard also defines denormalized numbers which are floating-point numbers with \(d_0=d_1=\ldots =d_k=0, k< p-1\) and \(e=e_{min}\). Denormalized numbers make underflow gradual [23]. The IEEE754 Standard defines binary formats (with \(\beta =2\)) and decimal formats (with \(\beta =10\)). In this article, without loss of generality, we only consider normalized numbers and we always assume that \(\beta =2\) (which is the most common case in practice). The IEEE754 Standard also specifies a few values for \(p, e_{min}\) and \(e_{max}\) which are summarized in Fig. 1. Finally, special values also are defined: nan (Not a Number) resulting from an invalid operation, \(\pm \infty \) corresponding to overflows, and \(+0\) and \(-0\) (signed zeros).

Fig. 1. Basic binary IEEE754 formats.
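To make the notation of Eq. (1) concrete, the following small Python sketch recovers s, m and e for a binary64 value (\(p=53\), \(\beta =2\)). It is only an illustration of the encoding and is not part of the analysis described in this article.

```python
# Sketch: recovering s, m and e of Eq. (1) for a binary64 value (p = 53, beta = 2).
import math
from fractions import Fraction

x = -3.5
s = -1 if x < 0 else 1
m_frac, e2 = math.frexp(abs(x))   # abs(x) = m_frac * 2**e2 with 0.5 <= m_frac < 1
e = e2 - 1                        # shift so that the significand reads d0.d1..., d0 = 1
m = int(m_frac * 2**53)           # integer significand on p = 53 bits
assert Fraction(s * m, 2**52) * Fraction(2)**e == Fraction(x)   # x = s.m.2^(e-p+1)
```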

The IEEE754 Standard also defines five rounding modes for elementary operations over floating-point numbers. These modes are towards \(-\infty \), towards \(+\infty \), towards zero, to the nearest ties to even and to the nearest ties to away and we write them \(\circ _{-\infty }, \circ _{+\infty }, \circ _0, \circ _{\sim _e}\) and \(\circ _{\sim _a}\), respectively. The semantics of the elementary operations \(\diamond \in \{+,\ -,\ \times ,\ \div \}\) is then defined by

$$\begin{aligned} \small f_1\ \diamond _{\circ }\ f_2\ =\ \circ (f_1\ \diamond \ f_2) \end{aligned}$$
(2)

where \(\circ \in \{\circ _{-\infty },\circ _{+\infty },\circ _{0},\circ _{\sim _e},\circ _{\sim _a}\}\) denotes the rounding mode. Equation (2) states that the result of a floating-point operation \(\diamond _\circ \) done with the rounding mode \(\circ \) returns what we would obtain by performing the exact operation \(\diamond \) and next rounding the result using \(\circ \). The IEEE754 Standard also specifies how the square root function must be rounded in a similar way to Eq. (2) but does not specify the roundoff of other functions like sin, log, etc.

We introduce hereafter two functions which compute the unit in the first place and the unit in the last place of a floating-point number. These functions are used further in this article to generate constraints encoding the way roundoff errors are propagated throughout computations. The \(\mathsf {ufp}\) of a number x is

$$\begin{aligned} \small \mathsf {ufp}(x) = \min \big \{ i\in \mathbb {N}\ :\ 2^{i+1} > x\big \} = \lfloor \log _2(x)\rfloor . \end{aligned}$$
(3)

The \(\mathsf {ulp}\) of a floating-point number whose significand has size p is defined by

$$\begin{aligned} \small \mathsf {ulp}(x) = \mathsf {ufp}(x) - p+1. \end{aligned}$$
(4)

The \(\mathsf {ufp}\) of a floating-point number corresponds to the binary exponent of its most significant digit. Conversely, the \(\mathsf {ulp}\) of a floating-point number corresponds to the binary exponent of its least significant digit. Note that several definitions of the \(\mathsf {ulp}\) have been given [22].
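For illustration, the two functions can be sketched in a few lines of Python (for positive inputs, and returning exponents as integers); the helper names are ours.

```python
import math

def ufp(x: float) -> int:
    """Unit in the first place, Eq. (3): exponent of the most significant bit of x > 0."""
    return math.floor(math.log2(x))

def ulp(x: float, p: int) -> int:
    """Unit in the last place, Eq. (4), for a number whose significand has p bits."""
    return ufp(x) - p + 1

# Example: 3.0 = 1.1 * 2^1, so ufp(3.0) = 1; with p = 16, ulp(3.0, 16) = -14.
print(ufp(3.0), ulp(3.0, 16))
```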

Fig. 2. Top left: initial program. Top right: annotations after analysis. Bottom left: forward analysis (one iteration). Bottom right: backward analysis (one iteration).

2.2 Overview of Our Method

Let us consider the program of Fig. 2 which implements a simple linear filter. At each iteration t of the loop, the output \(\texttt {y}_t\) is computed as a function of the current input \(\texttt {x}_t\) and of the values \(\texttt {x}_{t-1}\) and \(\texttt {y}_{t-1}\) of the former iteration. Our program contains several annotations. First, the statement on the last line of the code informs the system that the programmer wants to have 10 accurate binary digits on \(\texttt {y}_t\) at this control point. In other words, writing \(\texttt {y}_t = d_0.d_1\ldots d_n\cdot 2^e\) for some \(n\ge 10\), the absolute error between the value v that \(\texttt {y}_t\) would have if all the computations were done with real numbers and the floating-point value \(\hat{v}\) of \(\texttt {y}_t\) is bounded by \(2^{e-9}\): \(| v - \hat{v}| \le 2^{e-9}\).

Note that accuracy is not an intrinsic property of a number: it is a measure of how closely a particular floating-point number matches some ideal true value. For example, using the base \(\beta =10\) for the sake of simplicity, the floating-point value 3.149 represents \(\pi \) with an accuracy of 3. It itself has a precision of 4. It represents the real number 3.14903 with an accuracy of 4.

An abstract value \([a,b]_p\) represents the set of floating-point values with p accurate bits ranging from a to b. For example, in the code of Fig. 2, the variables \(\texttt {x}_{t-1}\) and \(\texttt {x}_{t}\) are initialized to the abstract value \([1.0,3.0]_{16}\) thanks to the annotation [1.0,3.0]#16. Let \(\mathbb {F}_p\) be the set of all floating-point numbers with accuracy p. This means that, compared to the exact value v computed in infinite precision, the value \(\hat{v}=d_0.d_1 \ldots d_n\cdot 2^e\) of \(\mathbb {F}_p\) is such that \(|v-\hat{v}|\le 2^{e-p+1}\). By definition, using the function \(\mathsf {ufp}\) introduced in Eq. (3), for any \(x\in \mathbb {F}_p\) the roundoff error \(\varepsilon (x)\) on x is bounded by \(\small \varepsilon (x) < 2^{\mathsf {ulp}(x)} = 2^{\mathsf {ufp}(x)-p+1}\). Concerning the abstract values, intuitively we have the concretization function

$$\begin{aligned} \small \gamma ([a,b]_p)= \{ x\in \mathbb {F}_p\ :\ a \le x \le b \}. \end{aligned}$$
(5)

These abstract values are special cases of the values used in other work [18] in the sense that, in the present framework, the errors attached to floating-point numbers have the form \([-2^u,2^u]\) for some integer u instead of arbitrary intervals with real bounds. Restricting the form of the errors enables one to simplify drastically the transfer functions for the backward analysis and the generation of constraints in Sect. 4. In this article, we focus on the accuracy of computations and we omit other problems related to runtime errors [3, 5]. In particular, overflows are not considered and we assume that any number with p accurate digits belongs to \(\mathbb {F}_p\). In practice, a static analysis computing the ranges of the variables and rejecting programs which possibly contain overflows is done before our analysis.

Fig. 3. Example of forward addition: 3.0#16 + 1.0#16 = 4.0#17.

In our example, \(\texttt {x}_t\) and \(\texttt {x}_{t-1}\) belong to \([1.0,3.0]_{16}\) which means, by definition, that these variables have a value \(\hat{v}\) ranging in [1.0, 3.0] and such that the error between \(\hat{v}\) and the value v that we would have in the exact arithmetic is bounded by \(2^{\mathsf {ufp}(x)-15}\). Typically, in this example, this information would come from the specification of the sensor related to x. By default, the values for which no accuracy annotation is given (for instance the value of \(\texttt {y}_{t-1}\) in the example of Fig. 2) are considered as exact numbers rounded to the nearest in double precision. In this format numbers have 53 bits of significand (see Fig. 1). The last bit being rounded, these numbers have 52 accurate bits in our terminology and, consequently, by default values belong to \(\mathbb {F}_{52}\) in our framework. Based on the accuracy of the inputs, our forward analysis computes the accuracy of all the other variables and expressions. The program in the left bottom corner of Fig. 2 displays the result of the forward analysis on the first iteration of the loop. Let \(\overset{\rightarrow }{\oplus }\) denote the forward addition (all the operations used in the current example are formally defined in Sect. 3). For example, the result of \(\texttt {x}_t+\texttt {x}_{t-1}\) has 16 accurate digits since

$$\begin{aligned} \small \overset{\rightarrow }{\oplus }(1.0_{16},1.0_{16})=2.0_{16}\quad \text {and}\quad \overset{\rightarrow }{\oplus }(3.0_{16},3.0_{16})=6.0_{16}. \end{aligned}$$

This is illustrated in Fig. 3 where we consider the addition of such values at the bit level. For the result of the addition between intervals, we take the most pessimistic accuracy computed for the bounds: \(\min (16,16)=16\).
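The bit-level reasoning of Fig. 3 can be reproduced with the integer formula detailed later in Sect. 4.1. The following Python sketch is our own illustration, with hypothetical helper names; it computes the forward accuracy of an addition from the values and accuracies of its operands.

```python
import math

def ufp(x: float) -> int:
    return math.floor(math.log2(x))

def forward_add_accuracy(x: float, p: int, y: float, q: int) -> int:
    """Accuracy of x#p + y#q, following the integer rule of Sect. 4.1 (Eq. (20))."""
    ex, ey = ufp(x) - p, ufp(y) - q      # exponents bounding the errors on x and y
    carry = 1 if ex == ey else 0         # two equal-weight errors may add one extra bit
    return ufp(x + y) - max(ex, ey) - carry

print(forward_add_accuracy(3.0, 16, 1.0, 16))   # Fig. 3: 3.0#16 + 1.0#16 = 4.0#17
print(forward_add_accuracy(1.0, 16, 1.0, 16),   # bound computations of the interval
      forward_add_accuracy(3.0, 16, 3.0, 16))   # addition: both yield 16 accurate bits
```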

The backward analysis is performed after the forward analysis and takes advantage of the accuracy requirement at the end of the code (see the right bottom corner of Fig. 2 for an unfolding of the backward analysis on the first iteration of the loop). Since, in our example, only 10 bits are required for \(\texttt {y}_t\), the result of the addition \(\texttt {u}+\texttt {v}\) also needs only 10 accurate bits. By combining this information with the result of the forward analysis, it is then possible to lower the number of bits needed for one of the operands. This is done by the backward addition \(\overset{\leftarrow }{\oplus }\), applied for example to \(\texttt {x}_{t}+\texttt {x}_{t-1}\) in the assignment of v.

Conversely to the forward function, the backward interval function keeps the largest accuracy arising in the computation of the bounds.
Fig. 4. Final program with generated data types for the example of Fig. 2.

By proceeding similarly on all the elementary operations and after computing the loop fixed point, we obtain the final result of the analysis displayed in the top right corner of Fig. 2. This information may be used to determine the most appropriate data type for each variable and operation, as shown in Fig. 4. To obtain this result we generate a set of constraints corresponding to the forward and backward transfer functions for the operations of the program. There exist several ways to handle a backward operation: when the accuracy on the inputs x and y computed by the forward analysis is too large with respect to the desired accuracy on the result, one may lower the accuracy of either x or y or both.

Since this question arises at each binary operation, we would face a huge number of combinations if we decided to enumerate all possibilities. Instead, we generate a disjunction of constraints corresponding to the minimization of the accuracy of each operand and we let the solver search for a solution. The control flow of the program is also encoded with constraints. For a sequence of statements, we relate the accuracy of the former statements to the accuracy of the latter ones. Each variable x has three parameters: its forward, backward and final accuracy, denoted \(\mathsf {acc}_F(x)\), \(\mathsf {acc}_B(x)\) and \(\mathsf {acc}(x)\) respectively. We must always have

$$\begin{aligned} \small 0\le \mathsf {acc}_B(x) \le \mathsf {acc}(x)\le \mathsf {acc}_F(x). \end{aligned}$$
(6)

For the forward analysis, the accuracy of some variable may decrease when passing to the next statement (we may only weaken the pre-conditions). Conversely, in the backward analysis, the accuracy of a given variable may increase when we jump to a former statement in the control graph (the post-conditions may only be strengthened). For a loop, we relate the accuracy of the variables at the beginning and at the end of the body, in a standard way.

The key point of our technique is to generate simple constraints made of propositional logic formulas and of affine expressions among integers (even if the floating-point computations in the source code are non-linear). A static analysis computing safe ranges at each control point is performed before our accuracy analysis. Then the constraints depend on two kinds of integer parameters: the ufp of the values and their accuracies \(\mathsf {acc}_F, \mathsf {acc}_B\) and \(\mathsf {acc}\). For instance, given control points \(\ell _1, \ell _2\) and \(\ell _3\), consider the set C of constraints generated for \(\texttt {3.0\#16}^{\ell _1}+^{\ell _3} \texttt {1.0\#16}^{\ell _2}\), assuming that we require 10 accurate bits for the result: C relates the forward and backward accuracies \(\mathsf {acc}_F\) and \(\mathsf {acc}_B\) at \(\ell _1\), \(\ell _2\) and \(\ell _3\).

For the sake of conciseness, the constraints corresponding to Eq. (6) have been omitted in C. For example, for the forward addition, the accuracy \(\mathsf {acc}_F(\ell _3)\) of the result is the number of bits between \(\mathsf {ufp}(3.0+1.0)=2\) and the \(\mathsf {ufp}\) u of the error which is

$$\begin{aligned} \small u&=\max \big (\mathsf {ufp}(3.0)-\mathsf {acc}_F(\ell _1),\mathsf {ufp}(1.0)-\mathsf {acc}_F(\ell _2)\big )+i\\ &=\max \big (1-\mathsf {acc}_F(\ell _1),0-\mathsf {acc}_F(\ell _2)\big )+i, \end{aligned}$$

where \(i=0\) or \(i=1\) depending on some condition detailed later. The constraints generated for each kind of expression and command are detailed in Sect. 4.
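As an illustration, such a constraint set can be stated for Z3 through its Python bindings. The sketch below is our own rendering, not the tool's output, and it assumes that \(i=1\) exactly when the two error exponents coincide; the precise condition is given with the constraint rules of Sect. 4.

```python
# Sketch: the example constraints for 3.0#16 +^l3 1.0#16 with a 10-bit requirement.
from z3 import Int, If, Solver, sat

accF1, accF2, accF3 = Int('accF_l1'), Int('accF_l2'), Int('accF_l3')
accB3, acc3 = Int('accB_l3'), Int('acc_l3')

s = Solver()
s.add(accF1 == 16, accF2 == 16)                  # accuracies of the two constants
e1, e2 = 1 - accF1, 0 - accF2                    # ufp(3.0) = 1, ufp(1.0) = 0
u = If(e1 >= e2, e1, e2) + If(e1 == e2, 1, 0)    # ufp of the error on the sum
s.add(accF3 == 2 - u)                            # ufp(3.0 + 1.0) = 2
s.add(accB3 == 10)                               # accuracy required on the result
s.add(0 <= accB3, accB3 <= acc3, acc3 <= accF3)  # Eq. (6)

assert s.check() == sat
print(s.model())   # e.g. accF_l3 = 17 and any acc_l3 between 10 and 17
```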

2.3 Related Work

Several approaches have been proposed to determine the best floating-point formats as a function of the expected accuracy on the results. Darulova and Kuncak use a forward static analysis to compute the propagation of errors [11]. If the computed bound on the accuracy satisfies the post-conditions then the analysis is run again with a smaller format until the best format is found. Note that in this approach, all the values have the same format (contrarily to our framework where each control-point has its own format). While Darulova and Kuncak develop their own static analysis, other static techniques [13, 29] could be used to infer from the forward error propagation the suitable formats. Chiang et al. [7] have proposed a method to allocate a precision to the terms of an arithmetic expression (only). They use a formal analysis via Symbolic Taylor Expansions and error analysis based on interval functions. In spite of our linear constraints, they solve a quadratically constrained quadratic program to obtain annotations.

Other approaches rely on dynamic analysis. For instance, the Precimonious tool tries to decrease the precision of variables and checks whether the accuracy requirements are still fulfilled [24, 27]. Lam et al. instrument binary codes in order to modify their precision without modifying the source codes [16]. They also propose a dynamic search method to identify the pieces of code where the precision should be modified. Finally, another related research axis concerns the compile-time optimization of programs in order to improve the accuracy of the floating-point computation in function of given ranges for the inputs, without modifying the formats of the numbers [10, 26].

3 Abstract Semantics

In this section, we give a formal definition of the abstract domain and transfer functions presented informally in Sect. 2. The domain is defined in Sect. 3.1 and the transfer functions are given in Sect. 3.2.

3.1 Abstract Domain

Let \( \mathbb {F}_p \) be the set of floating-point numbers with accuracy p (we assume that the error between \(x\in \mathbb {F}_p\) and the value that we would have in the exact arithmetic is less than \(2^{\mathsf {ufp}(x)-p+1}\)) and let \(\mathbb {I}_p\) be the set of all intervals of floating-point numbers with accuracy p. As mentioned in Sect. 2.2, we assume that no overflow arises during our analysis and we do not specify the lower and upper bounds of \(\mathbb {F}_p\). An element \(i^{\sharp }\in \mathbb {I}_p\), denoted \( i^{\sharp }=[\underline{f},\overline{f}]_{p} \), is then defined by two floating-point numbers and an accuracy p. We have

$$\begin{aligned} \small \mathbb {I}_p \ni [\underline{f},\overline{f}]_{p} =\{f\in \mathbb {F}_p:\underline{f}\le f \le \overline{f}\}\ \text {and}\ \mathbb {I}= \bigcup _{p\in \mathbb {N}}\mathbb {I}_p. \end{aligned}$$
(7)

Our abstract domain is the complete lattice \(\mathcal {D}^\sharp =\langle \mathbb {I},\sqsubseteq ,\sqcup ,\sqcap ,\bot _\mathbb {I},\top _\mathbb {I}\rangle \) where elements are ordered by \(\small [a,b]_p \sqsubseteq [c,d]_q \Longleftrightarrow [a,b]\subseteq [c,d]\ \text {and}\ q\le p\). In other words, \([a,b]_p\) is more precise than \([c,d]_q\) if it is an included interval with a greater accuracy. Let \(\circ _{r,m}(x)\) denote the rounding of x at precision r using the rounding mode m. Then the join and meet operators are defined by

$$\begin{aligned} \small [a,b]_p \sqcup [c,d]_q = [\circ _{r,-\infty }(u),\circ _{r,+\infty }(v)]_r\ \text {with}\ r=\min (p,q),\ [u,v] =[a,b]\cup [c,d] \end{aligned}$$
(8)

and

$$\begin{aligned} \small [a,b]_p \sqcap [c,d]_q = [u,v]_r\ \text {with}\ r=\max (p,q),\ [u,v]=[a,b]\cap [c,d]. \end{aligned}$$
(9)

In addition, we have \(\small \bot _\mathbb {I} = \emptyset _{+\infty }\ \text {and}\ \top _\mathbb {I} = [-\infty ,+\infty ]_0 \) and we have \([a,b]_p \sqcap [c,d]_q = \bot _\mathbb {I}\) whenever \([a,b]\cap [c,d]=\emptyset \). Let \(\alpha :\wp (\mathbb {F})\rightarrow \mathbb {I} \) be the abstraction function which maps a set X of floating-point numbers with different accuracies \(p_i\), \(1\le i\le n\), to a value of \(\mathbb {I}\). Let \(x_{min}=\min (X)\), \(x_{max}=\max (X)\) and let \( p =\min \ \{ q\ :\ x\in X \ \text {and}\ x\in \mathbb {F}_q\}\) be the minimal accuracy in X. We have,

$$\begin{aligned} \small \alpha (X)=[\circ _{p,-\infty }\big (\min (X)\big ),\circ _{p,+\infty }\big (\max (X)\big )]_p\quad \text {where}\ p=\min \ \{q\ :\ X\cap \mathbb {F}_q\not = \emptyset \}. \end{aligned}$$
(10)

Let \( \gamma : \mathbb {I}\rightarrow \wp (\mathbb {F}) \) and \( i^{\sharp }=[a,b]_{p} \). The concretization function \( \gamma (i^{\sharp }) \) is defined as:

$$\begin{aligned} \small \gamma (i^{\sharp })=\bigcup _{q\ge p}\{ x\in \mathbb {F}_q\ : a\le x\le b\}. \end{aligned}$$
(11)

The functions \(\alpha \) and \(\gamma \) of Eqs. (10) and (11) define a Galois connection between \(\langle \wp (\mathbb {F}),\subseteq \rangle \) and \(\langle \mathbb {I},\sqsubseteq \rangle \) [8].

3.2 Transfer Functions

In this section, we introduce the forward and backward transfer functions for the abstract domain \(\mathcal {D}^\sharp \) of Sect. 3.1. These functions are defined using the unit in the first place of a floating-point number introduced in Sect. 2.1. First, we introduce the forward transfer functions corresponding to the addition and product of two floating-point numbers \(x\in \mathbb {F}_p\) and \(y\in \mathbb {F}_q\). The addition and product are defined by

$$\begin{aligned} \small \overset{\rightarrow }{\oplus }(x_p,y_q) = (x + y)_r \ \text {where}\ r = \mathsf {ufp}(x+y)-\mathsf {ufp}\big (\varepsilon (x_p)+\varepsilon (y_q)\big ), \end{aligned}$$
(12)
$$\begin{aligned} \small \overset{\rightarrow }{\otimes }(x_p,y_q) = (x \times y)_r \ \text {where}\ r = \mathsf {ufp}(x\times y) - \mathsf {ufp}\big (y\cdot \varepsilon (x_p)+x\cdot \varepsilon (y_q)+ \varepsilon (x_p)\cdot \varepsilon (y_q) \big ). \end{aligned}$$
(13)

In Eqs. (12) and (13), \(x + y\) and \(x \times y\) denote the exact sum and product of the two values. In practice, these exact operations must be computed with enough precision in order to ensure that the result has accuracy r, for example by using more precision than the accuracy of the inputs. The errors on the addition and product may be bounded by \(e_+=\varepsilon (x_p)+\varepsilon (y_q)\) and \(e_\times =y\cdot \varepsilon (x_p)+x\cdot \varepsilon (y_q)+ \varepsilon (x_p)\cdot \varepsilon (y_q)\), respectively. Then the most significant bits of the errors have weights \(\mathsf {ufp}(e_+)\) and \(\mathsf {ufp}(e_\times )\) and the accuracies of the results are \(\mathsf {ufp}(x + y)-\mathsf {ufp}(e_+)\) and \(\mathsf {ufp}(x \times y)-\mathsf {ufp}(e_\times )\), respectively.
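Evaluated on exact rationals, Eqs. (12) and (13) can be checked directly. In the sketch below (ours), the error terms are set to their worst-case magnitude \(2^{\mathsf {ufp}(x)-p+1}\) for \(p=q=16\).

```python
from fractions import Fraction
import math

def ufp(x) -> int:
    return math.floor(math.log2(x))

def fwd_acc_add(x, eps_x, y, eps_y) -> int:
    return ufp(x + y) - ufp(eps_x + eps_y)                          # Eq. (12)

def fwd_acc_mul(x, eps_x, y, eps_y) -> int:
    return ufp(x * y) - ufp(y * eps_x + x * eps_y + eps_x * eps_y)  # Eq. (13)

x, y = Fraction(3), Fraction(1)
eps_x, eps_y = Fraction(1, 2**15), Fraction(1, 2**16)  # worst-case errors for accuracy 16
print(fwd_acc_add(x, eps_x, y, eps_y))   # 17, as in Fig. 3
print(fwd_acc_mul(x, eps_x, y, eps_y))   # 15
```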

Fig. 5. Forward and backward transfer functions for the addition and product on \(\mathcal {D}^\sharp \).

We introduce now the backward transfer functions \(\overset{\leftarrow }{\oplus }\) and \(\overset{\leftarrow }{\otimes }\). We consider the operation between \(x_p\) and \(y_q\) whose result is \(z_r\). Here, \(z_r\) and \(y_q\) are known while \(x_p\) is unknown. We have

$$\begin{aligned} \small {\overset{\leftarrow }{\oplus }(z_r,y_q) = (z-y)_{p}\ \text {where}\ p = \mathsf {ufp}(z-y) - \mathsf {ufp}\big (\varepsilon (z_r)-\varepsilon (y_q) \big ),} \end{aligned}$$
(18)
$$\begin{aligned} \small {\overset{\leftarrow }{\otimes }(z_r,y_q) = (z \div y)_{p}\ \text {where}\ p = \mathsf {ufp}(z\div y) - \mathsf {ufp}\left( \frac{y\cdot \varepsilon (z_r)-z\cdot \varepsilon (y_q)}{ y\cdot \left( y+\varepsilon (y_q) \right) }\right) .} \end{aligned}$$
(19)

The correctness of the backward product relies on the following arguments. Let \(\varepsilon (x), \varepsilon (y)\) and \(\varepsilon (z)\) be the exact errors on xy and z respectively. We have \(\small \varepsilon (z) = x\cdot \varepsilon (y)+y\cdot \varepsilon (x)+\varepsilon (x)\cdot \varepsilon (y) \) and then \(\small \varepsilon (x)\cdot \left( y+\varepsilon (y)\right) = \varepsilon (z)-x\cdot \varepsilon (y)=\varepsilon (z)-\frac{z}{y}\cdot \varepsilon (y). \) Finally, we conclude that \(\small \varepsilon (x)=\frac{y\cdot \varepsilon (z_r)-z\cdot \varepsilon (y_q)}{ y\cdot \left( y+\varepsilon (y_q) \right) }\).
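This derivation can be checked numerically on exact rationals; the values below are arbitrary and serve only as a sanity check of the relation between \(\varepsilon (x)\), \(\varepsilon (y)\) and \(\varepsilon (z)\).

```python
from fractions import Fraction as F

x, y = F(3, 7), F(5, 3)                    # exact operands (arbitrary)
eps_x, eps_y = F(1, 2**16), F(-1, 2**17)   # arbitrary errors attached to x and y
z = x * y
eps_z = x * eps_y + y * eps_x + eps_x * eps_y   # error propagated by the product
assert (y * eps_z - z * eps_y) / (y * (y + eps_y)) == eps_x
```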

We end this section by extending the operations to the values of the abstract domain \(\mathcal {D}^\sharp \) of Sect. 3.1. First, let \(p\in \mathbb {N}\), let \(m\in \{-\infty ,+\infty ,\sim _e, \sim _a,0\}\) be a rounding mode and let \(\circ _{p,m}\ : \mathbb {F}\rightarrow \mathbb {F}_p\) be the rounding function which returns the roundoff of a number at precision p using the rounding mode m. We write \(\overset{\rightarrow }{\boxplus }\) and \(\overset{\leftarrow }{\boxplus }\) the forward and backward addition and \(\overset{\rightarrow }{\boxtimes }\) and \(\overset{\leftarrow }{\boxtimes }\) the forward and backward products on \(\mathcal {D}^\sharp \). These functions are defined in Fig. 5. The forward functions \(\overset{\rightarrow }{\boxplus }\) and \(\overset{\rightarrow }{\boxtimes }\) take two operands \([\underline{x},\overline{x}]_p\) and \([\underline{y},\overline{y}]_q\) and return the resulting abstract value \([\underline{z},\overline{z}]_r\). The backward functions take three arguments: the operands \([\underline{x},\overline{x}]_p\) and \([\underline{y},\overline{y}]_q\) known from the forward pass and the result \([\underline{z},\overline{z}]_r\) computed by the backward pass [20]. Then \(\overset{\leftarrow }{\boxplus }\) and \(\overset{\leftarrow }{\boxtimes }\) compute the backward value \([\underline{x'},\overline{x'}]_{p'}\) of the first operand. The backward value of the second operand can be obtained by inverting the operands \([\underline{x},\overline{x}]_p\) and \([\underline{y},\overline{y}]_q\). An important point in these formulas is that, in forward mode, the resulting intervals inherit the minimal accuracy computed for their bounds while, in backward mode, the maximal accuracy computed for the bounds is assigned to the interval.

4 Constraint Generation

In this section, we introduce our system of constraints. The transfer functions of Sect. 3 are not directly translated into constraints because the resulting system would be too difficult to solve, containing non-linear constraints among non-integer quantities. Instead, we reduce the problem to a system of constraints made of linear relations between integer elements only. Sections 4.1 and 4.2 introduce the constraints for arithmetic expressions and programs, respectively.

4.1 Constraints for Arithmetic Expressions

In this section, we introduce the constraints generated for arithmetic expressions. As mentioned in Sect. 2, we assume that a range analysis is performed before the accuracy analysis and that a bounding interval is given for each variable and each value at any control point of the input programs.

Let us start with the forward operations. Let \(x_p\in \mathbb {F}_p\) and \(y_q\in \mathbb {F}_q\) and let us consider the operation \(\overset{\rightarrow }{\oplus }(x_p,y_q)=z_r\). We know from Eq. (12) that \(r_+ = \mathsf {ufp}(x+y)-\mathsf {ufp}(\varepsilon _+)\) with \(\varepsilon _+=\varepsilon (x_p)+\varepsilon (y_q)\). We need to over-approximate \(\varepsilon _+\) in order to ensure the accuracy \(r_+\). Let \(a=\mathsf {ufp}(x)\) and \(b=\mathsf {ufp}(y)\). We have \(\small \varepsilon (x)< 2^{a-p+1}\) and \(\varepsilon (y)< 2^{b-q+1} \) and, consequently, \(\small \varepsilon _+ < 2^{a-p+1}+ 2^{b-q+1}. \) We introduce the function \(\iota \) defined by \(\small \iota (u,v) = \left\{ \begin{array}{l} 1\ \text {if}\ u=v,\\ 0\ \text {otherwise} \end{array}\right. \). We have

$$\begin{aligned} \small \mathsf {ufp}(\varepsilon _+)&< \max (a-p+1,b-q+1)+\iota (a-p,b-q) \\ &\le \max (a-p,b-q)+\iota (a-p,b-q) \end{aligned}$$

and we conclude that

$$\begin{aligned} \small r_+ = \mathsf {ufp}(x+y)-\max (a-p,b-q)-\iota (a-p,b-q). \end{aligned}$$
(20)

Note that, since we assume that a range analysis has been performed before the accuracy analysis, \(\mathsf {ufp}(x+y)\), a and b are known at constraint generation time. For the forward product, we know from Eq. (13) that \(r_\times = \mathsf {ufp}(x\times y)-\mathsf {ufp}(\varepsilon _\times )\) with \(\varepsilon _\times =x\cdot \varepsilon (y_q)+y\cdot \varepsilon (x_p)+\varepsilon (x_p)\cdot \varepsilon (y_q)\). Again, let \(a=\mathsf {ufp}(x)\) and \(b=\mathsf {ufp}(y)\). We have, by definition of \(\mathsf {ufp}\), \(\small 2^a \le x< 2^{a+1}\quad \text {and}\quad 2^b \le y< 2^{b+1}\). Then \(\varepsilon _\times \) may be bounded by

$$\begin{aligned} \small \varepsilon _\times&< 2^{a+1}\cdot 2^{b-q+1} + 2^{b+1}\cdot 2^{a-p+1}+ 2^{a-p+1}\cdot 2^{b-q+1}\\ &= 2^{a+b-q+2}+ 2^{a+b-p+2}+ 2^{a+b-p-q+2}. \end{aligned}$$

Since \(a+b-p-q+2< a+b-p+2\) and \(a+b-p-q+2< a+b-q+2\), we may get rid of the last term of the former equation and we obtain that

$$\begin{aligned} \small \mathsf {ufp}(\varepsilon _\times )&< \max (a+b-p+2,a+b-q+2)+\iota (p,q)\\ &\le \max (a+b-p+1,a+b-q+1)+\iota (p,q). \end{aligned}$$

We conclude that

$$\begin{aligned} \small r_\times = \mathsf {ufp}(x\times y)-\max (a+b-p+1,a+b-q+1)-\iota (p,q). \end{aligned}$$
(21)

Note that, by reasoning on the exponents of the values, the constraints resulting from a product become linear. We consider now the backward transfer functions. If \(\overset{\leftarrow }{\oplus }(z_r,y_q)=x_{p_+}\) then we know from Eq. (18) that \(p_+=\mathsf {ufp}(z-y)-\mathsf {ufp}(\varepsilon _+)\) with \(\varepsilon _+=\varepsilon (z_r)-\varepsilon (y_q)\). Let \(c=\mathsf {ufp}(z)\); we over-approximate \(\varepsilon _+\) using the relations \(\small \varepsilon (z_r) < 2^{c-r+1}\) and \(\varepsilon (y_q)>0 \). So, \(\mathsf {ufp}(\varepsilon _+)< c-r+1\) and

$$\begin{aligned} \small p_+ = \mathsf {ufp}(z-y)-c+r \end{aligned}$$
(22)

Finally, for the backward product, using Eq. (19) we know that if \(\overset{\leftarrow }{\otimes }(z_r,y_q)=x_{p_\times }\) then \(p_\times = \mathsf {ufp}(x)-\mathsf {ufp}(\varepsilon _\times )\) with \(\small \varepsilon _\times =\frac{y\cdot \varepsilon (z)-z\cdot \varepsilon (y)}{y\cdot (y+\varepsilon (y))}. \) Using the relations \( 2^b\le y < 2^{b+1}\), \(2^c\le z < 2^{c+1}\), \(\varepsilon (y)< 2^{b-q+1}\) and \(\varepsilon (z) < 2^{c-r+1}\), we deduce that \(\small y\cdot \varepsilon (z)-z\cdot \varepsilon (y)< 2^{b+c-r+2}-2^{b+c-q+1} \) and that \( \frac{1}{y\cdot (y+\varepsilon (y))}<2^{-2b}\). Consequently, \(\small \varepsilon _\times < 2^{-2b}\cdot ( 2^{b+c-r+2}-2^{b+c-q+1}) \le 2^{c-b-r+1}-2^{c-b-q} \) and it results that

$$\begin{aligned} \small p_\times = \mathsf {ufp}(x)-\max (c-b-r+1,c-b-q). \end{aligned}$$
(23)
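Like the forward rules, Eqs. (22) and (23) only involve integers known after the range analysis. A small sketch (ours, with hypothetical parameter names) makes this explicit.

```python
def backward_add_accuracy(ufp_zy: int, c: int, r: int) -> int:
    """Eq. (22): accuracy required on x so that x + y reaches accuracy r;
    ufp_zy = ufp(z - y) and c = ufp(z) come from the range analysis."""
    return ufp_zy - c + r

def backward_mul_accuracy(ufp_x: int, b: int, c: int, q: int, r: int) -> int:
    """Eq. (23): accuracy required on x so that x * y reaches accuracy r;
    b = ufp(y), c = ufp(z), q = accuracy of y."""
    return ufp_x - max(c - b - r + 1, c - b - q)

# Example of Sect. 2.2: z = 4.0 (ufp 2) must have 10 accurate bits and y = 1.0,
# so the operand x = 3.0 (z - y = 3.0, ufp 1) only needs 1 - 2 + 10 = 9 bits.
print(backward_add_accuracy(ufp_zy=1, c=2, r=10))
```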
Fig. 6. Constraint generation for arithmetic expressions.

Fig. 7. Constraint generation for commands.

4.2 Systematic Constraint Generation

To explain the constraint generation, we use the simple imperative language of Eq. (24) in which a unique label \(\ell \in \text {Lab}\) is attached to each expression and command to identify without ambiguity each node of the syntactic tree.

$$\begin{aligned} \small \begin{array}{l} e\ {::=}\ c\texttt {\#p}^{\ell }\ \mid \ x^{\ell }\ \mid \ e_1 \diamond ^{\ell } e_2\quad \text {with}\ \diamond \in \{+,-,\times ,\div \},\\ c\ {::=}\ x\ \texttt {:=}^{\ell }\ e\ \mid \ c_1\ \texttt {;}\ c_2\ \mid \ \texttt {if}^{\ell }\ e\ \texttt {then}\ c_1\ \texttt {else}\ c_2\ \mid \ \texttt {while}^{\ell }\ e\ \texttt {do}\ c\ \mid \ \texttt {require\_accuracy(}x\texttt {,}n\texttt {)}^{\ell } \end{array} \end{aligned}$$
(24)

As in Sect. 2, \(c\texttt {\#p}\) denotes a constant c with accuracy p and the statement \(\texttt {require\_accuracy(}x\texttt {,}n\texttt {)}^{\ell }\) indicates that x must have at least accuracy n at control point \(\ell \). The set of identifiers occurring in the source program is denoted \(\text {Id}\). Concerning the arithmetic expressions, we assign to each label \(\ell \) of the expression three variables in our system of constraints, \(\mathsf {acc}_F(\ell )\), \(\mathsf {acc}_B(\ell )\) and \(\mathsf {acc}(\ell )\), respectively corresponding to the forward, backward and final accuracies, and we systematically generate the constraints \(\small 0 \le \mathsf {acc}_B(\ell )\le \mathsf {acc}(\ell )\le \mathsf {acc}_F(\ell )\).

For each control point in an arithmetic expression, we assume given a range \([\underline{\ell },\overline{\ell }]\subseteq \mathbb {F}\), computed by static analysis, which bounds the values possibly occurring at point \(\ell \) at run-time. Our constraints use the units in the first place \(\mathsf {ufp}(\underline{\ell })\) and \(\mathsf {ufp}(\overline{\ell })\) of these ranges. Let \(\varLambda \ :\ \text {Id}\rightarrow \text {Id}\times \text {Lab}\) be an environment which relates each identifier x to its last assignment \(x^\ell \): assuming that \(x\ \texttt {:=}^\ell e^{\ell _1}\) is the last assignment of x, the environment \(\varLambda \) maps x to \(x^\ell \) (we will use join operators when control flow branches are considered). Then \(\mathcal {E}[e]\ \varLambda \) generates the set of constraints for the expression e in the environment \(\varLambda \). These constraints, defined in Fig. 6, are derived from the equations of Sect. 4.1. For commands, labels are used to distinguish multiple assignments of the same variable or to implement joins in conditions and loops. Given a command c and an environment \(\varLambda \), \(\mathcal {C}[c]\ \varLambda \) returns a pair \((C,\varLambda ')\) made of a set C of constraints and of a new environment \(\varLambda '\). \(\mathcal {C}\) is defined by induction on the structure of commands in Fig. 7. These constraints join values at control flow junctions and propagate the accuracies as described in Sect. 2. In forward mode, accuracy decreases while in backward mode accuracy increases (we weaken pre-conditions and strengthen post-conditions).
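To give the flavour of \(\mathcal {E}[e]\ \varLambda \), here is a deliberately simplified constraint generator for constants and additions only; the class and variable names are ours, ranges are reduced to the concrete values of the constants, and the constraints of Eq. (6) and the full rules of Figs. 6 and 7 are not reproduced.

```python
import math
from dataclasses import dataclass
from z3 import Int, If, And

@dataclass
class Cst:            # constant c#p at control point 'label'
    value: float
    acc: int
    label: str

@dataclass
class Add:            # e1 +^label e2
    left: object
    right: object
    label: str

def ufp(x: float) -> int:
    return math.floor(math.log2(abs(x)))

def gen(e, constraints):
    """Return (value bound, forward accuracy variable) of e, appending its constraints."""
    accF = Int(f'accF_{e.label}')
    if isinstance(e, Cst):
        constraints.append(accF == e.acc)
        return e.value, accF
    if isinstance(e, Add):
        xv, xF = gen(e.left, constraints)
        yv, yF = gen(e.right, constraints)
        ex, ey = ufp(xv) - xF, ufp(yv) - yF
        u = If(ex >= ey, ex, ey) + If(ex == ey, 1, 0)
        constraints.append(accF == ufp(xv + yv) - u)     # forward rule, Eq. (20)
        return xv + yv, accF

cs = []
gen(Add(Cst(3.0, 16, 'l1'), Cst(1.0, 16, 'l2'), 'l3'), cs)
print(And(cs))        # the conjunction handed to the SMT solver
```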

5 Experimental Results

In this section we present some experimental results obtained with our prototype. Our tool generates the constraints defined in Sect. 4 and calls the Z3 SMT solver [21] in order to obtain a solution. Since, when they exist, solutions are in general not unique, we add to the constraints of Figs. 6 and 7 an additional constraint related to a cost function \(\varphi \). The cost function \(\varphi (c)\) of a program c computes the sum of all the accuracies of the variables and intermediary values stored in the control points of the arithmetic expressions, \( \varphi (c) = \sum _{x\in \text {Id},\ \ell \in \text {Lab}} \mathsf {acc}(x^\ell )\ +\ \sum _{\ell \in \text {Lab}} \mathsf {acc}(\ell )\). Then, by binary search, our tool searches for the smallest integer P such that the system of constraints \((\mathcal {C}[c]\ \varLambda _\bot ) \cup \{ \varphi (c)\le P\}\) admits a solution (we aim at using an optimizing solver in future work [6, 25, 28]). In our implementation we assume that, in the worst case, all the values are in double precision; consequently, we start the binary search with \(P\in [0, 52\times n]\) where n is the number of variables and intermediary values stored in the control points. When a solution is found for some P, a new iteration of the binary search is run with a smaller P. Otherwise, a new iteration is run with a larger P.
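The binary search on the cost bound P can be sketched on top of z3py as follows; the constraints and accuracy variables below are placeholders standing for \(\mathcal {C}[c]\ \varLambda _\bot \) and the \(\mathsf {acc}\) variables of a real program.

```python
from z3 import Ints, Solver, Sum, sat

acc_x, acc_y, acc_z = Ints('acc_x acc_y acc_z')
acc_vars = [acc_x, acc_y, acc_z]
constraints = [acc_x >= 10, acc_y >= acc_x - 1, acc_z >= 12,   # stand-in constraints
               acc_x <= 52, acc_y <= 52, acc_z <= 52]

cost = Sum(acc_vars)                       # the cost function phi
lo, hi, best = 0, 52 * len(acc_vars), None
while lo <= hi:
    P = (lo + hi) // 2
    s = Solver()
    s.add(constraints + [cost <= P])
    if s.check() == sat:
        best, hi = s.model(), P - 1        # feasible: try a smaller accuracy budget
    else:
        lo = P + 1                         # infeasible: the budget must grow
print(best)
```

With an optimizing solver the same effect could be obtained in a single call, for instance with Z3's Optimize object and its minimize method, which is the direction mentioned above for future work.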

Fig. 8. Examples of mixed-precision inference. Source programs, inferred accuracies and formats. Top: \(3\times 3\) determinant. Middle: Horner’s scheme. Bottom: a PD controller.

Fig. 9. Measures of efficiency of the analysis on the codes of Figs. 2 and 8.

We consider three sample codes displayed in Fig. 8. The first program computes the determinant \(\det (M)\) of a \(3\times 3\) matrix \(M=\left( \begin{array}{ccc} a & b & c \\ d & e & f \\ g & h & i \end{array}\right) \). We have \(\small \det (M)= (a\cdot e\cdot i + d\cdot h\cdot c + g\cdot b\cdot f) - (g\cdot e\cdot c +a\cdot h\cdot f +d\cdot b\cdot i). \) The matrix coefficients belong to the ranges \(\left( \begin{array}{ccc} [-10.1,10.1] & [-10.1,10.1] & [-10.1,10.1] \\ {[-20.1,20.1]} & [-20.1,20.1] & [-20.1,20.1] \\ {[-5.1,5.1]} & [-5.1,5.1] & [-5.1,5.1] \end{array}\right) \) and we require that the variable det containing the result has accuracy 10, which corresponds to a fairly rounded half precision number. By default, we assume that in the original program all the variables are in double precision. Our tool infers that all the computations may be carried out in half precision.

The second example of Fig. 8 concerns the evaluation of a degree 9 polynomial using Horner’s scheme: \(\small p(x) = a_0+\left( x \times \big ( a_1 + x \times (a_2+ \ldots ) \big ) \right) \). The coefficients \(a_i, 0\le i\le 9\), belong to \([-0.2,0.2]\) and \(x\in [-0.5,0.5]\). Initially all the variables are in double precision and we require that the result is fairly rounded in single precision. Our tool then computes that all the variables may be in single precision, except p which must remain in double precision. Our last example is a proportional differential controller. Initially the measure m is given by a sensor which sends values in \([-1.0,1.0]\) and which ensures an accuracy of 32. All the other variables are assumed to be in double precision. As shown in Fig. 8, many variables may fit inside single precision formats.

For each program, we give in Fig. 9 the number of variables of the constraint system as well as the number of constraints generated. Next, we give the total execution time of the analysis (including the generation of the system of constraints and the calls to the SMT solver done by the binary search). Then we give the number of bits needed to store all the values of the programs, assuming that all the values are stored in double precision (column #Bits-Init.) and as computed by our analysis (column #Bits-Optim.). Finally, the number of calls to the SMT solver done during the binary search is displayed. Globally, we can observe that the numbers of variables and constraints are rather small and very tractable for the solver. This is confirmed by the execution times, which are very short. The improvement in the number of bits needed to fulfill the requirements, compared to the number of bits needed if all the computations are done in double precision, ranges from \(57\%\) to \(83\%\), which is very significant.

6 Conclusion

We have defined a static analysis which determines the floating-point formats needed to ensure a given accuracy. This analysis is done by generating a set of linear constraints between integer variables only, even if the programs contain non-linear computations. These constraints are easy to solve by an SMT solver.

Our technique can be easily extended to other language structures. For example, since all the elements of an array must have the same type, we just need to join all the elements in a same abstract value to obtain a relevant result. Similarly, functions are also easy to manage since only one type per argument and per returned value is needed. Our analysis is built upon a range analysis performed beforehand. Obviously, the precision of this range analysis impacts the precision of the floating-point format determination, and the inference of sharp ranges, as given by relational domains, improves the quality of the results. In future work, we aim at exploring the use of a solver based on optimization modulo theories [6, 25, 28] instead of the non-optimizing solver coupled with a binary search used presently.