
1 Introduction

With the wide availability of processors with hardware floating-point units, many critical applications, such as control-command systems for the automotive, aeronautic and space industries, which have stringent correctness requirements and whose failures may have catastrophic consequences endangering human life [1, 9], rely heavily on floating-point operations. Without an extensive background in numerical accuracy and computer arithmetic, developers tend to use the highest precision available in hardware (usually double precision). While the results are generally more accurate, this significantly increases the application runtime, the bandwidth requirements and the memory and energy consumption of the system. Throughout this article, the term precision denotes the amount of information used to represent a value, while the term accuracy denotes how close a floating-point computation comes to the real value. The challenge is to use no more precision than needed, wherever possible, without compromising overall accuracy (using too low a precision for a given algorithm and data set leads to inaccurate results). To address the problem of determining the accuracy of floating-point computations, many efforts have been made to automate the choice of the best precision, by dynamic or static methods [5, 10, 15, 16], but these methods differ strongly in the way accuracy is determined. In this article, we are interested in the problem of determining the minimal precision of the inputs and intermediary results of a program performing floating-point computations in order to reach a desired accuracy on the outputs. Often in these programs, it is possible to reduce the floating-point precision of certain variables in order to increase performance; for example, the throughput of single-precision floating-point operations is twice that of double-precision operations. The tool proposed in this article applies mixed precision to the formats of floating-point programs. Mixed-precision computing [10] is an approach that combines different precisions for different floating-point variables (contrarily to uniform precision). Our approach combines a forward and a backward error analysis, two popular paradigms of error analysis, carried out by abstract interpretation [3]. The forward analysis is classical: it examines how errors are magnified by each operation in order to determine the accuracy of the results [11]. Next, a user requirement gives the final accuracy wanted at some control points of the outputs. Taking into consideration the user assertions and the results of the forward analysis, the backward analysis is a complementary approach that starts from the computed answer and determines the accuracy of the inputs that would produce it, in order to satisfy the desired accuracy. As could be expected, the forward and backward analyses can be applied iteratively to refine the results until a fixed point is reached. These forward and backward transfer functions are then expressed as a set of constraints made of propositional logic formulas and linear relations between integers only. Finally, these constraints are easily checked by an SMT solver (Z3 in practice [7]).

The main contributions of this article are the following. First, we introduce refinements of the automated approach based on a static forward and backward analysis introduced in [11]. This approach is explained in detail, in particular for the cases of the addition, multiplication and subtraction arithmetic expressions. Furthermore, our contribution revolves around the function \(\iota \), defined in [11] and redefined further in this work (see Fig. 2). The function \(\iota \) models the carry bit that can occur throughout floating-point computations (generally \(\iota = 1\)). Intuitively, a too conservative static analysis would consider that a carry can be propagated at each operation, which corresponds to \(\iota = 1\). This assumption becomes very costly if we perform several computations at a time and the resulting error estimates would be considerable. It is then crucial to use the most precise function \(\iota \) possible. This is why we reexamine this function in this work, by sorting out the different cases where it is equal to 1 or 0, depending on the difference in magnitude of the two floating-point numbers and on the relative positions of their ulp and ufp, defined in Sect. 3.1. The analysis is then expressed as a set of propositional formulas on linear constraints between integer variables only (checked by Z3). The transformed program is guaranteed to use variables of lower precision, with a minimal number of bits, compared to the original program. Second, we present the construction of our new tool, POP, which analyzes any program written in our simple imperative language and implements the proposed approach. We also present some experimental results showing the efficiency of our mixed-precision tool in determining the minimal precision required.

The rest of this article is organized as follows. Section 2 gives an overview of our approach on a motivating example. Section 3 briefly introduces some basic concepts of floating-point arithmetic and discusses related work on existing precision tuning tools. Section 4 deals with the forward and backward static error analysis by constraints generation, illustrated with some examples. The implementation of our tool and the constraints resolution are presented in Sect. 5, experimental results are given in Sect. 6 and we conclude in Sect. 7.

2 Overview

To better explain what POP does, a motivating example of a floating-point program is given in Fig. 1: it implements a simple scalar product of two vectors x and y whose components mix small and large floating-point values. The components of vector x belong to [1.0, 2.0], [10.0, 15.0] and [100.0, 110.0], and those of vector y to [100.0, 110.0], [5.0, 10.0] and [450.0, 500.0], respectively. In this example, we suppose that all variables are in double precision before the analysis (original program on the left-hand side of Fig. 1) and that a range determination is performed by dynamic analysis on these variables to make sure that no overflow can arise. We generate at each node of the program syntactic tree a unique control point, in order to determine easily the final accuracy after the forward and backward analysis, as shown on the left side of Fig. 1. Our program contains several annotations. For example, on the left-hand side of Fig. 1, the variables \(\mathtt{x_1}\) and \(\mathtt{y_1}\) are initialized to the abstract values [1.0, 2.0] and [10.0, 15.0] (in double precision), respectively, annotated with their control points thanks to the annotations \(\mathtt{x}_1^{|1|}\) = \({[}1.0,2.0{]}^{|0|}\) and \(\mathtt{y}_1^{|3|}\) = \({[}10.0,15.0{]}^{|2|}\). We also have the statement

$$\begin{aligned} \mathtt {require\_accuracy (v, 23)^{|40|}} \end{aligned}$$

which informs the system that the user wants variable v to be in single precision at this control point. As a consequence, the minimal precision needed on the inputs and intermediary results to satisfy the user assertion is observed on the right side of Fig. 1. For example, the variable \(\mathtt{x}_1\) passes from double to single precision thanks to the annotation \(\mathtt{x}_1^{\sharp 21} = [1.0, 2.0]^{\sharp 22}\) (a floating-point number in single precision has 22 accurate digits). The results obtained show that POP, at present, automates precision tuning and propagates the user requirement along the program inputs and intermediary results.
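As a quick, tool-independent sanity check of such a requirement, one can compare an all-double and an all-single evaluation of the dot product. The snippet below is not part of POP; the concrete vector values are arbitrary representatives of the ranges of Fig. 1.

```python
import numpy as np

# Arbitrary representatives of the input ranges of Fig. 1 (not POP syntax).
x64 = np.array([1.5, 12.5, 105.0], dtype=np.float64)
y64 = np.array([105.0, 7.5, 475.0], dtype=np.float64)

v64 = np.dot(x64, y64)                                        # all-double reference
v32 = np.dot(x64.astype(np.float32), y64.astype(np.float32))  # all-single variant

# Rough number of accurate bits of the single-precision result w.r.t. the double
# one, computed as ufp(result) - ufp(error).
err = abs(float(v64) - float(v32))
acc = float('inf') if err == 0.0 else int(np.floor(np.log2(abs(v64))) - np.floor(np.log2(err)))
print(v64, v32, acc)
```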

Fig. 1. Simple scalar product of two vectors. The program on the left shows the initial program in double precision, annotated with labels; on the right, the program after analysis, annotated with the final accuracy at each label, according to the user requirement.

3 Preliminary Notions

This section provides some background on the IEEE754 Standard of floating-point arithmetic: formats, rounding modes, errors and the ufp and ulp functions. Note that several definitions of ulp exist in the literature [12].

3.1 Basics on Floating-Point Arithmetic

The IEEE754 Standard formalizes a binary floating-point number x in base \(\beta \) (generally \(\beta \) = 2) as a triplet made of a sign, a mantissa and an exponent as shown in Eq. (1), where s \(\in \) {-1,1} is the sign, m represents the mantissa, \(m = d_0.d_1 ... d_{p-1}\), with the digits \(0 \le d_i < \beta \), \(0 \le i \le p - 1\), p is the precision (length of the mantissa) and the exponent e \(\in \) \([e_{min}, e_{max}]\).

$$\begin{aligned} x = s.m.\beta ^{e-p+1} \end{aligned}$$
(1)
Table 1. Parameters defining basic format floating-point numbers

The IEEE754 Standard specifies some particular values for p, \(e_{min}\) and \(e_{max}\) [4]. This standard also defines binary formats (with \(\beta \) = 2), which are described in Table 1. The IEEE754 Standard distinguishes between normalized and denormalized numbers: the normalization of a floating-point number, ensuring \(d_0\ne 0\), guarantees the uniqueness of its representation, while denormalized numbers make underflow gradual [13]. The IEEE754 Standard also defines some special numbers; all these numbers are summarized in Table 2 (in Binary64). Moreover, the IEEE754 Standard defines four rounding modes for elementary operations over floating-point numbers: towards \(+\infty \), towards \(-\infty \), towards zero and towards the nearest, denoted by \(\uparrow _{+\infty }\), \(\uparrow _{-\infty }\), \(\uparrow _{0}\) and \(\uparrow _{\sim }\), respectively. Finally, we present the ufp (unit in the first place) and ulp (unit in the last place) functions, which give the weight of the most significant bit and the weight of the least significant bit of a number, respectively. In practice, these functions are used further in this article to describe the error propagation across the computations. They are defined in Eqs. (2) and (3), as in [11].

$$\begin{aligned} \textit{ufp}(x) = \min \{i \in \mathbb {Z} : 2^{i+1} > x \} = \lfloor \log _2(x) \rfloor \end{aligned}$$
(2)

Let p be the size of the significand; the ulp of a floating-point number x can then be expressed as:

$$\begin{aligned} \textit{ulp}(x) = \textit{ufp}(x)-p+1. \end{aligned}$$
(3)
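
For illustration only, these two functions can be sketched in a few lines of Python, assuming a non-zero argument x (recall that ufp and ulp denote the exponents of the corresponding bit weights):

```python
import math

def ufp(x: float) -> int:
    """Unit in the first place (Eq. (2)): exponent of the most significant bit of |x|."""
    return math.floor(math.log2(abs(x)))

def ulp(x: float, p: int) -> int:
    """Unit in the last place (Eq. (3)) for a significand of p bits."""
    return ufp(x) - p + 1

# Example: in Binary64 (p = 53), ufp(10.0) = 3 and ulp(10.0, 53) = -49.
```
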
Table 2. Numbers in double precision

3.2 Related Work

There have been many efforts to automate the process of determining the best floating-point formats. Darulova and Kuncak [5] proposed a static analysis method to compute error propagation: if the computed bound on the accuracy satisfies the post-conditions, the analysis is run again with a smaller format, and this process is repeated until the best format is found. Contrarily to our tool, all their values have the same format (uniform precision). Other methods rely on dynamic analysis. For instance, Precimonious is a dynamic, automated search-based tool that executes and evaluates different mixed-precision configurations of the program in order to identify the best configuration satisfying the error threshold [15]. We also mention Blame Analysis [16], a novel dynamic method that speeds up precision tuning by combining concrete and shadow program executions; the analysis determines the precision of all operands such that a given precision is achieved in the final result. To be more efficient, Blame Analysis has been combined with Precimonious, significantly reducing the analysis time, and this combined approach has shown better results in terms of program speedup than Blame Analysis alone. Nonetheless, floating-point tuning of entire applications is not yet feasible with this method. Moreover, Lam et al. [10] instrument binary codes in order to modify their precision without modifying the source code. They also propose a dynamic search method to identify the parts of the code where the precision should be modified. The major drawback of these tools is that the state space is exponential in the number of variables, and exploring even a subset is very time-intensive.

Finally, there are various rigorous static analysis approaches that use interval and affine arithmetic or Taylor series approximations to analyze stability and to provide rigorous bounds on rounding errors. However, they do not scale very well and therefore have not been applied to high precision computing workloads. In this context, Chiang et al. [2] have proposed an approach that allocates a precision to the terms of arithmetic expressions only, but they need to solve a quadratically constrained quadratic program to obtain their annotations. Also, Solovyev et al. [17] have proposed the FP-Taylor tool, which implements a method to estimate round-off errors of floating-point computations called Symbolic Taylor Expansions.

4 Static Analysis by Constraints Generation

In this section, we refine the computation of the forward and backward transfer functions used by the POP tool, introduced in [11], for the cases of the addition, product and subtraction. These functions are defined using the units in the first and last places introduced in Eqs. (2) and (3). Next, they are formalized as a set of constraints made of propositional logic formulas and affine expressions over integers.

4.1 Forward and Backward Error Analysis

Forward Addition, Multiplication and Subtraction. We introduce the forward transfer functions corresponding to the addition \(\overrightarrow{\oplus }\), product \(\overrightarrow{\otimes }\) and subtraction \(\overrightarrow{\ominus }\) of two floating-point numbers \(x \in \mathbb {F}_p\) and \(y \in \mathbb {F}_q\), where \(\mathbb {F}_p\) and \(\mathbb {F}_q\) denote the sets of floating-point numbers of accuracy p and q, respectively. In Eq. (4), each of the operands \(x_{p_{p'}}\) and \(y_{q_{q'}}\) and the result \(z_{r_{r'}}\) carries two parameters (p and \({p'}\), q and \({q'}\), r and \({r'}\)) which denote, respectively, the precision of the value and the precision of its error. In addition, and in contrast to [11], we introduce the truncation errors in order to be more precise in our computations. We denote the truncation errors by \(\varepsilon _+\), \(\varepsilon _\times \) and \(\varepsilon _-\) for the addition, product and subtraction operations, respectively.

Definition 1

The forward addition \(\overrightarrow{\oplus }\) is given as shown in Eq. (4):

$$\begin{aligned} \begin{aligned} \overrightarrow{\oplus }(x_{p_{p'}}, y_{q_{q'}}) = z_{r_{r'}} \quad where \quad r = \textit{ufp}(x_{p_{p'}} + y_{q_{q'}}) - \textit{ufp}(2^{\textit{ufp}(x_{p_{p'}}) - p + 1} + \\ 2^{\textit{ufp}(y_{q_{q'}}) - q + 1} + 2^{\textit{ufp}(z_{r_{r'}}) - \sigma _+} ) \end{aligned} \end{aligned}$$
(4)

In the sequel, we write x, y and z for \(x_{p_{p'}}\), \(y_{q_{q'}}\) and \(z_{r_{r'}}\). Let v be an exact value computed in infinite precision and let \(\widehat{v} = d_0.d_1 ... d_{p-1} \cdot 2^{e} \in \mathbb {F}_p\) be its floating-point approximation; these two values satisfy \(| v - \widehat{v}| \le 2^{e - p + 1}\). So, taking into account the definition of the function ufp in Eq. (2), for any \(x \in \mathbb {F}_p \) and \(y \in \mathbb {F}_q \), the errors \(\varepsilon _{x}\) on x and \(\varepsilon _{y}\) on y are bounded by:

$$\begin{aligned} \varepsilon _{x}< 2^{\textit{ulp(x)}} = 2 ^{ \textit{ufp}(x) - p + 1} \quad \text {and} \quad \varepsilon _{y} < 2^{\textit{ulp}(y)} = 2 ^{ \textit{ufp}(y) - q + 1} \end{aligned}$$
(5)

The truncation error of the addition of x and y with result z, for the rounding mode towards the nearest \(\uparrow _{\sim }\) defined by the IEEE754 Standard, is given by \(\varepsilon _+ \le \frac{1}{2} \cdot 2^{\textit{ulp}(z)} = 2^{\textit{ulp}(z) - 1}\), and we have \(\textit{ulp}(z) = \textit{ufp}(z) - \sigma _+ + 1\), where \(\sigma _+\) denotes the precision of the operator \(+\). Thus, the truncation error is as shown in Eq. (6):

$$\begin{aligned} \varepsilon _+ \le 2 ^{ufp(z) - \sigma _+} \end{aligned}$$
(6)

Definition 2

The forward product \(\overrightarrow{\otimes }\) is given as shown in Eq. (7):

$$\begin{aligned} \begin{aligned} \overrightarrow{\otimes }(x_{p_{p'}}, y_{q_{q'}}) = z_{r_{r'}} \quad where \quad r = \textit{ufp}(x_{p_{p'}} \times y_{q_{q'}}) - \textit{ufp}(2^{\textit{ufp}(x) + 1} . 2 ^ {\textit{ufp}(y) - q + 1} +\\ 2^{\textit{ufp}(y) + 1} . 2^{\textit{ufp}(x) - p + 1} + 2^{\textit{ufp}(x) - p + 1} . 2^{\textit{ufp}(y) - q + 1} + 2^{\textit{ufp}(z) - \sigma _\times }) \end{aligned} \end{aligned}$$
(7)

The error \(\varepsilon _{z_\times }\) of the multiplication of two floating-point numbers x and y with result z is \(\varepsilon _{z_\times } = y\cdot \varepsilon _{x} + x \cdot \varepsilon _{y} + \varepsilon _{x} \cdot \varepsilon _{y} + \varepsilon _\times \), where \(\varepsilon _\times \) is the truncation error of the product, bounded by \(\varepsilon _\times \le 2^{\textit{ufp}(z) - \sigma _\times }\) (for the rounding mode \(\uparrow _{\sim }\)), and where \(\sigma _\times \) denotes the precision of the operator \(\times \). The error \(\varepsilon _{z_\times }\) can then be bounded as shown in Eq. (8), using the bounds:

$$\begin{aligned} 2^{\textit{ufp}(x)} \le x< 2^{\textit{ufp}(x) + 1} \quad \text {and} \quad 2^{\textit{ufp}(y)} \le y < 2^{\textit{ufp}(y) + 1} \end{aligned}$$

and consequently,

$$\begin{aligned} \varepsilon _{z_\times }<2^{\textit{ufp}(x)+1}.2^{\textit{ufp}(y)-q+1}+2^{\textit{ufp}(y)+1}.2^{\textit{ufp}(x)-p+1}+2^{\textit{ufp}(x)-p+1}.2^{\textit{ufp}(y)-q+1}+ 2^{\textit{ufp}(z) - \sigma _\times } \end{aligned}$$
$$\begin{aligned} <2^{\textit{ufp}(x) + \textit{ufp}(y) - q + 2} + 2^{\textit{ufp}(x) + \textit{ufp}(y) - p +2} + 2^{\textit{ufp}(x) + \textit{ufp}(y) - p - q +2} + 2^{\textit{ufp}(z) - \sigma _\times } \end{aligned}$$

thus,

$$\begin{aligned} \varepsilon _{z_\times } \le 2^{\textit{ufp}(x) + \textit{ufp}(y) - q + 1} + 2^{\textit{ufp}(x) + \textit{ufp}(y) - p + 1} + 2^{\textit{ufp}(x) + \textit{ufp}(y) - p - q + 1} + 2^{\textit{ufp}(z) - \sigma _\times }. \end{aligned}$$
(8)

Definition 3

The forward subtraction \(\overrightarrow{\ominus }\) is given as shown in Eq. (9):

$$\begin{aligned} \begin{aligned} \overrightarrow{\ominus }(x_{p_{p'}}, y_{q_{q'}}) = z_{r_{r'}} \quad where \quad r = \textit{ufp}(x_{p_{p'}} - y_{q_{q'}}) - \textit{ufp}(2^{\textit{ufp}(x) - p + 1} - 2^{\textit{ufp}(y) - q + 1} - \\2^{\textit{ufp}(z) - \sigma _-}) \end{aligned} \end{aligned}$$
(9)

Using the same approach as in the addition case, we have \(2^{\textit{ufp}(x)} \le x < 2^{\textit{ufp}(x) + 1}\) and \(2^{\textit{ufp}(y)} \le y < 2^{\textit{ufp}(y) + 1}\), and the truncation error is \(\varepsilon _- \le 2^{\textit{ufp}(z) - \sigma _-}\), where \(\sigma _-\) is the precision of the operator \(-\). The subtraction error between x and y is bounded as mentioned in Eq. (9).

Backward Addition, Subtraction and Multiplication. Similarly, we introduce the backward transfer functions \(\overleftarrow{\oplus }\), \(\overleftarrow{\otimes }\) and \(\overleftarrow{\ominus }\), which take advantage of the forward transfer functions and of the accuracy requirement on the result; by combining these two pieces of information, it is then possible to lower the number of bits needed for one of the operands. We consider that x is unknown while the result z and the operand y are known. The backward functions for the considered arithmetic operations are given in the following definitions.

Definition 4

The backward transfer function for the addition \(\overleftarrow{\oplus }\) is given as shown:

$$\begin{aligned} \overleftarrow{\oplus }(z, y) = (z - y)_{p_{p'}} \quad with \quad p = \textit{ufp}(z - y) - \textit{ufp}(2 ^{ \textit{ufp}(z) - r + 1} - 2^{\textit{ufp}(y) - q + 1}- 2^{\textit{ufp}(x) - \sigma _{+}}) \end{aligned}$$
(10)

To apply the backward analysis, we assume that one of the operands is unknown (x in our case) while the result z is known. Then, we compute the precision p of the operand x with respect to the user accuracy requirement and the forward analysis result. As said before, the errors on the result and on the operand can be bounded by \(\varepsilon _{z_+} < 2 ^{\textit{ufp}(z) - r + 1}\) and \(\varepsilon _y < 2 ^{\textit{ufp}(y) - q + 1}\), and the truncation error is given by \(\varepsilon _+ \le 2 ^{\textit{ufp}(x) - \sigma _+}\).

Definition 5

We present the backward transfer function for the multiplication \(\overleftarrow{\otimes }\) as shown:

$$\begin{aligned} \begin{aligned} \overleftarrow{\otimes }(z,y)=(z \div y)_{p_{p'}} \quad with \quad \quad \quad \quad \quad \quad \quad \quad \\ p = \textit{ufp}(z \div y) - \textit{ufp} \Big (\frac{2^{\textit{ufp}(y) + 1}\cdot 2^{\textit{ufp}(z) - r + 1} - 2^{\textit{ufp}(z) + 1} \cdot 2^{\textit{ufp}(y) - q + 1}}{2^{\textit{ufp}(y) + 1}\big ( 2^{\textit{ufp}(y) + 1} + 2^{\textit{ufp}(y) - q + 1}\big )} - 2^{\textit{ufp}(x) - \sigma _\times }\Big ) \end{aligned} \end{aligned}$$
(11)

In the case of the product, we know that \(\overleftarrow{\otimes }(z,y)=(z \div y)_{p_{p'}}\) with \(p = \textit{ufp}(z \div y) - \textit{ufp}(\varepsilon _{z_\times })\), where the truncation error is \(\varepsilon _\times \le 2^{\textit{ufp}(x) - \sigma _\times }\) and the error \(\varepsilon _{z_\times }\) is bounded as shown in Eq. (11).

Definition 6

$$\begin{aligned} \overleftarrow{\ominus }(z, y)=(z + y)_{p_{p'}}\quad with \quad p = \textit{ufp}(z + y) - \textit{ufp}(2^{\textit{ufp}(z) - r + 1} + 2^{\textit{ufp}(y) - q + 1} + 2^{\textit{ufp}(x) - \sigma _-}) \end{aligned}$$
(12)

We know that the roundoff errors are bounded by \(\varepsilon _z < 2^{\textit{ufp}(z) - r + 1}\) and \(\varepsilon _y < 2^{\textit{ufp}(y) - q + 1}\), that the truncation error is \(\varepsilon _- \le 2 ^{\textit{ufp}(x) - \sigma _-}\), where \(\sigma _-\) denotes the precision of the operator \(-\), and that the error in Eq. (12) is given by \(\varepsilon _{z_-} = \varepsilon _x - \varepsilon _y - \varepsilon _-\).

Obviously, our static analysis does not work on the scalar values of Eqs. (4) to (12) but on intervals instead. As described in [11], we abstract sets of values of \(\mathbb {F}_p\) using the connection in Eq. (13), where an element \(i^\sharp = [\underline{f},\overline{f}]_p \in \mathbb {I}_p\) is defined by two floating-point numbers and an accuracy p.

$$\begin{aligned} \mathbb {I}_p \ni [\underline{f}, \overline{f}]_p = \{f \in \mathbb {F}_p : \underline{f} \le f \le \overline{f}\}\quad with \quad \mathbb {I} = \bigcup _{p \in \mathbb {N}}{\mathbb {I}_p}. \end{aligned}$$
(13)

The operations \(\overrightarrow{\oplus }^\sharp \), \(\overleftarrow{\oplus }^\sharp \), \(\overrightarrow{\otimes }^\sharp \) and \(\overleftarrow{\otimes }^\sharp \) among values of \(\mathbb {I}_p\) are defined in [11] in terms of \(\overrightarrow{\oplus }\), \(\overleftarrow{\oplus }\), \(\overrightarrow{\otimes }\) and \(\overleftarrow{\otimes }\). In the rest of the article, we deal with the generation of constraints only for the addition and the product.

4.2 Constraints Generation

In this section, we describe how to generate constraints to determine the lowest precision of the variables and intermediary values of programs. An important definition given in this section is that of the function \(\iota \), computed on floating-point numbers; with this definition, we aim at being far more precise in the way we propagate errors across the arithmetic operations. The methodical difference between the function \(\iota (u, v)\) proposed in [11] and our new definition \(\iota (t, u, v, w)\) is that we take into consideration the ufp and ulp of the two operands in order to compare the two floating-point errors \(\alpha \) and \(\beta \), and we add an extra bit only if \(\textit{ulp}(\alpha )\) is less than \(\textit{ufp}(\beta )\) (0 otherwise). Compared to the former definition of [11], our new definition significantly improves the accuracy of the static analysis by being less pessimistic. As mentioned earlier, the transfer functions of Sect. 4.1 are not translated directly into constraints, because the resulting system would contain non-linear constraints and be too difficult to solve. Therefore, we reduce the problem to a constraint system consisting of propositional formulas on linear relations between integers only. In what follows, we introduce the constraints that we generate for the arithmetic expressions of interest.

Forward Operations. Going back to Eqs. (4) to (12), our goal is to compute the correct precision r and the precision \(r'\) of the error on the result (\(\varepsilon _{z_+}\) for the addition and \(\varepsilon _{z_\times }\) for the product) of the floating-point number z. Intuitively, for \(z = x + y\) with errors \(\varepsilon _x\), \(\varepsilon _y\) and \(\varepsilon _{z_+}\), we want to compute \(\textit{ufp}(\varepsilon _{z_+})\) as a function of the errors on the operands.

Proposition 1

Let \(x \in \mathbb {F}_{p}\) and \(y \in \mathbb {F}_{q}\), and let z be the result of the addition of these two floating-point numbers. In the worst case, a carry bit can occur through this operation, as proven in [12]:

$$\begin{aligned} \textit{ufp}(z) \le max (\textit{ufp}(x), \textit{ufp}(y)) + 1 \end{aligned}$$
(14)

As a matter of fact, Eq. (14) is correct but pessimistic (a too large over-approximation): adding an extra bit even in cases where no carry occurs becomes very costly when several computations are performed. In previous work [11], a function \(\iota \) was presented in order to refine Eq. (14): the units in the first place of the operands are compared and an extra bit is added only if they are equal, which is correct but lacks precision. In this work, we present our new definition of the function \(\iota \). Let \(x \in \mathbb {F}_{p}\) and \(y \in \mathbb {F}_{q}\); our strategy is to compare \(\textit{ulp}(x)\) with \(\textit{ufp}(y)\) and conversely (\(\textit{ulp}(y)\) with \(\textit{ufp}(x)\)). In Definition 7, we present the function \(\iota \), and Fig. 2 illustrates the cases where an extra bit can occur (\(\iota = 1\)) or not (\(\iota = 0\)).

Definition 7

We introduce the function \(\iota (t, u, v ,w)\) modelling the extra carry bit that can occur in operations between floating-point numbers:

$$\begin{aligned} \iota (t, u, v ,w) = {\left\{ \begin{array}{ll} 0 &{} u>t ~~ or~~ w >u, \\ 1 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(15)
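As a direct, purely illustrative transcription, Eq. (15) reads as follows in Python; the arguments t, u, v and w stand for the ufp/ulp quantities compared in the propositions below.

```python
def iota(t: int, u: int, v: int, w: int) -> int:
    """Carry-bit indicator of Eq. (15): 0 when no extra bit can occur, 1 otherwise.
    As in Eq. (15), the argument v does not appear in the condition."""
    return 0 if (u > t or w > u) else 1
```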

Proposition 2

In order to compute the function \(\iota \), we need to compute the units in the last place \(\textit{ulp}(\alpha )\) and \(\textit{ulp}(\beta )\). Considering \(p'\), the precision of \(\alpha \), from Eq. (3) we have \(\textit{ulp}(\alpha ) = \textit{ufp}(\alpha ) - p' + 1\), where \(\textit{ufp}(\alpha ) = \textit{ufp}(x) - p\). Consequently, we obtain:

$$\begin{aligned} \textit{ulp}(\alpha ) = \textit{ufp}(x) - p - p' + 1 \end{aligned}$$
(16)

We know from Eq. (3) that, for \(x \in \mathbb {F}_p\), the unit in the last place is \(\textit{ulp}(x) = \textit{ufp}(x) - p + 1\). This definition is also valid for \(\textit{ulp}(\varepsilon _x)\), with \(p'\) the precision of \(\varepsilon _x\). Moreover, since \(\textit{ulp}(x) = \textit{ufp}(x) - p + 1\), we have \(\textit{ufp}(\varepsilon _x) = \textit{ufp}(x) - p\), and we obtain the result of Eq. (16).

Fig. 2. Definition of the function \(\iota \). The figure on the left represents a case where \(\iota = 1\), so an extra bit can occur throughout the computations; the figure on the right corresponds to \(\iota = 0\).

Forward Addition: From Definition 7, Eq. (5) and Eq. (14), we obtain Proposition 3. As said before, if we sum \(z = x + y\), the error is \(\varepsilon _{z_+} = \varepsilon _x + \varepsilon _y + \varepsilon _+\). In order to apply the definition of the function \(\iota \), we decompose the total error \(\varepsilon _{z_+}\) into two parts: the roundoff error \(\varepsilon _{xy} = \varepsilon _x + \varepsilon _y\) and the truncation error \(\varepsilon _+\). For conciseness, we only detail one case of the \(\iota \) function (\(u > t\)).

Proposition 3

Let \(a = \textit{ufp}(x)\), \(b = \textit{ufp}(y)\) and \(c= \textit{ufp}(z)\),

$$\begin{aligned} \textit{ufp}(\varepsilon _{xy}) < \max (a - p + 1, b - q + 1) + \iota (a - p - p' + 1, b - q) \end{aligned}$$
(17)

Taking into account Eq. (17) above, \(\textit{ufp}(\varepsilon _{z_+})\) is then bounded by:

$$\begin{aligned} \textit{ufp}(\varepsilon _{z_+}) < \max \big (\max (a - p + 1, b - q + 1) + \iota (a - p - p' + 1, b - q), c -\sigma _+\big ) + \iota (a - p - p' + 1, b - q) \end{aligned}$$
(18)

which implies that the precision of the result z in this addition is

$$\begin{aligned} r = \textit{ufp} (x + y) - \max \big (\max (a - p + 1, b - q + 1) + \iota (a - p - p' + 1, b - q), c -\sigma _+\big ) - \iota (a - p - p' + 1, b - q). \end{aligned}$$
(19)

Proof

Formally, let \( \alpha = \sum \limits _{i=n_0}^{n_1}{\alpha _i 2^i}\) and \( \beta = \sum \limits _{i=m_0}^{m_1}{\beta _i 2^i}\) be two floating-point numbers, and let us assume that \(n_1 < m_0\). By the definitions of ufp and ulp, we have \(\textit{ufp}(\alpha ) = n_1\) and \(\textit{ulp}(\beta ) = m_0\), and then:

$$\begin{aligned} \alpha + \beta = \sum \limits _{i=n_0}^{m_1}{\gamma _i 2^i} \quad where \quad \gamma _i \quad = {\left\{ \begin{array}{ll} \alpha _i &{}\text {if} \quad i \in [n_0, n_1],\\ \beta _i &{}\text {if}\quad i \in [m_0, m_1],\\ 0 &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$

Finally, we conclude that \(\textit{ufp}(\varepsilon _{z_{+}}) = m_1\). In the case where \(n_0 > m_1\), we deduce that \(\textit{ufp}(\varepsilon _{z_{+}}) = n_1\). Then, from Eq. (18), we substitute this refined over-approximation of the total error \(\varepsilon _{z_+}\) and consequently deduce the precision r of Eq. (19).

It now remains to determine the precision \(r'\) of the error of the addition. To do so, we need to compute \(\textit{ulp}(\varepsilon _{z_+})\), as shown in Eq. (20). In the case of the addition, \(\textit{ulp}(\varepsilon _{z_+})\) is the smallest ulp of the two operand errors (\(\textit{ulp}(\varepsilon _x)\) and \(\textit{ulp}(\varepsilon _y)\)), and we finally conclude that the precision of the error is \(r' = \textit{ufp}(\varepsilon _{z_+}) - \textit{ulp}(\varepsilon _{z_+})\).

$$\begin{aligned} \textit{ulp}(\varepsilon _{z_+}) = min \big (\textit{ulp}(\varepsilon _{x}), \textit{ulp}(\varepsilon _{y})\big ) \end{aligned}$$
(20)
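To make the computation concrete, the following Python sketch gathers Eqs. (16) to (20). It is purely illustrative: ufp(x + y) is approximated by c = ufp(z), only the case u > t of the function \(\iota \) is tested, and the function name and interface are ours, not POP's.

```python
def forward_add_precisions(a, b, c, p, q, pp, qp, sigma_add):
    """Worst-case precisions (r, r') of z = x + y, following Eqs. (16)-(20).
    a = ufp(x), b = ufp(y), c = ufp(z) (used as an approximation of ufp(x + y));
    p, q are the operand precisions, pp and qp the precisions p', q' of the
    operand errors, and sigma_add the precision of the operator +."""
    carry = 0 if (b - q) > (a - p - pp + 1) else 1                           # iota, case u > t
    ufp_err = max(max(a - p + 1, b - q + 1) + carry, c - sigma_add) + carry  # Eq. (18)
    r = c - ufp_err                                                          # Eq. (19)
    ulp_err = min(a - p - pp + 1, b - q - qp + 1)                            # Eqs. (16) and (20)
    r_prime = ufp_err - ulp_err                                              # r' = ufp - ulp
    return r, r_prime
```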

Forward Multiplication. For the multiplication case, we apply our new Definition 7 and Eq. (8) and we present Proposition 4.

Proposition 4

Let a, b and c be three integers with \(a = \textit{ufp}(x)\), \(b = \textit{ufp}(y)\) and \(c = \textit{ufp}(z)\). We proceed as in the forward addition and decompose the total error \(\varepsilon _{z_\times }\) into the roundoff error \(\varepsilon _{xy} = y \cdot \varepsilon _x + x \cdot \varepsilon _y + \varepsilon _x \cdot \varepsilon _y\) and the truncation error \(\varepsilon _\times \). So, we have:

$$\begin{aligned} \textit{ufp}(\varepsilon _{xy}) < \max (a + b - p + 1, a + b - q + 1) + \iota (a - p - p' + 1, b - q) \end{aligned}$$
(21)

and the ufp of the total error \(\varepsilon _{z_\times }\) is then bounded by

$$\begin{aligned} \textit{ufp}(\varepsilon _{z_\times }) < \max \big (\max (a + b - p + 1, a + b - q + 1) + \iota (a - p - p' + 1, b - q), c -\sigma _\times \big ) + \iota (a - p - p' + 1, b - q) \end{aligned}$$
(22)

and then we deduce that

$$\begin{aligned} r = \textit{ufp} (x \times y) - \max \big (\max (a + b - p + 1, a + b - q + 1) + \iota (a - p - p' + 1, b - q), c -\sigma _\times \big ) - \iota (a - p - p' + 1, b - q). \end{aligned}$$
(23)

Next, as we proceeded in Eq. (20) for the addition, the unit in the last place of \(\varepsilon _{z_\times }\) is defined by

$$\begin{aligned} \textit{ulp}(\varepsilon _{z_\times }) = \textit{ulp}(\epsilon _{x}) + \textit{ulp}(\epsilon _{y}) \end{aligned}$$
(24)

By reasoning in the same way, we linearize the computations for the backward operations (addition and multiplication).

Backward Addition: We now consider the backward transfer functions, starting from Eq. (10) for the addition case. We know that \(p = \textit{ufp}(z - y) - \textit{ufp}(\varepsilon _{z} -\varepsilon _y - \varepsilon _+)\). So, again letting \(c= \textit{ufp}(z)\), we can over-approximate \(\varepsilon _{z}\) thanks to the relations \(\varepsilon _{z} < 2^{c - r +1}\), \(\varepsilon _y \ge 0 \) and \(\varepsilon _+ \ge 0\), and consequently

$$\begin{aligned} p = ufp(z - y) - c + r. \end{aligned}$$
(25)

Backward Multiplication: Again, we take \(a = ufp(x)\), \(b = ufp(y)\) and \(c = ufp(z)\). From Eq. (11), we know that \(2^{c}\le z < 2^{c+1}\), \(2^{b}\le y < 2^{b+1}\) and \(\varepsilon _{z_\times } < 2^{c-r+1}\), \(\varepsilon _{y} < 2^{b-q+1}\) which implies that \(y.\varepsilon _{z_\times }- z.\varepsilon _{y}<2^{c+b-r+2}-2^{b+c-q+2}\) and that

$$\begin{aligned} \frac{1}{y.(y+\varepsilon _{y})}<2^{-2b}. \end{aligned}$$

Consequently,

$$\begin{aligned} \varepsilon _{z_\times } \le 2^{-2b}.(2^{c+b-r+2} - 2^{b+c-q+2}) - 2^{a-\sigma _\times } \le 2^{c-b-r+1} - 2^{c-b-q+1} - 2^{a-\sigma _\times } \end{aligned}$$

and finally,

$$\begin{aligned} p = ufp(z \div y)- \max (\max (c-b-r+1,c-b-q+1), a-\sigma _\times ). \end{aligned}$$
(26)
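For completeness, Eqs. (25) and (26) translate just as directly; again, this is a sketch with names of our own choosing, not POP's actual Java implementation.

```python
def backward_add_precision(ufp_z_minus_y: int, c: int, r: int) -> int:
    """Eq. (25): precision p required on the unknown operand x of z = x + y."""
    return ufp_z_minus_y - c + r

def backward_mul_precision(ufp_z_div_y: int, a: int, b: int, c: int,
                           q: int, r: int, sigma_mul: int) -> int:
    """Eq. (26): precision p required on the unknown operand x of z = x * y."""
    return ufp_z_div_y - max(max(c - b - r + 1, c - b - q + 1), a - sigma_mul)
```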

5 The POP Tool

In this section, we present our tool POP (Precision OPtimizer): its architecture, its inputs (including the program file annotated with the developer's accuracy expectation and the parameters) and its outputs. We also illustrate the mechanism followed by POP to lower the precision of floating-point programs.

5.1 Architecture

We now present the main architecture of POP, also depicted in Fig. 3. POP is written in Java; each kind of expression, boolean expression and statement presented in Fig. 4 is represented by a package gathering the different classes of its definition. The tool is organized as follows:

  • Parser: It takes as input a file containing a floating-point program written in our simple imperative language. Before evaluating the program, we call the ANTLR (ANother Tool for Language Recognition) [14] framework in order to generate, from a grammar file, a parser that can build and walk a parse tree.

  • Range determination: It consists in executing the program a certain number of times in order to determine dynamically the range of the variables (we plan to use a static analyzer in the future).

  • Constraints generation: It implements the forward and backward error analysis transfer functions of Sect. 4, whose main semantics are detailed in [11]. In addition to the accuracy variables \(acc_F(\ell )\), \(acc_B(\ell )\) and \(acc(\ell )\) assigned to each label \(\ell \) (defined in Sect. 5.2), we add new constraints relative to the ulp and the precision of the error in order to compute correctly the function \(\iota \) discussed in Sect. 4.2.

  • Constraints resolution: First, we call the Z3 SMT solver [6] to find a solution to our constraints, and we implement a cost function (see Sect. 6) to refine the solutions obtained in terms of optimality. In future work, we will explore a new resolution method based on policy iterations [8]. Concerning the complexity of the analysis performed by POP, in practice the analysis is carried out by the SMT solver which solves the constraints. The number of variables and constraints is linear in the size of the program; the complexity of analyzing a program of size n is then equivalent to that of solving a system of n constraints in our constraint language (by the solver).

5.2 Simple Imperative Language of Constraints

In order to explain the constraints generation, we introduce the simple imperative language of Fig. 4. We assign to each element of our language (expression, boolean expression and statement) a unique label \(\ell \in Lab\), with the intention of identifying each node of the syntactic tree without ambiguity. The same strategy as in [11] is adopted: the statement \(\mathtt {require\_accuracy(x, n)^\ell }\) denotes the accuracy that x must have at the control point \(\ell \). We then assign to each control point \(\ell \) three integer variables corresponding to the forward, backward and final accuracies, so that the inequality in Eq. (27) is verified. Indeed, in the forward mode the accuracy decreases, contrarily to the backward mode, where the accuracy increases when we strengthen the post-conditions.

$$\begin{aligned} 0 \le acc_B(\ell ) \le acc(\ell ) \le acc_F(\ell ) \end{aligned}$$
(27)
Fig. 3. POP mixed-precision analysis architecture

Fig. 4. Simple imperative language of constraints

6 Experimental Results

In this section, we evaluate the performance of POP, which generates the constraints defined in Sect. 4.2 and calls the Z3 SMT solver in order to obtain a solution. The solutions returned by Z3 are not unique, because Z3 is a solver and not an optimizer. To overcome this limitation, we add to our global system of constraints an additional constraint related to a cost function \(\phi \) (we take the same definition as in [11]). The cost function \(\phi (c)\) of a given program c computes the sum of the accuracies of all the variables and intermediary values collected at each label of the arithmetic expressions, as shown in Eq. (28).

$$\begin{aligned} \phi (c) = \sum _{x \in Id, \ell \in Lab}acc(x^\ell ) + \sum _{ \ell \in Lab} acc(\ell ) \end{aligned}$$
(28)

Then, our tool searches for the smallest integer P such that our system of constraints admits a solution. We start the binary search with P \(\in \) [0, 52 \(\times \) n], where all the values are in double precision and where n is the number of terms in Eq. (28). While a solution is found for a given value of P, a new iteration of the binary search is run with a smaller value of P; when the solver fails for some P, a new iteration is run with a larger P, and this process is repeated until convergence. We ran our precision-tuning analysis on programs that perform sum and product operations only (for now) to show the performance of the forward and backward analysis described in Sect. 4.2; note that these operations are widely used in embedded systems, graphic processing, finance, etc. We consider two examples, a rotation matrix-vector multiplication and the computation of the determinant of 3 \(\times \) 3 matrices, and we present in Fig. 5 some measures of the efficiency of our analysis on these two examples. We assume that, in the original programs of our examples, all the variables are in double precision.
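The binary search driving the solver can be sketched with the Z3 Python bindings as follows; the constraint set and the integer accuracy variables are assumed to have been generated beforehand, and the function and parameter names are ours, not POP's.

```python
from z3 import Solver, Sum, sat

def tune_precisions(constraints, acc_vars, bits_per_term=52):
    """Binary search for the smallest accuracy budget P keeping the system satisfiable."""
    phi = Sum(acc_vars)                           # cost function of Eq. (28)
    lo, hi, best = 0, bits_per_term * len(acc_vars), None
    while lo <= hi:
        mid = (lo + hi) // 2
        s = Solver()
        s.add(constraints)                        # constraints of Sect. 4.2
        s.add(phi <= mid)                         # additional cost constraint
        if s.check() == sat:
            best, hi = s.model(), mid - 1         # feasible: try a smaller budget
        else:
            lo = mid + 1                          # infeasible: relax the budget
    return best                                   # model found for the minimal budget
```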

Rotation Matrix-Vector Multiplication

Our first example consists of a rotation matrix R, used to rotate vectors and tensors while the coordinate system remains fixed. For instance, we want to rotate a vector around the z axis by an angle \(\theta \). The rotation matrix and the rotated column vector are given by:

$$\begin{aligned} \begin{bmatrix} \cos \theta &{} -\sin \theta &{} 0\\ \sin \theta &{} \cos \theta &{} 0\\ 0 &{} 0 &{} 1 \end{bmatrix} \begin{bmatrix} x\\ y\\ z \end{bmatrix} = \begin{bmatrix} x'\\ y'\\ z' \end{bmatrix} \end{aligned}$$

With this experiment, we aim to measure the performance of POP for different angles of rotation (\(\frac{\pi }{3}\), \(\frac{\pi }{4}\) and \(\frac{\pi }{6}\)), for a variety of input vectors chosen with different magnitudes, \(A = [1.0, 2.0 ,3.0]\), \(B = [10.0, 100.0, 500.0]\), \(C = [100.0, 500.0, 1000.0]\), \(D = [-100.0, -10.0, 1000.0]\), \(E = [1.0, 2.0, 500.0]\) and \(F = [1.0, 500.0, 10000.0]\), and for different user accuracy requirements: 10, 15, 20, 25, 30 and 35. This example generates 858 constraints and 642 variables, which are easily manageable by the Z3 solver. Starting from 10335 bits for the original program (all variables in double precision), Fig. 5c shows that the improvement in the number of bits needed to fulfil the user requirements, compared to the initial number of bits, ranges from 38% to 87%, which confirms the usefulness of our analysis. We can also observe in Fig. 5e that the majority of variables fit in the single precision format for an accuracy requirement \(\le \) 35, and that no double precision variables remain for vectors A, B, C, D and E for an accuracy requirement of 15. For this example, we found that varying the angle of rotation has no impact on the number of double precision variables after analysis; this is why we only report the angle \(\frac{\pi }{4}\) in Fig. 5e, while modifying the magnitude of the vectors at every turn. Besides, POP assigns an accuracy of zero to the variables that are not used by the program.

Fig. 5. Measures of the efficiency of the analysis on the two input examples: execution time, optimization of the number of bits of the transformed programs compared to the original ones, and percentage of double precision variables remaining after analysis.

Determinant of 3 \(\times \) 3 Matrices

Our second example computes the determinant det(M) of 3 \(\times \) 3 matrices M1, M2 and M3.

The matrix coefficients belong to multiple magnitude ranges. With 686 variables and 993 generated constraints, POP finds the minimal precision of the inputs and intermediary results of this example in less than 0.3 s, as observed in Fig. 5b (time for the resolution of the system of constraints and the calls to the Z3 SMT solver done by binary search only), for different accuracy requirements. As for our first example, the reduction of the final number of bits of the transformed program, compared to the 9964 initial bits, is considerable, as shown in Fig. 5d. Finally, we notice that our analysis succeeded in turning almost all the double precision variables into single precision ones for an accuracy requirement \(\le \) 20.

7 Conclusions and Future Work

In this article, we have introduced POP, an automated precision tuning tool for floating-point programs that computes the minimal number of bits needed for the variables and intermediary results in order to fulfil the accuracy required by the user. We have explained in detail our forward and backward static analysis, done by abstract interpretation. Moreover, we have shown that our analysis can be expressed as a set of constraints made of propositional logic formulas and relations between affine expressions over integers, which can be easily checked by an SMT solver. Our approach can be extended to other language structures, in particular arrays and functions. Besides, we have considered that a range determination is performed by dynamic analysis on the variables of our programs and that no overflow arises during our analysis; from now on, we would like to adopt a static analyzer in order to infer safe ranges for our variables.

In future work, we would like to explore the policy iteration method [8] as a replacement for the non-optimizing solver (Z3) coupled with the binary search used in this article. We aim to apply the policy iteration method to improve the accuracy. The principle consists in transforming all the generated constraints into the form of a min-max of discrete affine maps. Further, it will be interesting to feed the policy iteration with the Z3 solution as an initial policy and to compare the solutions of these two methods in terms of execution time and optimality. Moreover, our goal is to validate experimentally our tool on codes from various fields, including safety-critical systems such as control systems for vehicles, medical equipment and industrial plants. We are also currently exploring precision tuning in a new, unexplored domain, the Internet of Things: problems of energy consumption and memory saving are widespread in this area, and this is why we are working on tuning the precision of the basic building blocks of common IoT items such as accelerometers and gyroscopes. Finally, comparing our tool to other existing tools in terms of analysis time and of quality of the solution is a challenge we plan to examine.