
1 Introduction

The increasingly urgent demand for reliable software has led to tremendous advances in automatic program analysis and verification [4, 5, 6, 8, 18]. However, these techniques have typically focused on integer programs, and do not apply to the floating-point computations we depend on for safety-critical control in avionics or medical devices, nor to the analyses carried out by scientific and computer-aided design applications. In these contexts, floating-point accuracy is critical, since subtle rounding errors can lead to significant discrepancies between floating-point results and the real results developers expect. Indeed, floating-point arithmetic is notoriously unintuitive, and its sensitivity to roundoff errors makes such computations fiendishly difficult to debug. Traditionally, such errors have been addressed by numerical methods experts who manually analyze and rewrite floating-point code to ensure accuracy and stability. However, these manual techniques are difficult to apply and typically do not produce independently checkable certificates guaranteeing program accuracy.

The research community has responded to these challenges by developing a rich array of automated techniques that provide guaranteed bounds on the accuracy of floating-point computations or attempt to automatically improve accuracy. For example, Fluctuat [12, 13], used in many companies, performs static analysis of C programs to prove a bound on the rounding errors introduced by the use of floating-point numbers instead of reals. Fluctuat also helps users debug floating-point errors by detecting the operations responsible for significant precision loss. Salsa [10] automatically improves the numerical accuracy of programs by using an abstract interpretation to guide transformations that minimize the errors arising during computations. Herbie [21] uses a heuristic search to improve the numerical accuracy of an arithmetic expression: it estimates and localizes roundoff errors using sampled points, applies a set of rewrite rules to improve accuracy, and combines the resulting improvements across different input domains. Rosa [11] combines an exact SMT solver on reals with sound affine arithmetic to verify accuracy post-conditions from assertions about the accuracy of inputs; it can guarantee that the desired precision is soundly obtained in a finite-precision implementation, including propagated error. Finally, FPTaylor [24] improves on interval arithmetic by using Taylor series to narrow the computed error bounds.

As the number of tools dedicated to analyzing and improving numerical accuracy grows, it becomes increasingly difficult to make fair comparisons between techniques: each tool targets a slightly different domain, uses a slightly different format for expressing benchmarks, and reports results using related but distinct measures. Furthermore, without a standard set of floating-point benchmarks, it is difficult to identify opportunities for composing complementary tools.

To address these challenges, the floating-point research community needs a standard benchmark format and a common set of measures that enable comparison and cooperation between tools. This goal is motivated by the success of standard benchmark suites like SPEC [17] and SPLASH-2 [25] in the compiler community, the DIMACS [1] format in the SAT-solving community, and the SMT-LIB [3] format in the SMT-solving community. These formats have enabled fair comparisons between tools, crisp characterizations of the tradeoffs between different approaches, and useful cooperation between tools with complementary strengths.

In this article, we propose FPBench, a general floating-point format and benchmark suite. FPBench describes a common format and a suite of accuracy measures; we envision floating-point tools taking in FPBench-formatted programs and using the FPBench accuracy measures to describe accuracy. This allows users to combine tools that perform complementary tasks and to compare competing tools in order to choose the one best suited to their task. The common scientific methodology FPBench enables is crucial for demonstrating each tool's improvements over the state of the art.

The main contributions of this article are the following:

  (i) A uniform format for expressing floating-point benchmarks, FPCore.

  (ii) A set of utilities for converting to and from FPCore programs, and for working with FPCore programs.

  (iii) A set of measures on which to evaluate various floating-point tools on FPBench benchmarks.

  (iv) An initial suite of benchmarks drawn from the existing floating-point literature.

The remainder of this article is organized as follows. Section 2 describes the FPBench formats. Section 3 describes the accuracy measures. Section 4 describes the utilities FPBench provides to support creating and working with benchmarks. Section 5 surveys our existing benchmark suite, highlighting representative case studies from recent tools in the literature. Finally, Sect. 6 discusses future work and concludes.

2 Benchmark Format

FPBench provides a common input format for floating-point tools. This format makes it possible to develop a collection of floating-point benchmarks common to our research community, and also allows users to compose floating-point tools. A common floating-point benchmark format must therefore be easy to parse, have simple and clear semantics, support floating-point details, and be expressive enough for many domains. To satisfy these requirements, FPBench provides the FPCore format, a minimal expression-based language for floating-point computations. A common benchmark format must also be easy to translate to and from popular industrial languages like C, C++, Matlab, and Fortran. To satisfy this requirement, FPBench also provides the extended FPImp format, a basic imperative language for floating-point computations which can be automatically compiled to FPCore.

2.1 FPCore

FPCore is an S-expression format featuring mathematical functions, if expressions, and while loops. All floating-point functions from C11's math.h and all Fortran 2003 intrinsics are supported operators, as are standard arithmetic operators like addition and comparison; likewise, all constants defined in C11's math.h are available as constants. Following IEEE 754 and common C and Fortran runtimes, FPCore does not prescribe the accuracy of built-in mathematical functions. However, individual benchmarks can declare the accuracy they assume for built-in operations, which analyzers can take into account.

An FPCore benchmark specifies a set of inputs, a floating-point expression, and optional metadata flags such as a name, citations, preconditions on inputs, and the floating-point precision (binary32, binary64, ...) used to evaluate that benchmark. The full FPCore syntax is as follows:

[Figure a: the FPCore grammar.]
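As a simplified sketch of the grammar's shape (the authoritative grammar lives with the FPBench documentation, so details may differ):

    benchmark ::= (FPCore (symbol*) property* expr)
    expr      ::= constant
                | symbol
                | (operation expr+)
                | (if expr expr expr)
                | (let ([symbol expr]*) expr)
                | (while expr ([symbol expr expr]*) expr)
    property  ::= :symbol value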

Since the language is S-expression based, parentheses and braces are literals. The semantics of these programs is ordinary function evaluation, with let bindings evaluated simultaneously and while loops evaluated by simultaneously updating the loop variables as long as the condition holds, and then evaluating the return value:

[Figure b: the evaluation semantics of FPCore programs.]
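For instance, in the following benchmark (a hypothetical example, not drawn from the suite), the loop variables i and s are updated simultaneously, so the update to s always sees the value of i from the current iteration:

    (FPCore (n)
     :name "Harmonic sum (illustrative)"
     (while (<= i n)
       ([i 1.0 (+ i 1.0)]
        [s 0.0 (+ s (/ 1.0 i))])
       s))

When the condition fails, the loop stops and the benchmark evaluates its return value, here s.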

The optional properties are used to record additional information about each benchmark. Benchmarks are currently annotated with a :name, a :description of the benchmark and its inputs, the floating-point :type (either binary32 or binary64), a precondition :pre, and a citation :cite. All FPBench tools ignore unknown properties, so they provide an easy way to record additional benchmark information. We recommend that properties specific to a single tool be prefixed with the name of the tool.

2.2 FPImp

Where FPCore is the format consumed by tools, FPImp is a format for simplifying the translation from imperative languages to FPCore. While FPCore is a functional language with a minimal set of features for representing floating-point computations, FPImp includes common imperative features like variable assignments, multiple return values, and multi-way if statements. More complex features, such as arrays, pointers, records, and recursive function calls, are left to future language extensions. In our experience, translating C, Fortran, and Matlab to FPImp is relatively easy. FPImp has expressions similar to those in FPCore, but without if, while, and let expressions; these are instead replaced with statements. The FPImp syntax is as follows:

[Figure c: the FPImp syntax.]

As in FPCore, each FPImp benchmark includes free parameters and properties; as an imperative language, it includes a list of program statements and multiple return values instead of a single body expression. Assignments and while loops behave in the usual way. Expressions in FPImp are evaluated similarly to FPCore expressions. An if statement defines a many-way branch; any else branch must be the last branch in its if. The output statement can return several values and appears at the end of a function.

The FPImp to FPCore compiler translates FPImp functions to FPCore benchmarks while retaining all properties and keeping the same set of free parameters. It inlines assignments, converts the imperative bodies of FPImp loops to the simultaneous-assignment loops of FPCore, replaces many-way if statements with nested ifs, and generates multiple FPCore benchmarks for FPImp programs with multiple outputs.
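As a small illustration of this compilation, consider the following hypothetical pair of programs; we assume a (= var expr) assignment form and an (output ...) return for FPImp, so the concrete syntax may differ. The imperative loop

    (FPImp (n)
     (= s 0.0)
     (= k 1.0)
     (while (<= k n)
       (= s (+ s (* k k)))
       (= k (+ k 1.0)))
     (output s))

would compile to a single FPCore benchmark whose loop updates s and k simultaneously:

    (FPCore (n)
     (while (<= k n)
       ([s 0.0 (+ s (* k k))]
        [k 1.0 (+ k 1.0)])
       s))

Because s is assigned before k in the imperative body, its simultaneous update correctly reads the previous value of k.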

3 Accuracy Measurements

To compare floating-point tools, a common input format is not enough: a common measure of accuracy is also necessary. FPBench thus defines a collection of accuracy measures to allow tools to rigorously describe the accuracy measure they use. Given the diversity of accuracy measures in the literature, standardizing on a single accuracy measure would be difficult, and could harm the development of some classes of tools. Instead, FPBench standardizes several measures of accuracy; tools that measure accuracy should state which of the FPBench accuracy measures they use to compare tools.

3.1 Measurement Axes

Floating-point accuracy is best analyzed along several axes: scaling vs. non-scaling error, forward vs. backward error, and maximum vs. average error. Tools that measure error may use sound or statistical techniques, and tools that transform programs have several options for how to measure accuracy improvement.

Scaling vs. non-scaling error (\(\varepsilon \)). There are several ways to measure the error of producing the inaccurate value \(\hat{x}\) instead of the true value \(x\). Two common mathematical notions are the absolute and relative error:

$$\begin{aligned} \varepsilon _{\text {abs}}(x, \hat{x}) = \left| x - \hat{x}\right| \quad \text {and} \quad \varepsilon _{\text {rel}}(x, \hat{x}) = \left| \frac{x - \hat{x}}{x}\right| \end{aligned}$$

Relative error scales with the quantity being measured, and thus makes sense for measuring both large and small numbers, much like the floating-point format itself. A notion of error more closely tied to the floating-point format is the Units in the Last Place (ULPs) difference. Some tools instead use the logarithm of the ULPs, which approximately describes the number of incorrect low-order bits in \(\hat{x}\). These two measures are defined as:

$$\begin{aligned} \varepsilon _{\text {ulps}}(x, \hat{x}) = |\mathbb {F} \cap [\min (x, \hat{x}), \max (x, \hat{x})]| \quad \text {and} \quad \varepsilon _{\text {bits}}(x, \hat{x}) = \log _2 \varepsilon _{\text {ulps}}(x, \hat{x}) \end{aligned}$$

The floating-point numbers are distributed roughly exponentially, so this measure of error scales in a similar manner to relative error; however, it is better-behaved in the presence of denormal numbers and infinities.
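For example, in binary64 the floating-point numbers just above 1 are spaced \(2^{-52}\) apart, so for \(x = 1\) and \(\hat{x} = 1 + 2^{-44}\):

$$\begin{aligned} \varepsilon _{\text {ulps}}(x, \hat{x}) \approx \frac{2^{-44}}{2^{-52}} = 2^{8} \quad \text {and} \quad \varepsilon _{\text {bits}}(x, \hat{x}) \approx 8, \end{aligned}$$

that is, roughly the eight low-order bits of \(\hat{x}\) are incorrect.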

Forward vs. backward error (\(\epsilon \)). Forward error and backward error are two common measures for the error of a mathematical computation. For a true function \(f\) and its approximation \(\hat{f}\), forward error measures the difference between outputs for a fixed input, while backward error measures the difference between inputs for a fixed output. Formally,

$$\begin{aligned} \epsilon _{\text {fwd}}(f, \hat{f}, x)&= \varepsilon (f(x), \hat{f}(x)) \\ \epsilon _{\text {bwd}}(f, \hat{f}, x)&= \min \left\{ \varepsilon (x, x') : x' \in \mathbb {F}^n \wedge \hat{f}(x') = f(x) \right\} \!. \end{aligned}$$

Backward error is important for evaluating the stability of an algorithm, and in scientific applications where multiple sources of error (algorithmic error vs. sensor error) must be compared. Existing tools typically measure forward error because backward error can be tricky to compute for floating-point computations, where there may not be an input that produces the true output.
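As a concrete example, suppose \(\hat{f}\) approximates \(f(x) = \sqrt{x}\) and \(\hat{f}(4) = 2.001\). The absolute forward error at \(x = 4\) is \(|2 - 2.001| = 0.001\), while the backward error is the distance from 4 to the nearest \(x'\) with \(\hat{f}(x') = 2\); if \(\hat{f}\) overestimates \(\sqrt{x}\) by about 0.001 near 4, that \(x'\) is roughly \(1.999^2 \approx 3.996\), giving a backward error of about 0.004.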

Average vs. maximum error (\(E\)). Describing the error of a floating-point computation means summarizing its behavior across multiple inputs. Existing tools use either maximum or average error for this task. Formally,

$$\begin{aligned} E_{\text {max}}(f, \hat{f}) = \max \left\{ \epsilon (f, \hat{f}, x) : x \in \mathbb {F}^n\right\} \quad \text {and} \quad E_{\text {avg}}(f, \hat{f}) = \frac{1}{N} \sum _{x\in \mathbb {F}^n} \epsilon (f, \hat{f}, x), \end{aligned}$$

where \(N = |\mathbb {F}^n|\) is the number of valid inputs.

Worst-case error tends to be easier to measure soundly, while average error tends to be easier to measure statistically.

Sound vs. statistical techniques. Running a floating-point program on all valid inputs is intractable. Existing tools either soundly overapproximate the error using static analysis or approximate the error using statistical sampling.

Most static techniques are based on interval or affine arithmetic to over-approximate floating-point arithmetic, often using abstract interpretation. Abstract interpretation may use either non-relational [20] or relational abstract domains [2, 7, 14], and may use acceleration techniques (widenings [9]) to over-approximate loops without unrolling them. While such techniques tend to provide loose over-approximations of the floating-point error of programs, they are fast and provide sound error bounds. In some embedded applications, correctness is critical and unsound techniques will not do.
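To see why sound bounds can be loose, note that plain interval arithmetic evaluates \(x - x\) for \(x \in [1, 2]\) to \([1,2] - [1,2] = [-1, 1]\), even though the true result is always 0; relational domains recover some of this lost precision by tracking correlations between variables.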

In domains where correctness is not absolutely critical, sampling can provide tight approximations of error. Many sampling techniques are used, including naive random samples [21] and Markov-chain Monte Carlo [23]. These techniques involve running a program multiple times, so tend to be slower than static analysis.

Measuring improvement (\(\iota \)). Tools that transform floating-point programs need to compare the accuracy of two floating-point programs: the original and the transformed. Several comparison measures are possible. Comparisons can use the improvement in worst-case or average error between the original implementation \(\hat{f}\) and the improved implementation \(\hat{f}'\) of the same mathematical function \(f\):

$$\begin{aligned} \iota _{\text {imp}}(\hat{f}, \hat{f}') = E(f, \hat{f}) - E(f, \hat{f}') \end{aligned}$$

However, one cannot usually improve the accuracy of a computation simultaneously on all inputs. It is thus often necessary to make a computation less accurate on some points to make it more accurate overall. In this case, it may be useful to report the largest unimprovement, which measures the cost of improving accuracy:

$$\begin{aligned} \iota _{\text {wrs}}(\hat{f},\hat{f}') = \max \left\{ \epsilon (f, \hat{f}', x) - \epsilon (f, \hat{f}, x) \ :\ x\in \mathbb {F}^n\right\} \end{aligned}$$

Other measures, such as those describing the trade-off between accuracy and speed, are also interesting, but are less commonly used in the literature and thus not standardized in FPBench. Improvement tools could also estimate their effect on numerical stability using automatic differentiation [15] or Lyapunov exponents [22], but we do not know of any such tools.

3.2 Existing Tools

These measurement axes can be used to categorize the error measures of existing tools. Table 1 compares Fluctuat [12], FPTaylor [24], Herbie [21], Rosa [11], and Salsa [10].

Table 1. A comparison of how five floating-point tools measure error, across the axes identified in this section.

Fluctuat, FPTaylor, and Rosa all verify error bounds on the accuracy of floating-point computations. Given their need for soundness, it is natural that they use sound error analyses and estimate maximum error. Their use of absolute forward error derives from the difficulty of statically approximating the other forms of error. Herbie and Salsa are tools for improving the accuracy of programs, but differ dramatically in their approach. Salsa uses abstract interpretation to bound maximum absolute error, producing a sound overapproximation of the maximum error. Herbie, on the other hand, uses random sampling to achieve a tight statistical approximation of bits error. The tight estimates enabled by statistical techniques provide additional opportunities for Herbie to improve the accuracy of an expression, but prevent it from providing sound error bounds. Finally, STOKE [23] uses stochastic search to optimize floating-point programs, and must compare the accuracy of candidate programs to avoid significantly compromising it. STOKE uses Markov-chain Monte Carlo sampling to statistically evaluate maximum ULPs error.

By exactly describing the way each tool measures accuracy, FPBench makes it possible to compare and relate tools. Unsound tools such as Herbie or STOKE can be composed with a sound verification tool to produce an accuracy guarantee, and this guarantee can be compared to the approximate error measurements those tools made statistically. Since Fluctuat, FPTaylor, Rosa, and Salsa all soundly measure maximum forward absolute error, they can be compared to determine which technique is best.

4 Tools

FPBench features a collection of compilers and measurement tools that operate on its common format, FPCore. These tools can serve as a community resource, increasing interoperability as well as code reuse. They also make it easier to write new floating-point analysis and transformation tools by automating what are currently common but tedious tasks.

FPImp to FPCore. The FPCore format faithfully preserves important program constructs, such as variable binding and operation ordering, while abstracting away details not relevant to floating-point semantics. However, it is syntactically very different from some of the languages from which benchmarks might originate. To make translation to FPCore from source languages like C, Fortran, and Matlab easier, FPBench provides the FPImp format and a compiler from FPImp to FPCore. FPImp is syntactically similar to imperative languages to make it easy to translate benchmarks.

FPCore to C. Since C is a common implementation language for mathematical computations, FPBench provides an FPCore to C compiler, which can be used to run FPCore benchmarks through the many available C analysis tools.

Average error estimation. FPBench provides a tool to statistically approximate average error using naive sampling. The statistical approach is necessary to produce accurate estimates of average error given the current state of the art. The tool can use absolute, relative, ULPs, and bits error.
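Concretely, naive sampling draws \(S\) valid inputs \(x_1, \ldots , x_S\) uniformly at random and reports the sample mean of the chosen pointwise error; this is a sketch of the general technique rather than the tool's exact procedure:

$$\begin{aligned} \hat{E}_{\text {avg}}(f, \hat{f}) = \frac{1}{S} \sum _{i=1}^{S} \epsilon (f, \hat{f}, x_i) \end{aligned}$$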

We plan to continue developing community tools around the FPBench formats, especially tools for estimating the other measures of error described in Sect. 3.

5 Benchmark Suite and Examples

The FPBench suite currently includes 44 benchmarks sourced from recent papers on automatic floating-point verification and accuracy improvement. This section first summarizes these benchmarks and then details how representative examples were translated to FPBench from the input formats of various tools in the literature.

The current FPBench suite contains examples from a variety of domains, including 28 from the Herbie test suite [21], 9 from the Salsa test suite [10], 7 from the Rosa test suite [11], and one example from the FPTaylor test suite [24]. These examples range from simple test programs for early tool development up to large examples for evaluating more mature tools. The larger examples are more challenging, including loops with up to 13 variables mutated in the loop body. As shown in Tables 2 and 3, these programs exercise the full range of functionality available in FPBench, and span a variety of application areas, from control software to mathematical libraries. We intend to continue adding benchmarks to the FPBench suite.

Table 2. Functions and language features used in the FPBench benchmarks. Benchmarks contain a variety of features, and many benchmarks incorporate several. Exponential functions include logarithms, the exponential function, and the power function.
Table 3. Domains from which the FPBench benchmarks are drawn. Most are general mathematical operations, useful in a variety of domains. The general expressions are the smallest, and are drawn from Numerical Methods for Scientists and Engineers [16] and Rosa [11].

5.1 FPTaylor

FPTaylor [24] uses series expansions and interval arithmetic to compute sound error bounds. The authors gave the following simple program as an example input for their tool:

[Figure d: the FPTaylor input program.]

This program is representative of the code necessary to correct sensor data in control software, where the output of the sensor is known to be between 1.001 and 2.0. We manually translated this program to FPCore, yielding:

[Figure e: the FPCore translation of the FPTaylor example.]
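As a sketch of what this translation looks like, assuming the t_div_t1 expression from the FPTaylor paper (the citation key is illustrative):

    (FPCore (x y)
     :name "t_div_t1"
     :cite (solovyev-et-al-2015)
     :type binary64
     :pre (and (<= 1.001 x 2.0) (<= 1.001 y 2.0))
     (let ([t (* x y)])
       (/ t (+ t 1.0))))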

The benchmark takes inputs x and y and uses a let expression to represent the intermediate variables from the FPTaylor example. FPCore faithfully preserves important program constructs, such as variable binding and operation ordering, making the translation a simple matter. The benchmark additionally specifies a name and cites its source using a key into a standard bibliography file. Since the original program uses 64-bit floating-point numbers, the type binary64 is specified in the benchmark. The constraints on the input variables are translated to a single predicate under the :pre property.

5.2 Rosa

Rosa [11] soundly verifies error bounds of floating-point programs, including looping control flow through recursive function calls. Several benchmarks in its repository demonstrate this capability, including one that uses Newton’s method on a series representation of the sine function. In Rosa’s input language the original benchmark is represented as:

[Figure g: the original Rosa benchmark.]

The require clause denotes input preconditions, and the ensures clause provides the error bound to be verified. We manually translated this program to FPCore, yielding:

[Figure h: the FPCore translation of the Rosa benchmark.]
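A sketch of the translation follows, assuming the usual seventh-order sine series and its derivative, a fixed iteration bound, and an informal rendering of the post-condition over the result; all details may differ from the actual benchmark:

    (FPCore (x0)
     :name "Newton iteration on a sine series"
     :cite (darulova-kuncak-2014)
     :pre (and (< -1.0 x0) (< x0 1.0))
     :rosa-post (and (< -1.0 RES) (< RES 1.0))
     (while (< i 10.0)
       ([i 0.0 (+ i 1.0)]
        [x x0 (- x (/ (+ (- x (/ (pow x 3.0) 6.0))
                         (- (/ (pow x 5.0) 120.0) (/ (pow x 7.0) 5040.0)))
                      (+ (- 1.0 (/ (* x x) 2.0))
                         (- (/ (pow x 4.0) 24.0) (/ (pow x 6.0) 720.0)))))])
       x))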

Like the FPTaylor benchmark, this benchmark includes a name, a citation for its source, and a precondition on inputs. For completeness, we have also included the ensuring annotation, denoted :rosa-post, demonstrating the ability to add tool-specific annotations by prefixing them with the tool name. Note how the while construct from FPCore can be used to represent the tail-recursive loop from the original Rosa benchmark.

5.3 Herbie

Herbie [21] heuristically improves the accuracy of straight-line floating-point expressions. The authors demonstrate the improvements Herbie can produce using the quadratic formula for computing the roots of a second-degree polynomial, which has uses ranging from calculating trajectories to solving matrix equations. In mathematical notation, the quadratic formula is given by:

$$\begin{aligned} \frac{(- b) - \sqrt{b^2 - 4ac}}{2a} \end{aligned}$$

Herbie produces the following more-accurate variant:

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{4ac}{-b + \sqrt{b^2 - 4ac}}/2a &{} \mathbf {if}\, b< 0 \\[9pt] \left( -b - \sqrt{b^2 - 4ac}\right) \frac{1}{2a} &{} \mathbf {if}\, 0 \le b \le 10^{127} \\[5pt] -\frac{b}{a} + \frac{c}{b} &{} \mathbf {if}\, 10^{127} < b \end{array}\right. } \end{aligned}$$

In FPCore format, the original formula is represented as:

[Figure i: the original quadratic formula in FPCore.]
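A direct transcription of the formula above, with the precondition described below (metadata abbreviated), would read:

    (FPCore (a b c)
     :name "Quadratic formula"
     :pre (and (>= (- (* b b) (* 4.0 (* a c))) 0.0) (!= a 0.0))
     (/ (- (- b) (sqrt (- (* b b) (* 4.0 (* a c)))))
        (* 2.0 a)))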

and the improved version is represented as:

[Figure j: Herbie's improved variant in FPCore.]
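Again as a sketch, the three-regime variant above transcribes to:

    (FPCore (a b c)
     :name "Quadratic formula (Herbie output)"
     :pre (and (>= (- (* b b) (* 4.0 (* a c))) 0.0) (!= a 0.0))
     (if (< b 0.0)
         (/ (/ (* 4.0 (* a c))
               (+ (- b) (sqrt (- (* b b) (* 4.0 (* a c))))))
            (* 2.0 a))
         (if (<= b 1e127)
             (* (- (- b) (sqrt (- (* b b) (* 4.0 (* a c)))))
                (/ 1.0 (* 2.0 a)))
             (+ (- (/ b a)) (/ c b)))))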

Like the Rosa and FPTaylor examples, this benchmark gives a name, citation, and precondition. This benchmark uses FPCore's ability to write arbitrary boolean expressions as preconditions, restricting the value under the square root to be non-negative and requiring that the denominator be non-zero. Unlike the FPTaylor and Rosa examples, this precondition arises from mathematical considerations, not domain knowledge. The FPCore version of Herbie's output furthermore uses if constructs to evaluate different expressions for different inputs, which improves accuracy. Herbie could add additional metadata to the output, such as its internal estimate of accuracy or the number of expressions considered during search, using prefixed keys like :herbie-accuracy-estimate.

5.4 Salsa

Salsa [10] is a tool for soundly improving the worst-case accuracy of programs. The authors evaluate Salsa on a suite of control and numerical algorithms, including the widely used PID controller algorithm. This algorithm is used in aeronautic and avionic systems for which correctness is critical. In C, the benchmark is written as:

[Figure k: the PID controller in C.]

To ease the conversion of this code from C to FPCore, this program was first manually translated to the following FPImp program:

[Figure l: the PID controller in FPImp.]

The FPImp program was then automatically compiled, using the compiler tool in FPBench, to the following FPCore benchmark:

[Figure m: the compiled FPCore benchmark.]
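The compiled benchmark's shape can be sketched as follows, with illustrative gains, time step, and stopping condition; the actual benchmark's constants and metadata differ in detail:

    (FPCore (m0 c)
     :name "PID"
     :description "PID controller driving a measure toward a setpoint c"
     :type binary64
     (while (< t 100.0)
       ([m m0 (let ([e (- c m)])
                (let ([p (* 9.4514 e)]
                      [i2 (+ i (* (* 0.69006 0.2) e))]
                      [d (* (* 2.8454 5.0) (- e eold))])
                  (+ m (* 0.01 (+ (+ p i2) d)))))]
        [i 0.0 (+ i (* (* 0.69006 0.2) (- c m)))]
        [eold 0.0 (- c m)]
        [t 0.0 (+ t 0.2)])
       m))

Each of m, i, eold, and t steps simultaneously, so the single imperative assignment to e reappears as a let (or an inlined expression) wherever its value is needed.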

Beyond the metadata used in the FPTaylor, Rosa, and Herbie examples, this benchmark includes a :description tag to describe for readers what the benchmark program computes. These descriptions contain information about the distribution of inputs, the situation in which the benchmark is used, or any other information which might be useful to tool writers. The FPBench suite features descriptions for its most complex benchmarks.

6 Conclusion

The initial work on FPBench provides a foundation for evaluating and comparing floating-point analysis and optimization tools. Already the FPBench format can serve as a common language, allowing these tools to cooperatively analyze and improve floating-point code.

Though FPBench supports the composition of the floating-point tools that exist today, there is still much work to do to support the floating-point research community as it grows. FPBench must be sufficiently expressive for the broad range of applications that represent the future of the community. In the near term, we will add additional metrics for accuracy and performance to the set of evaluators provided by the FPBench tooling and begin developing a standard set of benchmarks around the various measures. We will also expand the set of languages with direct support for compilation to and from the FPBench format. As more tools grow support for FPBench, we will provide automated comparisons of different floating-point tools. Longer term, we intend to add support for mixed-precision benchmarks, fixed-point computations, and additional data structures such as vectors, records, and matrices.

We hope that FPBench encourages the already strong sense of community and collaboration around floating-point research. Toward that end, we encourage any interested readers (and tool writers) to get involved with development of FPBench by signing up for the mailing list and checking out the project website:

http://fpbench.org/.