1 Introduction

Usually floating-point arithmetic over a finite set \(\mathbb {F}\subseteq \mathbb {R}\) is introduced by defining the result of a floating-point operation \(\text{ op }\in \{+,-,*,/\}\) to be \( fl (a\,\text{ op }\, b)\) for \(a,b\in \mathbb {F}\) and a rounding to nearest \( fl : \mathbb {R}\rightarrow \mathbb {F}\). The rounding is unique up to a tie-breaking rule.

More generally, rounding error analysis is often based on the first or the second standard model [4, Sect. 2]. The first standard model of arithmetic assumes

$$\begin{aligned} a,b\in \mathbb {F}: fl (a \,\text{ op }\, b) = (a \,\text{ op }\, b)(1+\delta ),\quad |\delta |\le u,\quad \text{ op }= +,-,*,/ , \end{aligned}$$
(1.1)

while the second standard model assumes

$$\begin{aligned} fl (a \,\text{ op }\, b) = \frac{a \,\text{ op }\, b}{1+\delta },\quad |\delta |\le u . \end{aligned}$$
(1.2)

This is true for all \(a,b\in \mathbb {F}\) as long as no overflow or underflow occurs. The latter is the typical assumption for the analysis of numerical algorithms and also the general assumption for this note. For \(\mathbb {F}\) denoting the set of IEEE 754 binary64 floating-point numbers [5] the relative rounding error unit \(u:=2^{-53}\) is often used for both models.

In the following we assume an arbitrary finite set \(\mathfrak {F}\subseteq \mathbb {R}\) to be given. Without further assumptions we analyze best possible approximation properties of real numbers by elements in \(\mathfrak {F}\). For minimizing the relative error with respect to the true result, i.e. the first standard model, this is clearly rounding to nearest. Minimization with respect to the computed result, i.e., the second standard model, requires a different rounding to be discussed. Finally, we identify the rounding to minimize the error with respect to both standard models simultaneously.

The finiteness of \(\mathfrak {F}\) implies that a small relative error cannot be achieved for real numbers which are very large or very small in absolute value. To state that more precisely we define the normal range Footnote 1 of \(\mathfrak {F}\) by

$$\begin{aligned} \varrho (\mathfrak {F}) := [\min \mathfrak {F}_{<0},\max \mathfrak {F}_{<0}] \cup [\min \mathfrak {F}_{>0},\max \mathfrak {F}_{>0}] \subset \mathbb {R}\end{aligned}$$
(1.3)

and assume \(\varrho (\mathfrak {F})\ne \emptyset \). Zero may be in \(\mathfrak {F}\) or not, but \(0\notin {\varrho }(\mathfrak {F})\) by (1.3). Here \(x>\max \mathfrak {F}_{>0}\) or \(x<\min \mathfrak {F}_{<0}\) corresponds to overflow, and \(\max \mathfrak {F}_{<0} <x< \min \mathfrak {F}_{>0}\) to underflow. Note that \(\varrho (\mathfrak {F})\) consists of all real numbers in one of the two intervals in (1.3).

2 Main results

The smallest possible constant \(u\) for the first standard model (1.1) is characterized by

$$\begin{aligned} \varvec{\alpha }:= \varvec{\alpha }(\mathfrak {F}) = \sup \limits _{x\in {\varrho }(\mathfrak {F})} \min \limits _{f\in \mathfrak {F}} \left| \frac{f-x}{x}\right| . \end{aligned}$$
(2.1)

Note that the minimum is taken over all \(f\in \mathfrak {F}\), whereas the maximum is taken over all x in \(\varrho (\mathfrak {F})\). The constant \(\varvec{\alpha }\) minimizes the relative error with respect to \(x\), i.e., the maximum error of the first standard model (1.1). It is a characteristic constant of a given set \(\mathfrak {F}\).

For a given \(x\in {\varrho }(\mathfrak {F})\), denote a minimizing \(f\in \mathfrak {F}\) in (2.1) by \( fl (x)\). Definition (2.1), the continuity of \(|(f-x)/x|\) and the finiteness of \(\mathfrak {F}\) imply that \( fl (x)\) is uniquely defined for all except finitely many \(x\) in \(\varrho (\mathfrak {F})\). The latter are called the switching points of the otherwise uniquely defined rounding \( fl : {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}\). Any “tie-breaking” method, that chooses a particular \(f\) value at each switching point, defines a particular round-to-nearest function \( fl \). Note that for rounding to nearest the switching points are the arithmetic means of adjacent elements in \(\mathfrak {F}\).

The rounding function can be extended to \( fl : \mathbb {R}\rightarrow \mathfrak {F}\), so that any rounding to nearest \( fl : \mathbb {R}\rightarrow \mathfrak {F}\) with \(| fl (x)-x|=\min \{|f'-x|:f'\in \mathfrak {F}\}\) for all \(x\in \mathbb {R}\) satisfies

$$\begin{aligned} \varvec{\alpha }= \sup \limits _{x\in {\varrho }(\mathfrak {F})} \left| \frac{ fl (x)-x}{x}\right| , \end{aligned}$$
(2.2)

no matter how ties are rounded.

For rounding to nearest \( fl : \mathbb {R}\rightarrow \mathfrak {F}\) the maximum error of the first standard model (1.1) is minimized, whereas the maximum error for the second standard model (1.2) is given by

$$\begin{aligned} \varvec{\beta }:= \varvec{\beta }(\mathfrak {F}) = \sup \limits _{x\in {\varrho }(\mathfrak {F})} \left| \frac{ fl (x)-x}{ fl (x)}\right| . \end{aligned}$$
(2.3)

Since the supremum is taken, the definition of \(\varvec{\beta }\) is independent of a tie-breaking rule. Thus, \(\varvec{\alpha }\) and \(\varvec{\beta }\) are well defined and are characteristic invariants of the set \(\mathfrak {F}\), and they are related as follows.

Lemma 2.1

Let finite \(\mathfrak {F}\subseteq \mathbb {R}\) with \(\varrho (\mathfrak {F})\ne \emptyset \) be given. If the positive and the negative parts of \(\varrho (\mathfrak {F})\) consist of at most one element, then \(\varvec{\alpha }=\varvec{\beta }=0\). Otherwise,

$$\begin{aligned} \varvec{\alpha }= \frac{\varvec{\beta }}{1+\varvec{\beta }} = \max \left\{ \left| \frac{g-f}{g+f}\right| : f,g \,\mathrm{adjacent\, in }\,\varrho (\mathfrak {F}) \,\mathrm{with }\, fg>0 \right\} < 1 \end{aligned}$$
(2.4)

for \(\varvec{\alpha }\) and \(\varvec{\beta }\) defined by (2.1) and (2.3), respectively.

Proof

If the positive and/or negative part of \(\varrho (\mathfrak {F})\) is empty or consists of only one element, there is nothing to prove for that part. Let fixed but arbitrary positive adjacent elements \(f,g\in \mathfrak {F}\) be given with \(0<f<g\). For \(x\in [f,g]\) we have \( fl (x)\in \{f,g\}\), and it is not hard to see that \(| fl (x)-x|/|x|\) is maximized for \(x=m:=(f+g)/2\). This maximum is

$$\begin{aligned} \alpha :=\frac{m-f}{m} = \frac{g-m}{m} = \frac{(g-f)/2}{(g+f)/2} = \frac{g-f}{g+f} . \end{aligned}$$

This implies in particular \(\alpha <1\). For increasing \(x\in [f,m)\) the function \(|f-x|/|f|\) increases, and for increasing \(x\in (m,g]\) the function \(|g-x|/|g|\) decreases. For \(x:=m-\varepsilon \) and sufficiently small \(\varepsilon >0\) we have \( fl (x)=f\) and

$$\begin{aligned} \frac{|f-x|}{|f|} = \frac{(g+f)/2-\varepsilon -f}{f} = \frac{g-f}{2f} - \mathcal {O}(\varepsilon ) . \end{aligned}$$

On the other hand, for \(x:=m+\varepsilon \) and sufficiently small \(\varepsilon >0\) we have \( fl (x)=g\), and therefore

$$\begin{aligned} \frac{|g-x|}{|g|} = \frac{g-(g+f)/2-\varepsilon }{g} < \frac{g-f}{2g} < \frac{g-f}{2f} . \end{aligned}$$

Thus

$$\begin{aligned} \beta := \sup \left\{ \left| \frac{ fl (x)-x}{ fl (x)}\right| : f\le x\le g\right\} = \frac{g-f}{2f} . \end{aligned}$$

Moreover,

$$\begin{aligned} \frac{\alpha }{1-\alpha } = \frac{(g-f)/(g+f)}{(g+f-g+f)/(g+f)} = \frac{g-f}{2f} = \beta \quad \text{ and } \text{ therefore }\quad \alpha =\frac{\beta }{1+\beta }. \end{aligned}$$

A similar argument applies to negative adjacent elements \(f,g\in {\varrho }(\mathfrak {F})\), so that on every such interval \([f,g]\) the maximal \(\alpha \) and \(\beta \) are related by \(\beta =\alpha /(1-\alpha )\). Since \(f,g\) are chosen arbitrarily and since the mapping \(\alpha \rightarrow \alpha /(1-\alpha )\) is increasing for \(0\le \alpha <1\), the proof is finished. \(\square \)

For \(\mathfrak {F}\) denoting the set \(\mathbb {F}\) of IEEE 754 binary64 floating-point numbers barring underflow we may restrict \(x\) in Definition (2.1) to the interval \([1,2)\). Abbreviating \(u:=2^{-53}\), the maximum is achieved for \(x=1+u\), the midpoint of \(1\) and its successor in \(\mathbb {F}\). It follows \(\varvec{\alpha }=u/(1+u)\) for the first standard model, and \(\varvec{\beta }=u\) for the second one, the characteristic constants of \(\mathbb {F}\). In the literature (e.g., [4]) the same constant \(u\) is used for both standard models. This improvement for the first model was noted in [8], see also [6].

Next we ask for a rounding realizing the smallest possible constant \(u\) for the second standard model (1.2). The corresponding characteristic constant is defined by

$$\begin{aligned} \mathbf{w}:= \mathbf{w}(\mathfrak {F}) = \sup \limits _{x\in {\varrho }(\mathfrak {F})} \min \limits _{f\in \mathfrak {F}\backslash \{0\}} \left| \frac{f-x}{f}\right| . \end{aligned}$$
(2.5)

For given \(x\in {\varrho }(\mathfrak {F})\) we first identify \(\varphi \in \mathfrak {F}\) minimizing \(|\varphi -x|/|\varphi |\). If \(x\in \mathfrak {F}\), then clearly it suffices to take \(\varphi :=x\). Otherwise the definition (1.3) of the range implies the existence of adjacent \(f,g\in \mathfrak {F}\cap {\varrho }(\mathfrak {F})\) with \(f<x<g\). The function \(|\varphi -x|/|\varphi |\) decreases on \(\varphi \in (0,x]\) and increases on \(\varphi \in [x,\infty )\), so that \(\varphi \in \{f,g\}\). For increasing \(y\in [f,g]\) the ratio \(|f-y|/|f|\) increases and \(|g-y|/|g|\) decreases. Equality is achieved for \(y:=h(f,g):=2(\frac{1}{f}+\frac{1}{g})^{-1}\), the harmonic mean of \(f\) and \(g\), because

$$\begin{aligned} \left| \frac{f-y}{f}\right| = \frac{2fg/(f+g)-f}{f} = \frac{g-f}{g+f} = \frac{g-2fg/(f+g)}{g} = \left| \frac{g-y}{g}\right| .\qquad \end{aligned}$$
(2.6)

Thus \(\varphi \) minimizing \(|\varphi -x|/|\varphi |\) is \(f\) for \(x\in [f,h(f,g))\), it is \(g\) for \(x\in (h(f,g),g]\), and is either \(f\) or \(g\) if \(x=h(f,g)\). Hence a rounding \(\text{ gl }: {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}\) with the property

$$\begin{aligned} \left| \frac{\text{ gl }(x)-x}{\text{ gl }(x)}\right| = \min \limits _{f\in \mathfrak {F}} \left| \frac{f-x}{f}\right| \end{aligned}$$

is characterized by being a projection and satisfying

$$\begin{aligned} \text{ gl }(x) \; \left\{ \begin{array}{ll} = f &{} \quad \text{ for } x\in [f,h) \\ \in \{f,g\} &{} \quad \text{ for } x=h \\ = g &{} \quad \text{ for } x\in (h,g] \end{array}\right. \end{aligned}$$
(2.7)

for \(x\in [f,g]\) with adjacent \(f,g\in {\varrho }(\mathfrak {F})\) and \(h\) denoting the harmonic mean of \(f\) and \(g\).

For rounding to nearest the switching point is the arithmetic mean of adjacent elements \(f,g\in \mathfrak {F}\), now it is the harmonic mean. In both cases there is freedom about the tie without jeopardizing the minimization properties.

For a rounding \(\text{ gl }: {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}\) with (2.7) we showed

$$\begin{aligned} \mathbf{w}= \mathbf{w}(\mathfrak {F}) = \sup \limits _{x\in \varrho (\mathfrak {F})} \left| \frac{\text{ gl }(x)-x}{\text{ gl }(x)}\right| , \end{aligned}$$
(2.8)

minimizing the error of the second standard model (1.2). Note that \(\mathbf{w}\), just like \(\varvec{\alpha }\), solely depends on \(\mathfrak {F}\). Using the rounding \(\text{ gl }: \varrho (\mathfrak {F})\rightarrow \mathfrak {F}\) with (2.7) to minimize the error in the second standard model (1.2), the maximum error for the first standard model (1.1) is given by

$$\begin{aligned} \mathbf{v}:= \mathbf{v}(\mathfrak {F}) = \sup \limits _{x\in \varrho (\mathfrak {F})} \left| \frac{\text{ gl }(x)-x}{x}\right| . \end{aligned}$$
(2.9)

As before we will show that \(\mathbf{v}\) is invariant for any rounding \(\text{ gl }: \mathbb {R}\rightarrow \mathfrak {F}\) with (2.7). Thus \(\mathbf{v}\) and \(\mathbf{w}\) are well defined and are again characteristic invariants of the set \(\mathfrak {F}\). For any \(\mathfrak {F}\), the invariants \(\varvec{\alpha }, \varvec{\beta }, \mathbf{v}, \mathbf{w}\) are related as follows.

Lemma 2.2

Let finite \(\mathfrak {F}\subseteq \mathbb {R}\) with \(\varrho (\mathfrak {F})\ne \emptyset \) be given. Let \(\mathbf{v}\), \(\mathbf{w}\), \(\varvec{\alpha }\) and \(\varvec{\beta }\) as defined in (2.9), (2.5), (2.1) and (2.3), respectively. If the positive and the negative parts of \(\varrho (\mathfrak {F})\) consist of at most one element, then \(\mathbf{v}=\mathbf{w}=0\). Otherwise,

$$\begin{aligned} \mathbf{v}= \frac{\mathbf{w}}{1-\mathbf{w}} = \max \left\{ \left| \frac{g-f}{2\min \{|f|,|g|\}}\right| : f,g \,\mathrm{adjacent\, in }\,\varrho (\mathfrak {F}) \,\mathrm{with }\, fg>0 \right\} \qquad \quad \end{aligned}$$
(2.10)

and therefore

$$\begin{aligned} \mathbf{v}= \varvec{\beta }\quad \mathrm{and}\quad \mathbf{w}=\varvec{\alpha }<1 . \end{aligned}$$
(2.11)

Proof

If the positive and/or negative part of \(\varrho (\mathfrak {F})\) is empty or consists of only one element, there is nothing to prove for that part. Let fixed but arbitrary positive adjacent elements \(f,g\in \mathfrak {F}\) be given with \(0<f<g\). By (2.6) the ratio \(|\text{ gl }(x)-x|/|\text{ gl }(x)|\) is maximal for \(x\) being the harmonic mean of \(f\) and \(g\). That maximum value is \((g-f)/(g+f)\), so that applying this argument to the negative part of \(\varrho (\mathfrak {F})\) and using (2.4) shows \(\mathbf{w}=\varvec{\alpha }<1\).

The supremum of the ratio \(|\text{ gl }(x)-x|/|x|\) is also achieved for the harmonic mean \(x:=2fg/(f+g)\), and using \(x-f<g-x\) it is equal to

$$\begin{aligned} \frac{g-x}{x} = \frac{g}{2fg/(f+g)} - 1 = \frac{g-f}{2f} = \frac{\alpha }{1-\alpha } \end{aligned}$$

for \(\alpha :=(g-f)/(g+f)\). Applying the same argument to the negative part of \(\varrho (\mathfrak {F})\) and using (2.9) and (2.4) implies

$$\begin{aligned} \mathbf{v}= \sup \left\{ \left| \frac{g-f}{2\min \{|f|,|g|\}}\right| : f,g \,\text{ adjacent } \text{ in }\,\varrho (\mathfrak {F}) \,\text{ with }\, fg>0 \right\} = \frac{\varvec{\alpha }}{1-\varvec{\alpha }} . \end{aligned}$$

A computation using (2.4) yields \(\varvec{\beta }= \varvec{\alpha }/(1-\varvec{\alpha })\), and the results follow. \(\square \)

3 Examples

For a given set \(\mathfrak {F}\subseteq \mathbb {R}\), knowing one of the invariants \(\mathbf{v}\), \(\mathbf{w}\), \(\varvec{\alpha }\) or \(\varvec{\beta }\) means knowing the others. Choosing the appropriate rounding \( fl \) or \(\text{ gl }\), i.e., the switching points being the arithmetic or the harmonic mean, the maximum error for both models is the same.

For any finite set \(\mathfrak {F}\) one invariant can be computed using (2.4) or (2.10), then the other invariants follow. As an example consider a logarithmic number system (LNS)

$$\begin{aligned} \mathfrak {F}:= \{ c^k : k\in \mathbb {Z}, k_1\le k\le k_2\} \quad \text{ for } \quad 1<c\in \mathbb {R}. \end{aligned}$$
(3.1)

Earliest references of this well studied concept include [7, 9, 10]; see [1] and the literature cited over there for recent publications. Assume the integers \(k_1,k_2\) satisfy \(k_1<k_2\) to ensure that \(\varrho (\mathfrak {F})\) has at least two elements. Then \(\varrho (\mathfrak {F})=[c^{k_1},c^{k_2}]\), and (2.10) yields

$$\begin{aligned} \mathbf{v}= \frac{c^{k+1}-c^k}{2c^k} = \frac{c-1}{2}\quad \text{ and }\quad \mathbf{w}= \frac{\mathbf{v}}{1+\mathbf{v}} = \frac{c-1}{c+1} . \end{aligned}$$

These characteristic constants are the same for \(\mathfrak {F}\cup (-\mathfrak {F})\) or for \(\mathfrak {F}\cup \{0\}\cup (-\mathfrak {F})\). The elements could be stored by the integer exponent. If the product or quotient of elements in \(\mathfrak {F}\) is in \(\varrho (\mathfrak {F})\), then they are in \(\mathfrak {F}\). In terms of floating-point operations this means that multiplication and division are error-free if no over—or underflow occurs.

Moreover, those sets bear the advantage that the maximum relative rounding error for the first and for the second standard model is attained on each interval \([f,g]\) of adjacent elements in \(\varrho (\mathfrak {F})\). In contrast, there is a factor of up to \(\beta \) for floating-point systems to base \(\beta \), depending on whether the interval \([f,g]\) is slightly larger or smaller than a power of \(\beta \).

A possible choice is \(c:=2^\varepsilon \) with \(\varepsilon :=2^{-52}\). Spending one bit for the sign, integers up to about \({\pm }2^{62}\) can be stored in \(64\)-bit, so that the positive range is about \([2^{-1024}, 2^{1024}]\), similar to IEEE 754 double precision. The relative rounding error unit is then about \(0.77\times 10^{-16}\) compared to about \(1.11\times 10^{-16}\) in double precision. With this choice of \(c\), powers of \(2\) are in \(\mathfrak {F}\), however, as a disadvantage, \(\mathfrak {F}\) contains no other integers.

Figure 1 shows a typical graph of rounding errors for binary floating-point and for the logarithmic number system (3.1). The left two pictures show the relative rounding error with respect to the first and second standard model for \(p=4\) bits binary precision according to IEEE 754. Typically, the error decreases in steps towards the next power \(2\). The third picture is for a logarithmic number system as in (3.1), where the maximal error at the switching points is constant.

Fig. 1
figure 1

Relative rounding errors for IEEE 754 arithmetic and logarithmic number system

Another example is level-index arithmetic [2, 3]. This interesting approach was introduced to overcome problems with over—or underflow. Basically, \(\mathfrak {F}\) is defined to consist of \(s\cdot \Phi (\ell +x)^{t}\) for \(\ell \in \mathbb {N}_0\), \(s,t\in \{-1,1\}\) and \(x\in \mathcal {M}\) for some finite set \(\mathcal {M}\subseteq [0,1)\), where

$$\begin{aligned} \Phi (x) = \left\{ \begin{array}{ll} x &{} \quad \text{ for } 0 \le x < 1 \\ e^{\Phi (x-1)} &{} \quad \text{ for } x \ge 1 .\end{array}\right. \end{aligned}$$
(3.2)

The integer part \(\ell \) of \(y:=\ell +x\) is called the level of \(y\). Even if the level is limited to small integers, the range of \(\mathfrak {F}\) is huge comprising of numbers very small and very large in magnitude.

This basically solves the problem of over—and underflow, however, it comes with a price concerning the maximum relative error of floating-point operations. For \(\ell \in \mathbb {N}_0\), \(0<x\in \mathbb {R}\) and small \(\varepsilon \) we have

$$\begin{aligned} \Phi (\ell +x+\varepsilon ) = \Phi (\ell +x) \cdot \left( 1+\varepsilon \prod _{\nu =1}^{\ell -1}{\Phi (\nu +x)} \right) + \mathcal {O}(\varepsilon ^2). \end{aligned}$$
(3.3)

That means, the distance between adjacent elements in \(\mathfrak {F}\) increases exponentially. Thus by (3.3), the maximum relative errors \(\varvec{\alpha }\) and \(\varvec{\beta }\), which are characterized by Lemma 2.1, increase exponentially with the level as well. Hence avoiding over—and underflow problems results in a maximal relative error of floating-point operations close to 1. As a consequence, at least the usual way to analyze floating-point algorithms seems hardly possible.

4 Conclusion

From the perspective of a rounding function the situation for binary64 can be summarized as follows. If the switching point for the rounding is set to the arithmetic mean of adjacent elements, that is the usual rounding to nearest, then we obtain the bound \(\varvec{\beta }=2^{-53}\) for the second standard model, and \(\varvec{\alpha }=2^{-53}/(1+2^{-53})\) is the optimal error bound for the first standard model.

Defining the switching points to be the harmonic mean of adjacent elements, the invariants change places: The maximum relative error with respect to \(x\) (the first standard model) is \(\mathbf{v}=2^{-53}\), whereas the maximum error relative to its rounded value \(f\) (the second standard model) becomes \(\mathbf{w}=2^{-53}\!{/}(1+2^{-53})\). Now \(\mathbf{w}=\varvec{\alpha }\) becomes the optimal error bound for the second standard model.

By continuity there must be a switching point between the harmonic and the arithmetic mean for which both the maximum relative error of all \(x\) with respect to both standard models is bounded by the same constant. Using the arithmetic-geometric mean inequality one verifies that for adjacent floating-point numbers \(0<f<g\) the maximum relative error for both models is equal to \(\sqrt{g/f}-1\) for \(x\) equal to the geometric mean \(\sqrt{fg}\), and that this relative error is maximized for \(f=2^e\) and its successor \(g=(1+2\mathbf{v})2^e\).

It follows that for the switching points \(x:=\text{ sign }(f)\sqrt{|fg|}\) the relative errors of both the first and the second standard models are bounded by the same characteristic constant \(\sqrt{1+2\mathbf{v}}-1\), slightly less than \(\mathbf{v}\) and slightly larger than \(\mathbf{v}/(1+\mathbf{v})\). The relation harmonic \(\le \) geometric \(\le \) arithmetic mean finds its counterpart in the corresponding maximal relative errors \(\mathbf{v}/(1+\mathbf{v})\le \sqrt{1+2\mathbf{v}}-1 \le \mathbf{v}\).