Abstract
The result of a floating-point operation is usually defined to be the floating-point number nearest to the exact real result together with a tie-breaking rule. This is called the first standard model of floating-point arithmetic, and the analysis of numerical algorithms is often solely based on that. In addition, a second standard model is used specifying the maximum relative error with respect to the computed result. In this note we take a more general perspective. For an arbitrary finite set of real numbers we identify the rounding to minimize the relative error in the first or the second standard model. The optimal “switching points” are the arithmetic or the harmonic means of adjacent floating-point numbers. Moreover, the maximum relative error of both models is minimized by taking the geometric mean. If the maximum relative error in one model is \(\alpha \), then \(\alpha /(1-\alpha )\) is the maximum relative error in the other model. Those maximal errors, that is the unit roundoff, are characteristic constants of a given finite set of reals: The floating-point model to be optimized identifies the rounding and the unit roundoff.
1 Introduction
Usually floating-point arithmetic over a finite set \(\mathbb {F}\subseteq \mathbb {R}\) is introduced by defining the result of a floating-point operation \(\text{ op }\in \{+,-,*,/\}\) to be \( fl (a\,\text{ op }\, b)\) for \(a,b\in \mathbb {F}\) and a rounding to nearest \( fl : \mathbb {R}\rightarrow \mathbb {F}\). The rounding is unique up to a tie-breaking rule.
More generally, rounding error analysis is often based on the first or the second standard model [4, Sect. 2]. The first standard model of arithmetic assumes
\[ fl (a\,\text{op}\,b) = (a\,\text{op}\,b)(1+\varepsilon ) \quad \text{with } |\varepsilon |\le u, \tag{1.1}\]
while the second standard model assumes
\[ fl (a\,\text{op}\,b) = \frac{a\,\text{op}\,b}{1+\varepsilon } \quad \text{with } |\varepsilon |\le u. \tag{1.2}\]
This is true for all \(a,b\in \mathbb {F}\) as long as no overflow or underflow occurs. The latter is the typical assumption for the analysis of numerical algorithms and also the general assumption for this note. For \(\mathbb {F}\) denoting the set of IEEE 754 binary64 floating-point numbers [5] the relative rounding error unit \(u:=2^{-53}\) is often used for both models.
In the following we assume an arbitrary finite set \(\mathfrak {F}\subseteq \mathbb {R}\) to be given. Without further assumptions we analyze the best possible approximation properties of real numbers by elements of \(\mathfrak {F}\). For minimizing the relative error with respect to the true result, i.e., the first standard model, this is clearly rounding to nearest. Minimization with respect to the computed result, i.e., the second standard model, requires a different rounding, discussed below. Finally, we identify the rounding that minimizes the error with respect to both standard models simultaneously.
The finiteness of \(\mathfrak {F}\) implies that a small relative error cannot be achieved for real numbers which are very large or very small in absolute value. To state that more precisely we define the normal range Footnote 1 of \(\mathfrak {F}\) by
\[ \varrho (\mathfrak {F}) := \{\, x\in \mathbb {R} \;:\; \min \mathfrak {F}_{<0}\le x\le \max \mathfrak {F}_{<0} \ \text{ or } \ \min \mathfrak {F}_{>0}\le x\le \max \mathfrak {F}_{>0} \,\}, \tag{1.3}\]
where \(\mathfrak {F}_{>0}\) and \(\mathfrak {F}_{<0}\) denote the sets of positive and negative elements of \(\mathfrak {F}\), respectively,
and assume \(\varrho (\mathfrak {F})\ne \emptyset \). Zero may be in \(\mathfrak {F}\) or not, but \(0\notin {\varrho }(\mathfrak {F})\) by (1.3). Here \(x>\max \mathfrak {F}_{>0}\) or \(x<\min \mathfrak {F}_{<0}\) corresponds to overflow, and \(\max \mathfrak {F}_{<0} <x< \min \mathfrak {F}_{>0}\) to underflow. Note that \(\varrho (\mathfrak {F})\) consists of all real numbers in one of the two intervals in (1.3).
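To make the definition of the normal range concrete, the following small Python sketch (not part of the paper; all helper names are ad hoc) computes \(\varrho (\mathfrak {F})\) for an arbitrary finite set and tests membership:

```python
# Sketch: the normal range rho(F) of a finite set F consists of the interval
# [min F_<0, max F_<0] (if F has negative elements) and [min F_>0, max F_>0]
# (if F has positive elements); zero and the over/underflow ranges are excluded.
def normal_range(F):
    pos = sorted(f for f in F if f > 0)
    neg = sorted(f for f in F if f < 0)
    intervals = []
    if neg:
        intervals.append((neg[0], neg[-1]))   # negative part
    if pos:
        intervals.append((pos[0], pos[-1]))   # positive part
    return intervals

def in_range(x, F):
    return any(lo <= x <= hi for lo, hi in normal_range(F))

F = {-4.0, -2.0, -1.0, 1.0, 1.5, 2.0, 3.0, 4.0}
print(normal_range(F))   # [(-4.0, -1.0), (1.0, 4.0)]
print(in_range(0.5, F))  # False: 0.5 lies in the underflow range
```

Values between \(\max \mathfrak {F}_{<0}\) and \(\min \mathfrak {F}_{>0}\) (underflow) and values beyond the extreme elements (overflow) are outside the normal range, as the example shows.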
2 Main results
The smallest possible constant \(u\) for the first standard model (1.1) is characterized by
\[ \varvec{\alpha } := \max _{x\in \varrho (\mathfrak {F})}\; \min _{f\in \mathfrak {F}}\; \frac{|f-x|}{|x|}. \tag{2.1}\]
Note that the minimum is taken over all \(f\in \mathfrak {F}\), whereas the maximum is taken over all \(x\) in \(\varrho (\mathfrak {F})\). The constant \(\varvec{\alpha }\) is the smallest achievable bound on the relative error with respect to \(x\), i.e., on the error of the first standard model (1.1). It is a characteristic constant of a given set \(\mathfrak {F}\).
For a given \(x\in {\varrho }(\mathfrak {F})\), denote a minimizing \(f\in \mathfrak {F}\) in (2.1) by \( fl (x)\). Definition (2.1), the continuity of \(|(f-x)/x|\) and the finiteness of \(\mathfrak {F}\) imply that \( fl (x)\) is uniquely defined for all except finitely many \(x\) in \(\varrho (\mathfrak {F})\). The latter are called the switching points of the otherwise uniquely defined rounding \( fl : {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}\). Any tie-breaking method that chooses a particular \(f\) at each switching point defines a particular round-to-nearest function \( fl \). Note that for rounding to nearest the switching points are the arithmetic means of adjacent elements in \(\mathfrak {F}\).
The rounding function can be extended to \( fl : \mathbb {R}\rightarrow \mathfrak {F}\), so that any rounding to nearest \( fl : \mathbb {R}\rightarrow \mathfrak {F}\) with \(| fl (x)-x|=\min \{|f'-x|:f'\in \mathfrak {F}\}\) for all \(x\in \mathbb {R}\) satisfies
\[ \frac{| fl (x)-x|}{|x|} \le \varvec{\alpha } \quad \text{for all } x\in \varrho (\mathfrak {F}), \tag{2.2}\]
no matter how ties are rounded.
For rounding to nearest \( fl : \mathbb {R}\rightarrow \mathfrak {F}\) the maximum error of the first standard model (1.1) is minimized, whereas the maximum error for the second standard model (1.2) is given by
\[ \varvec{\beta } := \sup _{x\in \varrho (\mathfrak {F})} \frac{| fl (x)-x|}{| fl (x)|}. \tag{2.3}\]
Since the supremum is taken, the definition of \(\varvec{\beta }\) is independent of a tie-breaking rule. Thus, \(\varvec{\alpha }\) and \(\varvec{\beta }\) are well defined and are characteristic invariants of the set \(\mathfrak {F}\), and they are related as follows.
Lemma 2.1
Let finite \(\mathfrak {F}\subseteq \mathbb {R}\) with \(\varrho (\mathfrak {F})\ne \emptyset \) be given. If the positive and the negative parts of \(\varrho (\mathfrak {F})\) consist of at most one element, then \(\varvec{\alpha }=\varvec{\beta }=0\). Otherwise,
\[ 0<\varvec{\alpha }<1 \quad \text{and}\quad \varvec{\beta }=\frac{\varvec{\alpha }}{1-\varvec{\alpha }} \]
for \(\varvec{\alpha }\) and \(\varvec{\beta }\) defined by (2.1) and (2.3), respectively.
Proof
If the positive and/or negative part of \(\varrho (\mathfrak {F})\) is empty or consists of only one element, there is nothing to prove for that part. Let fixed but arbitrary positive adjacent elements \(f,g\in \mathfrak {F}\) be given with \(0<f<g\). For \(x\in [f,g]\) we have \( fl (x)\in \{f,g\}\), and it is not hard to see that \(| fl (x)-x|/|x|\) is maximized for \(x=m:=(f+g)/2\). This maximum is
\[ \frac{m-f}{m} = \frac{g-f}{g+f} =: \alpha . \tag{2.4}\]
This implies in particular \(\alpha <1\). For increasing \(x\in [f,m)\) the function \(|f-x|/|f|\) increases, and for increasing \(x\in (m,g]\) the function \(|g-x|/|g|\) decreases. For \(x:=m-\varepsilon \) and sufficiently small \(\varepsilon >0\) we have \( fl (x)=f\) and
\[ \frac{| fl (x)-x|}{| fl (x)|} = \frac{m-\varepsilon -f}{f} = \frac{g-f}{2f} - \frac{\varepsilon }{f}. \]
On the other hand, for \(x:=m+\varepsilon \) and sufficiently small \(\varepsilon >0\) we have \( fl (x)=g\), and therefore
\[ \frac{| fl (x)-x|}{| fl (x)|} = \frac{g-m-\varepsilon }{g} < \frac{g-f}{2g} < \frac{g-f}{2f}. \]
Thus
\[ \sup _{x\in [f,g]} \frac{| fl (x)-x|}{| fl (x)|} = \frac{g-f}{2f} =: \beta . \]
Moreover,
\[ \frac{g-f}{2f} = \frac{(g-f)/(g+f)}{1-(g-f)/(g+f)} = \frac{\alpha }{1-\alpha }. \]
A similar argument applies to negative adjacent elements \(f,g\in {\varrho }(\mathfrak {F})\), so that on every such interval \([f,g]\) the maximal \(\alpha \) and \(\beta \) are related by \(\beta =\alpha /(1-\alpha )\). Since \(f,g\) are chosen arbitrarily and since the mapping \(\alpha \rightarrow \alpha /(1-\alpha )\) is increasing for \(0\le \alpha <1\), the proof is finished. \(\square \)
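Lemma 2.1 can be checked numerically for any finite set. The following illustrative sketch (not from the paper; names are ad hoc) computes \(\varvec{\alpha }\) as the maximum of \((g-f)/(g+f)\) over adjacent positive elements, and \(\varvec{\beta }\) via the relation just proved:

```python
# Brute-force invariants for the positive part of a finite set: by Lemma 2.1,
# alpha = max (g-f)/(g+f) over adjacent pairs, and beta = alpha/(1-alpha).
def alpha_beta(pos):
    pos = sorted(pos)
    alpha = max((g - f)/(g + f) for f, g in zip(pos, pos[1:]))
    return alpha, alpha/(1 - alpha)

a, b = alpha_beta([1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0])
print(a, b)   # alpha = 1/9 (attained on [1, 1.25] and [2, 2.5]), beta = 1/8
```

The example mimics a toy binary floating-point set; the worst intervals sit directly above the powers \(1\) and \(2\).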
For \(\mathfrak {F}\) denoting the set \(\mathbb {F}\) of IEEE 754 binary64 floating-point numbers barring underflow we may restrict \(x\) in Definition (2.1) to the interval \([1,2)\). Abbreviating \(u:=2^{-53}\), the maximum is achieved for \(x=1+u\), the midpoint of \(1\) and its successor in \(\mathbb {F}\). It follows that \(\varvec{\alpha }=u/(1+u)\) for the first standard model and \(\varvec{\beta }=u\) for the second one, the characteristic constants of \(\mathbb {F}\). In the literature (e.g., [4]) the same constant \(u\) is used for both standard models. This improvement for the first model was noted in [8]; see also [6].
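Since \(u=2^{-53}\) is too small to distinguish \(u/(1+u)\) from \(u\) in binary64 itself, the binary64 constants are best verified in exact rational arithmetic. A small check (illustrative, not from the paper):

```python
from fractions import Fraction

# Exact rational verification of the binary64 constants: with u = 2**-53 the
# first-model optimum is alpha = u/(1+u), and beta = alpha/(1-alpha) equals u.
u = Fraction(1, 2**53)
alpha = u / (1 + u)
beta = alpha / (1 - alpha)
assert alpha == Fraction(1, 2**53 + 1)
assert beta == u
```

Evaluating \(u/(1+u)\) in binary64 would simply return \(u\), because \(1+u\) rounds to \(1\); exact rationals avoid that pitfall.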
Next we ask for a rounding realizing the smallest possible constant \(u\) for the second standard model (1.2). The corresponding characteristic constant is defined by
\[ \mathbf{w} := \max _{x\in \varrho (\mathfrak {F})}\; \min _{f\in \mathfrak {F}}\; \frac{|f-x|}{|f|}. \tag{2.5}\]
For a given \(x\in {\varrho }(\mathfrak {F})\) we first identify \(\varphi \in \mathfrak {F}\) minimizing \(|\varphi -x|/|\varphi |\). If \(x\in \mathfrak {F}\), then clearly it suffices to take \(\varphi :=x\). Otherwise the definition (1.3) of the range implies the existence of adjacent \(f,g\in \mathfrak {F}\cap {\varrho }(\mathfrak {F})\) with \(f<x<g\). The function \(|\varphi -x|/|\varphi |\) decreases on \(\varphi \in (0,x]\) and increases on \(\varphi \in [x,\infty )\), so that \(\varphi \in \{f,g\}\). For increasing \(y\in [f,g]\) the ratio \(|f-y|/|f|\) increases and \(|g-y|/|g|\) decreases. Equality is achieved for \(y:=h(f,g):=2(\frac{1}{f}+\frac{1}{g})^{-1}\), the harmonic mean of \(f\) and \(g\), because
\[ \frac{|f-h(f,g)|}{|f|} = \frac{g-f}{g+f} = \frac{|g-h(f,g)|}{|g|}. \tag{2.6}\]
Thus \(\varphi \) minimizing \(|\varphi -x|/|\varphi |\) is \(f\) for \(x\in [f,h(f,g))\), it is \(g\) for \(x\in (h(f,g),g]\), and is either \(f\) or \(g\) if \(x=h(f,g)\). Hence a rounding \(\text{ gl }: {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}\) with the property
\[ \frac{|\text{gl}(x)-x|}{|\text{gl}(x)|} = \min _{f\in \mathfrak {F}} \frac{|f-x|}{|f|} \quad \text{for all } x\in \varrho (\mathfrak {F}) \tag{2.7}\]
is characterized by being a projection and satisfying
\[ \text{gl}(x)=f \ \text{ for } x<h \quad \text{and}\quad \text{gl}(x)=g \ \text{ for } x>h \tag{2.8}\]
for \(x\in [f,g]\) with adjacent \(f,g\in {\varrho }(\mathfrak {F})\) and \(h\) denoting the harmonic mean of \(f\) and \(g\).
For rounding to nearest the switching point is the arithmetic mean of adjacent elements \(f,g\in \mathfrak {F}\); now it is the harmonic mean. In both cases there is freedom in breaking the tie without jeopardizing the minimization properties.
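A minimal Python sketch (not from the paper; all names ad hoc) of such a rounding with harmonic-mean switching points, together with a brute-force check that its second-model error stays below the corresponding bound:

```python
import bisect

# Sketch of a rounding gl on the positive part of a finite set: the switching
# point between adjacent neighbors is their harmonic mean, ties broken downward.
def gl(x, pos):                        # pos: sorted positive elements
    i = bisect.bisect_left(pos, x)
    if i == 0:
        return pos[0]
    if i == len(pos):
        return pos[-1]
    f, g = pos[i - 1], pos[i]
    h = 2*f*g/(f + g)                  # harmonic mean of the neighbors
    return f if x <= h else g

pos = [1.0, 1.5, 2.0, 3.0, 4.0]
alpha = max((g - f)/(g + f) for f, g in zip(pos, pos[1:]))
# the second-model error |gl(x)-x|/|gl(x)| never exceeds alpha on a grid
worst = max(abs(gl(x, pos) - x)/abs(gl(x, pos))
            for x in [1.0 + k*0.003 for k in range(1001)])
assert worst <= alpha + 1e-12
```

For this toy set the bound \(\max (g-f)/(g+f)=0.2\) is approached near the harmonic means, e.g. at \(h(1,1.5)=1.2\).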
For a rounding \(\text{ gl }: {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}\) with (2.7) we showed
\[ \frac{|\text{gl}(x)-x|}{|\text{gl}(x)|} \le \mathbf{w} \quad \text{for all } x\in \varrho (\mathfrak {F}), \]
minimizing the error of the second standard model (1.2). Note that \(\mathbf{w}\), just like \(\varvec{\alpha }\), solely depends on \(\mathfrak {F}\). Using the rounding \(\text{ gl }: \varrho (\mathfrak {F})\rightarrow \mathfrak {F}\) with (2.7) to minimize the error in the second standard model (1.2), the maximum error for the first standard model (1.1) is given by
\[ \mathbf{v} := \sup _{x\in \varrho (\mathfrak {F})} \frac{|\text{gl}(x)-x|}{|x|}. \tag{2.9}\]
As before we will show that \(\mathbf{v}\) is invariant for any rounding \(\text{ gl }: \mathbb {R}\rightarrow \mathfrak {F}\) with (2.7). Thus \(\mathbf{v}\) and \(\mathbf{w}\) are well defined and are again characteristic invariants of the set \(\mathfrak {F}\). For any \(\mathfrak {F}\), the invariants \(\varvec{\alpha }, \varvec{\beta }, \mathbf{v}, \mathbf{w}\) are related as follows.
Lemma 2.2
Let finite \(\mathfrak {F}\subseteq \mathbb {R}\) with \(\varrho (\mathfrak {F})\ne \emptyset \) be given. Let \(\mathbf{v}\), \(\mathbf{w}\), \(\varvec{\alpha }\) and \(\varvec{\beta }\) be as defined in (2.9), (2.5), (2.1) and (2.3), respectively. If the positive and the negative parts of \(\varrho (\mathfrak {F})\) consist of at most one element, then \(\mathbf{v}=\mathbf{w}=0\). Otherwise,
\[ \mathbf{w}=\varvec{\alpha } \quad \text{and}\quad \mathbf{v}=\varvec{\beta } \tag{2.10}\]
and therefore
\[ \mathbf{v}=\frac{\mathbf{w}}{1-\mathbf{w}}. \]
Proof
If the positive and/or negative part of \(\varrho (\mathfrak {F})\) is empty or consists of only one element, there is nothing to prove for that part. Let fixed but arbitrary positive adjacent elements \(f,g\in \mathfrak {F}\) be given with \(0<f<g\). By (2.6) the ratio \(|\text{ gl }(x)-x|/|\text{ gl }(x)|\) is maximal for \(x\) being the harmonic mean of \(f\) and \(g\). That maximum value is \((g-f)/(g+f)\), so that applying this argument to the negative part of \(\varrho (\mathfrak {F})\) and using (2.4) shows \(\mathbf{w}=\varvec{\alpha }<1\).
The supremum of the ratio \(|\text{ gl }(x)-x|/|x|\) is also achieved for the harmonic mean \(x:=2fg/(f+g)\), and using \(x-f<g-x\) it is equal to
\[ \frac{g-x}{x} = \frac{g-f}{2f} = \frac{\alpha }{1-\alpha } \]
for \(\alpha :=(g-f)/(g+f)\). Applying the same argument to the negative part of \(\varrho (\mathfrak {F})\) and using (2.9) and (2.4) implies
\[ \mathbf{v}=\frac{\varvec{\alpha }}{1-\varvec{\alpha }}. \]
A computation using (2.4) yields \(\varvec{\beta }= \varvec{\alpha }/(1-\varvec{\alpha })\), and the results follow. \(\square \)
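The two suprema in the proof can be checked on a single interval. A brute-force illustration with toy values (not from the paper):

```python
# Check of the proof on one interval [f, g]: at the harmonic switching point h
# the second-model error is (g-f)/(g+f), while the first-model supremum,
# approached just above h where gl rounds up to g, is (g-h)/h = alpha/(1-alpha).
f, g = 2.0, 3.0
h = 2*f*g/(f + g)                      # harmonic mean, here 2.4
alpha = (g - f)/(g + f)                # local alpha = 0.2
assert abs((h - f)/f - alpha) < 1e-14              # second standard model at x = h
assert abs((g - h)/h - alpha/(1 - alpha)) < 1e-12  # first standard model
```

With \(f=2\), \(g=3\) the two maxima are \(0.2\) and \(0.25=0.2/(1-0.2)\), matching \(\mathbf{w}=\varvec{\alpha }\) and \(\mathbf{v}=\varvec{\beta }\) locally.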
3 Examples
For a given set \(\mathfrak {F}\subseteq \mathbb {R}\), knowing one of the invariants \(\mathbf{v}\), \(\mathbf{w}\), \(\varvec{\alpha }\) or \(\varvec{\beta }\) means knowing the others. Choosing the appropriate rounding \( fl \) or \(\text{ gl }\), i.e., the switching points being the arithmetic or the harmonic mean, the maximum error for both models is the same.
For any finite set \(\mathfrak {F}\) one invariant can be computed using (2.4) or (2.10); the other invariants then follow. As an example consider a logarithmic number system (LNS)
\[ \mathfrak {F} := \{\, c^{k} \;:\; k\in \mathbb {Z},\ k_1\le k\le k_2 \,\} \quad \text{for some real } c>1. \tag{3.1}\]
Earliest references of this well-studied concept include [7, 9, 10]; see [1] and the literature cited therein for recent publications. Assume the integers \(k_1,k_2\) satisfy \(k_1<k_2\) to ensure that \(\varrho (\mathfrak {F})\) has at least two elements. Then \(\varrho (\mathfrak {F})=[c^{k_1},c^{k_2}]\), and (2.10) yields
\[ \mathbf{w}=\varvec{\alpha }=\frac{c-1}{c+1} \quad \text{and}\quad \mathbf{v}=\varvec{\beta }=\frac{c-1}{2}. \]
These characteristic constants are the same for \(\mathfrak {F}\cup (-\mathfrak {F})\) or for \(\mathfrak {F}\cup \{0\}\cup (-\mathfrak {F})\). The elements could be stored by the integer exponent. If the product or quotient of two elements in \(\mathfrak {F}\) lies in \(\varrho (\mathfrak {F})\), then it is in \(\mathfrak {F}\). In terms of floating-point operations this means that multiplication and division are error-free if no over- or underflow occurs.
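Both properties, the constant relative gap and error-free multiplication, are easy to check for a toy LNS (illustrative sketch, names ad hoc):

```python
import math

# Toy logarithmic number system (3.1): F = {c**k : k1 <= k <= k2}, c > 1.
c, k1, k2 = 1.5, -10, 10
F = [c**k for k in range(k1, k2 + 1)]
alpha = (c - 1)/(c + 1)            # same on every interval of adjacent elements
for f, g in zip(F, F[1:]):
    assert abs((g - f)/(g + f) - alpha) < 1e-12
# products stay in the set while exponents stay in range: multiplication is exact
assert math.isclose(F[3] * F[4], c**((k1 + 3) + (k1 + 4)))
```

Since adjacent elements always differ by the factor \(c\), the ratio \((g-f)/(g+f)=(c-1)/(c+1)\) is independent of the interval, in contrast to a floating-point grid.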
Moreover, those sets bear the advantage that the maximum relative rounding error for the first and for the second standard model is attained on each interval \([f,g]\) of adjacent elements in \(\varrho (\mathfrak {F})\). In contrast, for a floating-point system to base \(\beta \) the maximal relative error on such intervals varies by a factor of up to \(\beta \), depending on whether the interval \([f,g]\) lies slightly above or slightly below a power of \(\beta \).
A possible choice is \(c:=2^\varepsilon \) with \(\varepsilon :=2^{-52}\). Spending one bit for the sign, integers up to about \({\pm }2^{62}\) can be stored in a \(64\)-bit word, so that the positive range is about \([2^{-1024}, 2^{1024}]\), similar to IEEE 754 double precision. The relative rounding error unit is then about \(0.77\times 10^{-16}\) compared to about \(1.11\times 10^{-16}\) in double precision. With this choice of \(c\), powers of \(2\) are in \(\mathfrak {F}\); as a disadvantage, however, \(\mathfrak {F}\) contains no other integers.
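The stated unit can be reproduced numerically; note that forming \(c-1\) directly in binary64 loses all accuracy because \(c=2^{\varepsilon }\) is so close to \(1\), so `expm1` is used instead (illustrative sketch):

```python
import math

# Unit for c = 2**eps with eps = 2**-52: alpha = (c-1)/(c+1) ~ (c-1)/2,
# and c - 1 = 2**eps - 1 = expm1(eps*ln 2), evaluated without cancellation.
eps = 2.0**-52
alpha = math.expm1(eps * math.log(2.0)) / 2.0
print(alpha)   # about 0.77e-16, versus u = 2**-53 ~ 1.11e-16 for binary64
```

This recovers \(\varvec{\alpha }\approx \ln (2)\cdot 2^{-53}\approx 0.77\times 10^{-16}\), the value quoted above.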
Figure 1 shows a typical graph of rounding errors for binary floating-point and for the logarithmic number system (3.1). The left two pictures show the relative rounding error with respect to the first and second standard model for \(p=4\) bits binary precision according to IEEE 754. Typically, the error decreases in steps towards the next power of \(2\). The third picture is for a logarithmic number system as in (3.1), where the maximal error at the switching points is constant.
Another example is level-index arithmetic [2, 3]. This interesting approach was introduced to overcome problems with over- or underflow. Basically, \(\mathfrak {F}\) is defined to consist of \(s\cdot \Phi (\ell +x)^{t}\) for \(\ell \in \mathbb {N}_0\), \(s,t\in \{-1,1\}\) and \(x\in \mathcal {M}\) for some finite set \(\mathcal {M}\subseteq [0,1)\), where
\[ \Phi (y) := y \ \text{ for } 0\le y<1 \quad \text{and}\quad \Phi (y) := e^{\Phi (y-1)} \ \text{ for } y\ge 1. \tag{3.2}\]
The integer part \(\ell \) of \(y:=\ell +x\) is called the level of \(y\). Even if the level is limited to small integers, the range of \(\mathfrak {F}\) is huge, comprising numbers very small and very large in magnitude.
This basically solves the problem of over- and underflow; however, it comes at a price concerning the maximum relative error of floating-point operations. For \(\ell \in \mathbb {N}_0\), \(0<x\in \mathbb {R}\) and small \(\varepsilon \) we have
\[ \Phi (\ell +x+\varepsilon ) \approx \Phi (\ell +x) + \varepsilon \prod _{j=1}^{\ell } \Phi (j+x). \tag{3.3}\]
That means the distance between adjacent elements in \(\mathfrak {F}\) increases exponentially. Thus by (3.3), the maximum relative errors \(\varvec{\alpha }\) and \(\varvec{\beta }\), which are characterized by Lemma 2.1, increase exponentially with the level as well. Hence avoiding over- and underflow problems results in a maximal relative error of floating-point operations close to 1. As a consequence, at least the usual way of analyzing floating-point algorithms seems hardly possible.
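The growth of the spacing can be observed directly from the recursion (3.2). A small sketch (illustrative; names ad hoc) estimates \(\Phi '\) at the middle of the first few levels by a finite difference:

```python
import math

# Level-index function (3.2): Phi(y) = y on [0,1), Phi(y) = exp(Phi(y-1)) else.
def Phi(y):
    return y if y < 1.0 else math.exp(Phi(y - 1.0))

# The local spacing of representable values behaves like Phi'(y), which by (3.3)
# is a product of values of Phi and hence grows extremely fast with the level.
d = 1e-6
slopes = [(Phi(level + 0.5 + d) - Phi(level + 0.5)) / d for level in range(1, 4)]
print(slopes)   # steeply increasing with the level
```

Already at level 3 the slope is in the hundreds, illustrating why the relative error of adjacent elements approaches 1 for higher levels.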
4 Conclusion
From the perspective of a rounding function the situation for binary64 can be summarized as follows. If the switching point for the rounding is set to the arithmetic mean of adjacent elements, that is the usual rounding to nearest, then we obtain the bound \(\varvec{\beta }=2^{-53}\) for the second standard model, and \(\varvec{\alpha }=2^{-53}/(1+2^{-53})\) is the optimal error bound for the first standard model.
Defining the switching points to be the harmonic mean of adjacent elements, the invariants change places: The maximum relative error with respect to \(x\) (the first standard model) is \(\mathbf{v}=2^{-53}\), whereas the maximum error relative to its rounded value \(f\) (the second standard model) becomes \(\mathbf{w}=2^{-53}\!{/}(1+2^{-53})\). Now \(\mathbf{w}=\varvec{\alpha }\) becomes the optimal error bound for the second standard model.
By continuity there must be a switching point between the harmonic and the arithmetic mean for which the maximum relative error over all \(x\) is bounded by the same constant in both standard models. Using the arithmetic-geometric mean inequality one verifies that for adjacent floating-point numbers \(0<f<g\) the maximum relative error for both models is equal to \(\sqrt{g/f}-1\) for \(x\) equal to the geometric mean \(\sqrt{fg}\), and that this relative error is maximized for \(f=2^e\) and its successor \(g=(1+2\mathbf{v})2^e\).
It follows that for the switching points \(x:=\text{ sign }(f)\sqrt{|fg|}\) the relative errors of both the first and the second standard models are bounded by the same characteristic constant \(\sqrt{1+2\mathbf{v}}-1\), slightly less than \(\mathbf{v}\) and slightly larger than \(\mathbf{v}/(1+\mathbf{v})\). The relation harmonic \(\le \) geometric \(\le \) arithmetic mean finds its counterpart in the corresponding maximal relative errors \(\mathbf{v}/(1+\mathbf{v})\le \sqrt{1+2\mathbf{v}}-1 \le \mathbf{v}\).
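The balancing property of the geometric mean is easy to confirm on one interval (illustrative toy values, not from the paper):

```python
import math

# At the geometric mean x = sqrt(f*g) of adjacent 0 < f < g, both standard-model
# errors coincide: (x-f)/f = (g-x)/x = sqrt(g/f) - 1.
f, g = 4.0, 4.5
x = math.sqrt(f * g)
t = math.sqrt(g / f) - 1.0
assert abs((x - f)/f - t) < 1e-12   # second model, rounding down to f
assert abs((g - x)/x - t) < 1e-12   # first model, rounding up to g
```

Both ratios equal \(\sqrt{g/f}-1\), so choosing the geometric mean as switching point equalizes the two models, in line with the chain harmonic \(\le \) geometric \(\le \) arithmetic mean.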
Notes
For the standard formats \(\mathbb {F}\) in IEEE 754 the range could be slightly wider: for \(f\) denoting the rounded-to-nearest result in \(\mathbb {F}\) with unbounded exponent range, return this \(f\) if it belongs to \(\mathbb {F}\) with the bounded exponent range. Since we are aiming at general sets \(\mathfrak {F}\), there is no notion of “exponent range”.
References
Arnold, M.G., Collange, S.: The denormal logarithmic number system. In: 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 117–124 (2013)
Clenshaw, C.W., Olver, F.W.J.: Beyond floating point. J. ACM 31(2), 319–328 (1984)
Clenshaw, C.W., Olver, F.W.J., Turner, P.R.: Level-index arithmetic: an introductory survey. Lect. Notes Math. 1397, 95–168 (1989)
Higham, N.J.: Accuracy and stability of numerical algorithms, 2nd edn. SIAM Publications, Philadelphia (2002)
IEEE Standard 754–2008: IEEE Standard for Floating-Point Arithmetic. IEEE Computer Society, New York (2008)
Jeannerod, C.-P., Rump, S.M.: On relative errors of floating-point operations: optimal bounds and applications. Preprint (2014)
Kingsbury, N.G., Rayner, P.J.W.: Digital filtering using logarithmic arithmetic. Electron. Lett. 7, 56–58 (1971)
Knuth, D.E.: The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 3rd edn. Addison-Wesley, Reading, Massachusetts (1998)
Lee, S.C., Edgar, A.D.: The focus number system. IEEE Trans. Comput. C–26, 1167–1170 (1977)
Swartzlander Jr, E.E., Alexopoulos, A.G.: The sign/logarithm number system. IEEE Trans. Comput. C–24, 1238–1243 (1975)
Acknowledgments
Our dearest thanks go to Claude-Pierre Jeannerod from Lyon for his many detailed comments and for very helpful discussions and suggestions. Moreover, many thanks to the anonymous referees for their valuable and constructive comments.
Communicated by Axel Ruhe.
Rump, S.M., Lange, M. On the definition of unit roundoff. Bit Numer Math 56, 309–317 (2016). https://doi.org/10.1007/s10543-015-0554-0