On the definition of unit roundoff

Rump, Siegfried M.; Lange, Marko

doi:10.1007/s10543-015-0554-0

On the definition of unit roundoff

Published: 17 March 2015

Volume 56, pages 309–317, (2016)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

BIT Numerical Mathematics Aims and scope Submit manuscript

On the definition of unit roundoff

Download PDF

Siegfried M. Rump^1,2 &
Marko Lange¹

608 Accesses
6 Citations
Explore all metrics

Abstract

The result of a floating-point operation is usually defined to be the floating-point number nearest to the exact real result together with a tie-breaking rule. This is called the first standard model of floating-point arithmetic, and the analysis of numerical algorithms is often solely based on that. In addition, a second standard model is used specifying the maximum relative error with respect to the computed result. In this note we take a more general perspective. For an arbitrary finite set of real numbers we identify the rounding to minimize the relative error in the first or the second standard model. The optimal “switching points” are the arithmetic or the harmonic means of adjacent floating-point numbers. Moreover, the maximum relative error of both models is minimized by taking the geometric mean. If the maximum relative error in one model is $\alpha $, then $\alpha /(1-\alpha )$ is the maximum relative error in the other model. Those maximal errors, that is the unit roundoff, are characteristic constants of a given finite set of reals: The floating-point model to be optimized identifies the rounding and the unit roundoff.

On the maximum relative error when computing integer powers by iterated multiplications in floating-point arithmetic

Article 01 February 2015

Error estimates for the summation of real numbers with application to floating-point summation

Article 03 May 2017

A note on Dekker’s FastTwoSum algorithm

Article Open access 24 April 2020

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Usually floating-point arithmetic over a finite set $\mathbb {F}\subseteq \mathbb {R}$ is introduced by defining the result of a floating-point operation $\text{ op }\in \{+,-,*,/\}$ to be $ fl (a\,\text{ op }\, b)$ for $a,b\in \mathbb {F}$ and a rounding to nearest $ fl : \mathbb {R}\rightarrow \mathbb {F}$. The rounding is unique up to a tie-breaking rule.

More generally, rounding error analysis is often based on the first or the second standard model [4, Sect. 2]. The first standard model of arithmetic assumes

$$\begin{aligned} a,b\in \mathbb {F}: fl (a \,\text{ op }\, b) = (a \,\text{ op }\, b)(1+\delta ),\quad |\delta |\le u,\quad \text{ op }= +,-,*,/ , \end{aligned}$$

(1.1)

while the second standard model assumes

$$\begin{aligned} fl (a \,\text{ op }\, b) = \frac{a \,\text{ op }\, b}{1+\delta },\quad |\delta |\le u . \end{aligned}$$

(1.2)

This is true for all $a,b\in \mathbb {F}$ as long as no overflow or underflow occurs. The latter is the typical assumption for the analysis of numerical algorithms and also the general assumption for this note. For $\mathbb {F}$ denoting the set of IEEE 754 binary64 floating-point numbers [5] the relative rounding error unit $u:=2^{-53}$ is often used for both models.

In the following we assume an arbitrary finite set $\mathfrak {F}\subseteq \mathbb {R}$ to be given. Without further assumptions we analyze best possible approximation properties of real numbers by elements in $\mathfrak {F}$. For minimizing the relative error with respect to the true result, i.e. the first standard model, this is clearly rounding to nearest. Minimization with respect to the computed result, i.e., the second standard model, requires a different rounding to be discussed. Finally, we identify the rounding to minimize the error with respect to both standard models simultaneously.

The finiteness of $\mathfrak {F}$ implies that a small relative error cannot be achieved for real numbers which are very large or very small in absolute value. To state that more precisely we define the normal range ^{Footnote 1} of $\mathfrak {F}$ by

$$\begin{aligned} \varrho (\mathfrak {F}) := [\min \mathfrak {F}_{<0},\max \mathfrak {F}_{<0}] \cup [\min \mathfrak {F}_{>0},\max \mathfrak {F}_{>0}] \subset \mathbb {R}\end{aligned}$$

(1.3)

and assume $\varrho (\mathfrak {F})\ne \emptyset $. Zero may be in $\mathfrak {F}$ or not, but $0\notin {\varrho }(\mathfrak {F})$ by (1.3). Here $x>\max \mathfrak {F}_{>0}$ or $x<\min \mathfrak {F}_{<0}$ corresponds to overflow, and $\max \mathfrak {F}_{<0} <x< \min \mathfrak {F}_{>0}$ to underflow. Note that $\varrho (\mathfrak {F})$ consists of all real numbers in one of the two intervals in (1.3).

2 Main results

The smallest possible constant $u$ for the first standard model (1.1) is characterized by

$$\begin{aligned} \varvec{\alpha }:= \varvec{\alpha }(\mathfrak {F}) = \sup \limits _{x\in {\varrho }(\mathfrak {F})} \min \limits _{f\in \mathfrak {F}} \left| \frac{f-x}{x}\right| . \end{aligned}$$

(2.1)

Note that the minimum is taken over all $f\in \mathfrak {F}$, whereas the maximum is taken over all x in $\varrho (\mathfrak {F})$. The constant $\varvec{\alpha }$ minimizes the relative error with respect to $x$, i.e., the maximum error of the first standard model (1.1). It is a characteristic constant of a given set $\mathfrak {F}$.

For a given $x\in {\varrho }(\mathfrak {F})$, denote a minimizing $f\in \mathfrak {F}$ in (2.1) by $ fl (x)$. Definition (2.1), the continuity of $|(f-x)/x|$ and the finiteness of $\mathfrak {F}$ imply that $ fl (x)$ is uniquely defined for all except finitely many $x$ in $\varrho (\mathfrak {F})$. The latter are called the switching points of the otherwise uniquely defined rounding $ fl : {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}$. Any “tie-breaking” method, that chooses a particular $f$ value at each switching point, defines a particular round-to-nearest function $ fl $. Note that for rounding to nearest the switching points are the arithmetic means of adjacent elements in $\mathfrak {F}$.

The rounding function can be extended to $ fl : \mathbb {R}\rightarrow \mathfrak {F}$, so that any rounding to nearest $ fl : \mathbb {R}\rightarrow \mathfrak {F}$ with $| fl (x)-x|=\min \{|f'-x|:f'\in \mathfrak {F}\}$ for all $x\in \mathbb {R}$ satisfies

$$\begin{aligned} \varvec{\alpha }= \sup \limits _{x\in {\varrho }(\mathfrak {F})} \left| \frac{ fl (x)-x}{x}\right| , \end{aligned}$$

(2.2)

no matter how ties are rounded.

For rounding to nearest $ fl : \mathbb {R}\rightarrow \mathfrak {F}$ the maximum error of the first standard model (1.1) is minimized, whereas the maximum error for the second standard model (1.2) is given by

$$\begin{aligned} \varvec{\beta }:= \varvec{\beta }(\mathfrak {F}) = \sup \limits _{x\in {\varrho }(\mathfrak {F})} \left| \frac{ fl (x)-x}{ fl (x)}\right| . \end{aligned}$$

(2.3)

Since the supremum is taken, the definition of $\varvec{\beta }$ is independent of a tie-breaking rule. Thus, $\varvec{\alpha }$ and $\varvec{\beta }$ are well defined and are characteristic invariants of the set $\mathfrak {F}$, and they are related as follows.

Lemma 2.1

Let finite $\mathfrak {F}\subseteq \mathbb {R}$ with $\varrho (\mathfrak {F})\ne \emptyset $ be given. If the positive and the negative parts of $\varrho (\mathfrak {F})$ consist of at most one element, then $\varvec{\alpha }=\varvec{\beta }=0$. Otherwise,

$$\begin{aligned} \varvec{\alpha }= \frac{\varvec{\beta }}{1+\varvec{\beta }} = \max \left\{ \left| \frac{g-f}{g+f}\right| : f,g \,\mathrm{adjacent\, in }\,\varrho (\mathfrak {F}) \,\mathrm{with }\, fg>0 \right\} < 1 \end{aligned}$$

(2.4)

for $\varvec{\alpha }$ and $\varvec{\beta }$ defined by (2.1) and (2.3), respectively.

Proof

If the positive and/or negative part of $\varrho (\mathfrak {F})$ is empty or consists of only one element, there is nothing to prove for that part. Let fixed but arbitrary positive adjacent elements $f,g\in \mathfrak {F}$ be given with $0<f<g$. For $x\in [f,g]$ we have $ fl (x)\in \{f,g\}$, and it is not hard to see that $| fl (x)-x|/|x|$ is maximized for $x=m:=(f+g)/2$. This maximum is

$$\begin{aligned} \alpha :=\frac{m-f}{m} = \frac{g-m}{m} = \frac{(g-f)/2}{(g+f)/2} = \frac{g-f}{g+f} . \end{aligned}$$

This implies in particular $\alpha <1$. For increasing $x\in [f,m)$ the function $|f-x|/|f|$ increases, and for increasing $x\in (m,g]$ the function $|g-x|/|g|$ decreases. For $x:=m-\varepsilon $ and sufficiently small $\varepsilon >0$ we have $ fl (x)=f$ and

$$\begin{aligned} \frac{|f-x|}{|f|} = \frac{(g+f)/2-\varepsilon -f}{f} = \frac{g-f}{2f} - \mathcal {O}(\varepsilon ) . \end{aligned}$$

On the other hand, for $x:=m+\varepsilon $ and sufficiently small $\varepsilon >0$ we have $ fl (x)=g$, and therefore

$$\begin{aligned} \frac{|g-x|}{|g|} = \frac{g-(g+f)/2-\varepsilon }{g} < \frac{g-f}{2g} < \frac{g-f}{2f} . \end{aligned}$$

Thus

$$\begin{aligned} \beta := \sup \left\{ \left| \frac{ fl (x)-x}{ fl (x)}\right| : f\le x\le g\right\} = \frac{g-f}{2f} . \end{aligned}$$

Moreover,

$$\begin{aligned} \frac{\alpha }{1-\alpha } = \frac{(g-f)/(g+f)}{(g+f-g+f)/(g+f)} = \frac{g-f}{2f} = \beta \quad \text{ and } \text{ therefore }\quad \alpha =\frac{\beta }{1+\beta }. \end{aligned}$$

A similar argument applies to negative adjacent elements $f,g\in {\varrho }(\mathfrak {F})$, so that on every such interval $[f,g]$ the maximal $\alpha $ and $\beta $ are related by $\beta =\alpha /(1-\alpha )$. Since $f,g$ are chosen arbitrarily and since the mapping $\alpha \rightarrow \alpha /(1-\alpha )$ is increasing for $0\le \alpha <1$, the proof is finished. $\square $

For $\mathfrak {F}$ denoting the set $\mathbb {F}$ of IEEE 754 binary64 floating-point numbers barring underflow we may restrict $x$ in Definition (2.1) to the interval $[1,2)$. Abbreviating $u:=2^{-53}$, the maximum is achieved for $x=1+u$, the midpoint of $1$ and its successor in $\mathbb {F}$. It follows $\varvec{\alpha }=u/(1+u)$ for the first standard model, and $\varvec{\beta }=u$ for the second one, the characteristic constants of $\mathbb {F}$. In the literature (e.g., [4]) the same constant $u$ is used for both standard models. This improvement for the first model was noted in [8], see also [6].

Next we ask for a rounding realizing the smallest possible constant $u$ for the second standard model (1.2). The corresponding characteristic constant is defined by

$$\begin{aligned} \mathbf{w}:= \mathbf{w}(\mathfrak {F}) = \sup \limits _{x\in {\varrho }(\mathfrak {F})} \min \limits _{f\in \mathfrak {F}\backslash \{0\}} \left| \frac{f-x}{f}\right| . \end{aligned}$$

(2.5)

For given $x\in {\varrho }(\mathfrak {F})$ we first identify $\varphi \in \mathfrak {F}$ minimizing $|\varphi -x|/|\varphi |$. If $x\in \mathfrak {F}$, then clearly it suffices to take $\varphi :=x$. Otherwise the definition (1.3) of the range implies the existence of adjacent $f,g\in \mathfrak {F}\cap {\varrho }(\mathfrak {F})$ with $f<x<g$. The function $|\varphi -x|/|\varphi |$ decreases on $\varphi \in (0,x]$ and increases on $\varphi \in [x,\infty )$, so that $\varphi \in \{f,g\}$. For increasing $y\in [f,g]$ the ratio $|f-y|/|f|$ increases and $|g-y|/|g|$ decreases. Equality is achieved for $y:=h(f,g):=2(\frac{1}{f}+\frac{1}{g})^{-1}$, the harmonic mean of $f$ and $g$, because

$$\begin{aligned} \left| \frac{f-y}{f}\right| = \frac{2fg/(f+g)-f}{f} = \frac{g-f}{g+f} = \frac{g-2fg/(f+g)}{g} = \left| \frac{g-y}{g}\right| .\qquad \end{aligned}$$

(2.6)

Thus $\varphi $ minimizing $|\varphi -x|/|\varphi |$ is $f$ for $x\in [f,h(f,g))$, it is $g$ for $x\in (h(f,g),g]$, and is either $f$ or $g$ if $x=h(f,g)$. Hence a rounding $\text{ gl }: {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}$ with the property

$$\begin{aligned} \left| \frac{\text{ gl }(x)-x}{\text{ gl }(x)}\right| = \min \limits _{f\in \mathfrak {F}} \left| \frac{f-x}{f}\right| \end{aligned}$$

is characterized by being a projection and satisfying

$$\begin{aligned} \text{ gl }(x) \; \left\{ \begin{array}{ll} = f &{} \quad \text{ for } x\in [f,h) \\ \in \{f,g\} &{} \quad \text{ for } x=h \\ = g &{} \quad \text{ for } x\in (h,g] \end{array}\right. \end{aligned}$$

(2.7)

for $x\in [f,g]$ with adjacent $f,g\in {\varrho }(\mathfrak {F})$ and $h$ denoting the harmonic mean of $f$ and $g$.

For rounding to nearest the switching point is the arithmetic mean of adjacent elements $f,g\in \mathfrak {F}$, now it is the harmonic mean. In both cases there is freedom about the tie without jeopardizing the minimization properties.

For a rounding $\text{ gl }: {\varrho }(\mathfrak {F})\rightarrow \mathfrak {F}$ with (2.7) we showed

$$\begin{aligned} \mathbf{w}= \mathbf{w}(\mathfrak {F}) = \sup \limits _{x\in \varrho (\mathfrak {F})} \left| \frac{\text{ gl }(x)-x}{\text{ gl }(x)}\right| , \end{aligned}$$

(2.8)

minimizing the error of the second standard model (1.2). Note that $\mathbf{w}$, just like $\varvec{\alpha }$, solely depends on $\mathfrak {F}$. Using the rounding $\text{ gl }: \varrho (\mathfrak {F})\rightarrow \mathfrak {F}$ with (2.7) to minimize the error in the second standard model (1.2), the maximum error for the first standard model (1.1) is given by

$$\begin{aligned} \mathbf{v}:= \mathbf{v}(\mathfrak {F}) = \sup \limits _{x\in \varrho (\mathfrak {F})} \left| \frac{\text{ gl }(x)-x}{x}\right| . \end{aligned}$$

(2.9)

As before we will show that $\mathbf{v}$ is invariant for any rounding $\text{ gl }: \mathbb {R}\rightarrow \mathfrak {F}$ with (2.7). Thus $\mathbf{v}$ and $\mathbf{w}$ are well defined and are again characteristic invariants of the set $\mathfrak {F}$. For any $\mathfrak {F}$, the invariants $\varvec{\alpha }, \varvec{\beta }, \mathbf{v}, \mathbf{w}$ are related as follows.

Lemma 2.2

Let finite $\mathfrak {F}\subseteq \mathbb {R}$ with $\varrho (\mathfrak {F})\ne \emptyset $ be given. Let $\mathbf{v}$, $\mathbf{w}$, $\varvec{\alpha }$ and $\varvec{\beta }$ as defined in (2.9), (2.5), (2.1) and (2.3), respectively. If the positive and the negative parts of $\varrho (\mathfrak {F})$ consist of at most one element, then $\mathbf{v}=\mathbf{w}=0$. Otherwise,

$$\begin{aligned} \mathbf{v}= \frac{\mathbf{w}}{1-\mathbf{w}} = \max \left\{ \left| \frac{g-f}{2\min \{|f|,|g|\}}\right| : f,g \,\mathrm{adjacent\, in }\,\varrho (\mathfrak {F}) \,\mathrm{with }\, fg>0 \right\} \qquad \quad \end{aligned}$$

(2.10)

and therefore

$$\begin{aligned} \mathbf{v}= \varvec{\beta }\quad \mathrm{and}\quad \mathbf{w}=\varvec{\alpha }<1 . \end{aligned}$$

(2.11)

Proof

If the positive and/or negative part of $\varrho (\mathfrak {F})$ is empty or consists of only one element, there is nothing to prove for that part. Let fixed but arbitrary positive adjacent elements $f,g\in \mathfrak {F}$ be given with $0<f<g$. By (2.6) the ratio $|\text{ gl }(x)-x|/|\text{ gl }(x)|$ is maximal for $x$ being the harmonic mean of $f$ and $g$. That maximum value is $(g-f)/(g+f)$, so that applying this argument to the negative part of $\varrho (\mathfrak {F})$ and using (2.4) shows $\mathbf{w}=\varvec{\alpha }<1$.

The supremum of the ratio $|\text{ gl }(x)-x|/|x|$ is also achieved for the harmonic mean $x:=2fg/(f+g)$, and using $x-f<g-x$ it is equal to

$$\begin{aligned} \frac{g-x}{x} = \frac{g}{2fg/(f+g)} - 1 = \frac{g-f}{2f} = \frac{\alpha }{1-\alpha } \end{aligned}$$

for $\alpha :=(g-f)/(g+f)$. Applying the same argument to the negative part of $\varrho (\mathfrak {F})$ and using (2.9) and (2.4) implies

$$\begin{aligned} \mathbf{v}= \sup \left\{ \left| \frac{g-f}{2\min \{|f|,|g|\}}\right| : f,g \,\text{ adjacent } \text{ in }\,\varrho (\mathfrak {F}) \,\text{ with }\, fg>0 \right\} = \frac{\varvec{\alpha }}{1-\varvec{\alpha }} . \end{aligned}$$

A computation using (2.4) yields $\varvec{\beta }= \varvec{\alpha }/(1-\varvec{\alpha })$, and the results follow. $\square $

3 Examples

For a given set $\mathfrak {F}\subseteq \mathbb {R}$, knowing one of the invariants $\mathbf{v}$, $\mathbf{w}$, $\varvec{\alpha }$ or $\varvec{\beta }$ means knowing the others. Choosing the appropriate rounding $ fl $ or $\text{ gl }$, i.e., the switching points being the arithmetic or the harmonic mean, the maximum error for both models is the same.

For any finite set $\mathfrak {F}$ one invariant can be computed using (2.4) or (2.10), then the other invariants follow. As an example consider a logarithmic number system (LNS)

$$\begin{aligned} \mathfrak {F}:= \{ c^k : k\in \mathbb {Z}, k_1\le k\le k_2\} \quad \text{ for } \quad 1<c\in \mathbb {R}. \end{aligned}$$

(3.1)

Earliest references of this well studied concept include [7, 9, 10]; see [1] and the literature cited over there for recent publications. Assume the integers $k_1,k_2$ satisfy $k_1<k_2$ to ensure that $\varrho (\mathfrak {F})$ has at least two elements. Then $\varrho (\mathfrak {F})=[c^{k_1},c^{k_2}]$, and (2.10) yields

$$\begin{aligned} \mathbf{v}= \frac{c^{k+1}-c^k}{2c^k} = \frac{c-1}{2}\quad \text{ and }\quad \mathbf{w}= \frac{\mathbf{v}}{1+\mathbf{v}} = \frac{c-1}{c+1} . \end{aligned}$$

These characteristic constants are the same for $\mathfrak {F}\cup (-\mathfrak {F})$ or for $\mathfrak {F}\cup \{0\}\cup (-\mathfrak {F})$. The elements could be stored by the integer exponent. If the product or quotient of elements in $\mathfrak {F}$ is in $\varrho (\mathfrak {F})$, then they are in $\mathfrak {F}$. In terms of floating-point operations this means that multiplication and division are error-free if no over—or underflow occurs.

Moreover, those sets bear the advantage that the maximum relative rounding error for the first and for the second standard model is attained on each interval $[f,g]$ of adjacent elements in $\varrho (\mathfrak {F})$. In contrast, there is a factor of up to $\beta $ for floating-point systems to base $\beta $, depending on whether the interval $[f,g]$ is slightly larger or smaller than a power of $\beta $.

A possible choice is $c:=2^\varepsilon $ with $\varepsilon :=2^{-52}$. Spending one bit for the sign, integers up to about ${\pm }2^{62}$ can be stored in $64$-bit, so that the positive range is about $[2^{-1024}, 2^{1024}]$, similar to IEEE 754 double precision. The relative rounding error unit is then about $0.77\times 10^{-16}$ compared to about $1.11\times 10^{-16}$ in double precision. With this choice of $c$, powers of $2$ are in $\mathfrak {F}$, however, as a disadvantage, $\mathfrak {F}$ contains no other integers.

Figure 1 shows a typical graph of rounding errors for binary floating-point and for the logarithmic number system (3.1). The left two pictures show the relative rounding error with respect to the first and second standard model for $p=4$ bits binary precision according to IEEE 754. Typically, the error decreases in steps towards the next power $2$. The third picture is for a logarithmic number system as in (3.1), where the maximal error at the switching points is constant.

Another example is level-index arithmetic [2, 3]. This interesting approach was introduced to overcome problems with over—or underflow. Basically, $\mathfrak {F}$ is defined to consist of $s\cdot \Phi (\ell +x)^{t}$ for $\ell \in \mathbb {N}_0$, $s,t\in \{-1,1\}$ and $x\in \mathcal {M}$ for some finite set $\mathcal {M}\subseteq [0,1)$, where

$$\begin{aligned} \Phi (x) = \left\{ \begin{array}{ll} x &{} \quad \text{ for } 0 \le x < 1 \\ e^{\Phi (x-1)} &{} \quad \text{ for } x \ge 1 .\end{array}\right. \end{aligned}$$

(3.2)

The integer part $\ell $ of $y:=\ell +x$ is called the level of $y$. Even if the level is limited to small integers, the range of $\mathfrak {F}$ is huge comprising of numbers very small and very large in magnitude.

This basically solves the problem of over—and underflow, however, it comes with a price concerning the maximum relative error of floating-point operations. For $\ell \in \mathbb {N}_0$, $0<x\in \mathbb {R}$ and small $\varepsilon $ we have

$$\begin{aligned} \Phi (\ell +x+\varepsilon ) = \Phi (\ell +x) \cdot \left( 1+\varepsilon \prod _{\nu =1}^{\ell -1}{\Phi (\nu +x)} \right) + \mathcal {O}(\varepsilon ^2). \end{aligned}$$

(3.3)

That means, the distance between adjacent elements in $\mathfrak {F}$ increases exponentially. Thus by (3.3), the maximum relative errors $\varvec{\alpha }$ and $\varvec{\beta }$, which are characterized by Lemma 2.1, increase exponentially with the level as well. Hence avoiding over—and underflow problems results in a maximal relative error of floating-point operations close to 1. As a consequence, at least the usual way to analyze floating-point algorithms seems hardly possible.

4 Conclusion

From the perspective of a rounding function the situation for binary64 can be summarized as follows. If the switching point for the rounding is set to the arithmetic mean of adjacent elements, that is the usual rounding to nearest, then we obtain the bound $\varvec{\beta }=2^{-53}$ for the second standard model, and $\varvec{\alpha }=2^{-53}/(1+2^{-53})$ is the optimal error bound for the first standard model.

Defining the switching points to be the harmonic mean of adjacent elements, the invariants change places: The maximum relative error with respect to $x$ (the first standard model) is $\mathbf{v}=2^{-53}$, whereas the maximum error relative to its rounded value $f$ (the second standard model) becomes $\mathbf{w}=2^{-53}\!{/}(1+2^{-53})$. Now $\mathbf{w}=\varvec{\alpha }$ becomes the optimal error bound for the second standard model.

By continuity there must be a switching point between the harmonic and the arithmetic mean for which both the maximum relative error of all $x$ with respect to both standard models is bounded by the same constant. Using the arithmetic-geometric mean inequality one verifies that for adjacent floating-point numbers $0<f<g$ the maximum relative error for both models is equal to $\sqrt{g/f}-1$ for $x$ equal to the geometric mean $\sqrt{fg}$, and that this relative error is maximized for $f=2^e$ and its successor $g=(1+2\mathbf{v})2^e$.

It follows that for the switching points $x:=\text{ sign }(f)\sqrt{|fg|}$ the relative errors of both the first and the second standard models are bounded by the same characteristic constant $\sqrt{1+2\mathbf{v}}-1$, slightly less than $\mathbf{v}$ and slightly larger than $\mathbf{v}/(1+\mathbf{v})$. The relation harmonic $\le $ geometric $\le $ arithmetic mean finds its counterpart in the corresponding maximal relative errors $\mathbf{v}/(1+\mathbf{v})\le \sqrt{1+2\mathbf{v}}-1 \le \mathbf{v}$.

Notes

For the standard formats $\mathbb {F}$ in IEEE 754 the range could be slightly wider: For $f$ denoting the rounded-to-nearest result in $\mathbb {F}$ with infinite exponent range, return this $f$ if it belongs to $\mathbb {F}$ with the bounded exponent range. Since we are aiming on general sets $\mathfrak {F}$, there is no notion of “exponent range”.

References

Arnold, M.G., Collange, S.: The denormal logarithmic number system. In: 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 117–124 (2013)
Clenshaw, C.W., Olver, F.W.J.: Beyond floating point. J. ACM 31(2), 319–328 (1984)
Article MathSciNet MATH Google Scholar
Clenshaw, C.W., Olver, F.W.J., Turner, P.R.: Level-index arithmetic: an introductory survey. Lect. Notes Math. 1397, 95–168 (1989)
Article MathSciNet MATH Google Scholar
Higham, N.J.: Accuracy and stability of numerical algorithms, 2nd edn. SIAM Publications, Philadelphia (2002)
Book MATH Google Scholar
IEEE Standard 754–2008: IEEE Standard for Floating-Point Arithmetic. IEEE Computer Society, New York (2008)
Jeannerod, C.-P., Rump, S.M.: On relative errors of floating-point operations: optimal bounds and applications. Preprint (2014)
Kingsburg, N.G., Rayner, P.J.W.: Digital filtering using logarithmic arithmetic. Electron. Lett. 7, 56–58 (1971)
Article Google Scholar
Knuth, D.E.: The art of computer programming, 3rd edn. In: Seminumerical Algorithms, vol. 2. Addison-Wesley, Reading, Massachusetts (1998)
Lee, S.C., Edgar, A.D.: The focus number system. IEEE Trans. Comput. C–26, 1167–1170 (1977)
Article Google Scholar
Swartzlander Jr, E.E., Alexopoulos, A.G.: The sign/logarithm number system. IEEE Trans. Comput. C–24, 1238–1243 (1975)
Article MathSciNet MATH Google Scholar

Download references

Acknowledgments

Our dearest thanks go to Claude-Pierre Jeannerod from Lyon for his many detailed comments and for very helpful discussions and suggestions. Moreover, many thanks to the anonymous referees for their valuable and constructive comments.

Author information

Authors and Affiliations

Institute for Reliable Computing, Hamburg University of Technology, Schwarzenbergstraße 95, 21071, Hamburg, Germany
Siegfried M. Rump & Marko Lange
Faculty of Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo, 169-8555, Japan
Siegfried M. Rump

Authors

Siegfried M. Rump
View author publications
You can also search for this author in PubMed Google Scholar
Marko Lange
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Siegfried M. Rump.

Additional information

Communicated by Axel Ruhe.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Rump, S.M., Lange, M. On the definition of unit roundoff. Bit Numer Math 56, 309–317 (2016). https://doi.org/10.1007/s10543-015-0554-0

Download citation

Received: 25 August 2014
Accepted: 02 March 2015
Published: 17 March 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s10543-015-0554-0

Keywords

Mathematics Subject Classification

65G50

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

On the definition of unit roundoff

Abstract

Similar content being viewed by others

On the maximum relative error when computing integer powers by iterated multiplications in floating-point arithmetic

Error estimates for the summation of real numbers with application to floating-point summation

A note on Dekker’s FastTwoSum algorithm

1 Introduction

2 Main results

Lemma 2.1

Proof

Lemma 2.2

Proof

3 Examples

4 Conclusion

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

On the definition of unit roundoff

Abstract

Similar content being viewed by others

On the maximum relative error when computing integer powers by iterated multiplications in floating-point arithmetic

Error estimates for the summation of real numbers with application to floating-point summation

A note on Dekker’s FastTwoSum algorithm

1 Introduction

2 Main results

Lemma 2.1

Proof

Lemma 2.2

Proof

3 Examples

4 Conclusion

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation