Aims

The aims of this chapter are to look in more depth at arithmetic, and in particular at the support that Fortran provides for IEEE 754 and the standards that followed it. The chapter covers:

  • hardware support for arithmetic.

  • integer formats.

  • floating point formats: single and double.

  • special values: denormal, infinity and not a number — nan.

  • exceptions and flags: divide by zero, inexact, invalid, overflow, underflow.

36.1 Introduction

The literature contains details of the IEEE arithmetic standards. The bibliography contains details of a number of printed and on-line sources.

36.2 History

When we use programming languages to do arithmetic, two major concerns are the reliability and the portability of the numerical software we develop. Arithmetic is done in hardware and there are a number of things to consider:

  • the range of hardware available both now and in the past.

  • the evolution of hardware.

There has been a very considerable change in arithmetic units since the first computers. Table 36.1 is a list of hardware and computing systems that the authors have used or have heard of. It is not exhaustive or definitive, but rather reflects the authors’ age and experience.

Table 36.1 Computer hardware and manufacturers

Table 36.2 lists some of the operating systems.

Table 36.2 Operating systems

Again the list is not exhaustive or definitive. The intention is simply to provide some idea of the wide range of hardware, computer manufacturers and operating systems that have been around in the past 50 years.

To cope with the anarchy in this area Doctor Robert Stewart (acting on behalf of the IEEE) convened a meeting which led to the birth of IEEE 754.

The first draft, which was prepared by William Kahan, Jerome Coonen and Harold Stone, was called the KCS draft and eventually adopted as IEEE 754. A fascinating account of the development of this standard can be found in An Interview with the Old Man of Floating Point, and the bibliography provides a web address for this interview. Kahan went on to get the ACM Turing Award in 1989 for his work in this area.

IEEE 754 has become a de facto standard for the arithmetic units in modern hardware. Note that conforming to the standard does not make it possible to describe precisely the answers a program will give, and the authors of the standard knew this. That goal is virtually impossible to achieve for floating point arithmetic. Reasons for this include:

  • the conversions of numbers between decimal and binary formats.

  • the use of elementary library functions.

  • results of calculations may be in hardware inaccessible to the programmer.

  • intermediate results in subexpressions or arguments to procedures.

The bibliography contains details of a paper that addresses this issue in much greater depth — Differences Among IEEE 754 Implementations.

Fortran is one of a small number of languages that provide access to IEEE arithmetic, and it achieves this via TR 15580, which is an integral part of Fortran 2003. The C standard (C9X) addresses this issue and Java offers limited IEEE arithmetic support. More information can be found in the references at the end of the chapter.

36.3 IEEE Specifications

There have been several IEEE arithmetic standards. The following information is taken from the ISO site.

The url is

figure a

ISO/IEC/IEEE 60559:2011(E) specifies formats and methods for floating-point arithmetic in computer systems - standard and extended functions with single, double, extended, and extendable precision and recommends formats for data interchange. Exception conditions are defined and standard handling of these conditions is specified. It provides a method for computation with floating-point numbers that will yield the same result whether the processing is done in hardware, software, or a combination of the two. The results of the computation will be identical, independent of implementation, given the same input data. Errors, and error conditions, in the mathematical processing will be reported in a consistent manner regardless of implementation. This first edition, published as ISO/IEC/IEEE 60559, replaces the second edition of IEC 60559.

Here is the standard history.

  • ISO/IEC/IEEE 60559:2011(E)

  • IEC 559:1989

  • IEC 559:1982

The standard provides coverage of the following areas, which is taken from the table of contents.

  • Floating-point formats

    • Overview

    • Specification levels

    • Sets of floating-point data

    • Binary interchange format encodings

    • Decimal interchange format encodings

    • Interchange format parameters

    • Extended and extendable precisions

  • Attributes and rounding

    • Attribute specification

    • Dynamic modes for attributes

    • Rounding-direction attributes

  • Operations

    • Overview

    • Decimal exponent calculation

    • Homogeneous general-computational operations

    • Format of general-computational operations

    • Quiet-computational operations

    • Signaling-computational operations

    • Non-computational operations

    • Details of conversions from floating-point to integer formats

    • Details of operations to round a floating-point datum to integral value

    • Details of totalorder predicate

    • Details of comparison predicates

    • Details of conversion between floating-point data and external character sequences

  • Infinity, NaNs, and sign bit

    • Infinity arithmetic

    • Operations with NaNs

    • The sign bit

  • Default exception handling

    • Overview: exceptions and flags

    • Invalid operation

    • Division by zero

    • Overflow

    • Underflow

    • Inexact

  • Alternate exception handling attributes

    • Overview

    • Resuming alternate exception handling attributes

    • Immediate and delayed alternate exception handling attributes

  • Recommended operations

    • Conforming language- and implementation-defined functions

    • Recommended correctly rounded functions

    • Operations on dynamic modes for attributes

    • Reduction operations

  • Expression evaluation

    • Expression evaluation rules

    • Assignments, parameters, and function values

    • Preferred width attributes for expression evaluation

    • Literal meaning and value-changing optimizations

  • Reproducible floating-point results

36.4 Floating Point Formats

Table 36.3 summarises the formats specified in the IEEE 754-2008 standard.

Table 36.3 IEEE formats

36.5 Procedure Summary

Tables 36.4 and 36.5 summarise the procedures.

Table 36.4 IEEE Arithmetic module procedure summary
Table 36.5 IEEE Exceptions module procedure summary

36.6 General Comments About the Standard

The special bit patterns provide the following:

  • \( +0 \)

  • \( -0 \)

  • subnormal numbers, which for 32-bit single precision lie in the range 1.17549421E-38 to 1.40129846E-45

  • \( + \infty \)

  • \( - \infty \)

  • quiet NaN (Not a Number)

  • signalling NaN

One of the first systems that the authors worked with that had special bit patterns set aside was the CDC 6000 range of computers that had negative indefinite and infinity. Thus the ideas are not new, as this was in the late 1970s.

The support of positive and negative zero means that certain problems can be handled correctly including:

  • The evaluation of the log function which has a discontinuity at zero.

  • The equation \( \sqrt{1/z} = 1/\sqrt{z} \) can be solved when \( z = -1 \)

See also the Kahan paper Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit, for more details.
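The effect of the sign of zero can be seen with a small program. The following is a minimal sketch, not taken from the sources above. It assumes that the processor supports IEEE division and infinities for default real, and that halting on divide by zero can be switched off. The program and variable names are our own.

```fortran
program signed_zero
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: pz, nz, x, y

  if (.not. (ieee_support_inf(pz) .and. ieee_support_divide(pz))) then
    print *, ' IEEE infinities or division not supported for default real'
    stop
  end if
  ! Ask for a flag rather than a halt on divide by zero, where allowed
  if (ieee_support_halting(ieee_divide_by_zero)) then
    call ieee_set_halting_mode(ieee_divide_by_zero, .false.)
  end if

  pz = 0.0
  nz = ieee_copy_sign(0.0, -1.0)   ! zero with the sign bit set
  x = 1.0/pz                       ! division by +0
  y = 1.0/nz                       ! division by -0

  print *, ' +0 == -0      : ', pz == nz
  print *, ' 1/(+0)        : ', x
  print *, ' 1/(-0)        : ', y
  print *, ' -0 classified : ', ieee_class(nz) == ieee_negative_zero
end program signed_zero
```

The two zeros compare equal, but dividing by them gives infinities of opposite sign, which is how the sign of a result that has underflowed to zero can be carried forward.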

Subnormals, which permit gradual underflow, fill the gap between 0 and the smallest normal number.

Simply stated, underflow occurs when the result of an arithmetic operation is so small that it is subject to a larger than normal rounding error when stored. The existence of subnormals means that greater precision is available for these small results than if they were flushed to zero. The key features of gradual underflow are:

  • When underflow does occur there should never be a loss of accuracy any greater than that from ordinary roundoff.

  • When the result underflows, the operations of addition, subtraction, comparison and remainder are always exact.

  • Algorithms written to take advantage of subnormal numbers have smaller error bounds than other systems.

  • If x and y are within a factor of 2 of each other then x-y is error free, which is used in a number of algorithms that increase the precision at critical regions.
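Gradual underflow can be illustrated with two small normal numbers whose difference is subnormal. The following is a minimal sketch, assuming that subnormals are supported for default real. The names are our own, and some compilers flush subnormals to zero at certain optimisation levels, in which case the results will differ.

```fortran
program subnormal_gap
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: smallest_normal, x, y, d

  if (.not. ieee_support_denormal(smallest_normal)) then
    print *, ' subnormals not supported for default real'
    stop
  end if

  smallest_normal = tiny(smallest_normal)
  x = 1.5*smallest_normal
  y = smallest_normal
  d = x - y                 ! half of tiny: below the normal range

  print *, ' smallest normal : ', smallest_normal
  print *, ' x - y           : ', d
  print *, ' x /= y          : ', x /= y
  print *, ' x - y /= 0      : ', d /= 0.0
  print *, ' x - y subnormal : ', ieee_class(d) == ieee_positive_denormal
end program subnormal_gap
```

With gradual underflow x /= y implies x - y /= 0, so a later division by x - y is safe whenever the comparison says the two values differ.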

The combination of positive and negative zero and subnormal numbers means that when x and y are small and x-y has been flushed to zero the evaluation of \( 1 / (x-y) \) can be flagged and located.

Certain arithmetic operations cause problems including:

  • \( 0 * \infty \)

  • 0 / 0

  • \( \sqrt{x} \) when \( x < 0 \)

and the support for NaN handles these cases.
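The three cases above can be tried directly. The following is a minimal sketch, assuming that the processor supports NaNs, infinities and IEEE square root for default real, and that halting on an invalid operation can be switched off. The names are our own. Variables rather than constants are used because some compilers reject these operations when they appear in constant expressions.

```fortran
program nan_sources
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: zero, minus_one, inf, a, b, c

  if (.not. (ieee_support_nan(zero) .and. ieee_support_inf(zero) &
    .and. ieee_support_sqrt(zero))) then
    print *, ' NaN, infinity or IEEE sqrt not supported for default real'
    stop
  end if
  ! Invalid operations should set a flag rather than halt the program
  if (ieee_support_halting(ieee_invalid)) then
    call ieee_set_halting_mode(ieee_invalid, .false.)
  end if

  zero = 0.0
  minus_one = -1.0
  inf = ieee_value(inf, ieee_positive_inf)

  a = zero*inf
  b = zero/zero
  c = sqrt(minus_one)

  print *, ' 0*inf    is NaN : ', ieee_is_nan(a)
  print *, ' 0/0      is NaN : ', ieee_is_nan(b)
  print *, ' sqrt(-1) is NaN : ', ieee_is_nan(c)
end program nan_sources
```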

The support for positive and negative infinity allows the handling of x / 0 when x is nonzero and of either sign. This means that we can write our programs to detect the result and take the appropriate action. In some cases this would mean recalculating using another approach.

For more information see the references in the bibliography.

36.7 Resume

The above has provided a quick tour of the IEEE standard. We’ll now look at what Fortran has to offer to support it.

36.8 Fortran Support for IEEE Arithmetic

Fortran first introduced support for IEEE arithmetic in ISO TR 15580. The Fortran 2003 standard integrated support into the main standard. Fortran 2018 offers more support, and for more details one should consult Chap. 17 of that document.

The intrinsic modules

  • ieee_features

  • ieee_exceptions

  • ieee_arithmetic

provide support for exceptions and IEEE arithmetic. Whether the modules are provided is processor dependent. If the module ieee_features is provided, which of the named constants defined in this standard are included is processor dependent. The module ieee_arithmetic behaves as if it contained a use statement for ieee_exceptions; everything that is public in ieee_exceptions is public in ieee_arithmetic.

The first thing to consider is the degree of conformance to the IEEE standard, as it is possible that not all of the features are supported. It is therefore worth running one or more test programs to determine the degree of support on a particular system.

36.9 Derived Types and Constants Defined in the Modules

The modules

  • ieee_exceptions

  • ieee_arithmetic

  • ieee_features

define five derived types, whose components are all private.

36.9.1 ieee_exceptions

This module defines ieee_flag_type, for identifying a particular exception flag.

Possible values are

figure b

The module also defines the array named constants

figure c
figure d
figure e

The last is for saving the current floating point status.

36.9.2 ieee_arithmetic

This module defines ieee_class_type, for identifying a class of floating-point values.

Possible values are:

figure f

The module defines ieee_round_type, for identifying a particular rounding mode. Its only possible values are those of named constants defined in the module: ieee_nearest, ieee_to_zero, ieee_up, and ieee_down for the IEEE modes; and ieee_other for any other mode.

The elemental operator == is defined for two values of one of these types and returns true if the values are the same and false otherwise.

The elemental operator /= is defined for two values of one of these types and returns true if the values differ and false otherwise.
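These operators are the normal way of testing a class or a rounding mode, as the components of the types are private. The following is a minimal sketch, assuming that default real is an IEEE data type on the processor. The program and variable names are our own.

```fortran
program class_and_rounding
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: x
  type (ieee_class_type) :: c
  type (ieee_round_type) :: r

  x = -2.5
  c = ieee_class(x)
  print *, ' -2.5 is negative normal : ', c == ieee_negative_normal
  print *, ' -2.5 is not a zero      : ', c /= ieee_negative_zero

  ! The rounding mode can only be queried where rounding control exists
  if (ieee_support_rounding(ieee_nearest, x)) then
    call ieee_get_rounding_mode(r)
    print *, ' rounding is nearest     : ', r == ieee_nearest
  end if
end program class_and_rounding
```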

36.9.3 ieee_features

This module defines ieee_features_type, for expressing the need for particular IEEE features. Its only possible values are those of named constants defined in the module:

  • ieee_datatype

  • ieee_denormal

  • ieee_divide

  • ieee_halting

  • ieee_inexact_flag

  • ieee_inf

  • ieee_invalid_flag

  • ieee_nan

  • ieee_rounding

  • ieee_sqrt

  • ieee_underflow_flag

36.9.4 Further Information

There are a number of additional sources of information.

  • the Fortran standard.

  • documentation that comes with your compiler.

The latter has the benefit of describing what is supported in that compiler.

36.10 Example 1: Testing IEEE Support

The first example tests basic IEEE arithmetic support.

Here is a program to illustrate the above.

figure g
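A test of this kind can be written with the inquiry functions in ieee_arithmetic. The following is a minimal sketch, assuming that the processor provides the module and has real kinds matching the two selected_real_kind requests. The program name and the kind names sp and dp are our own.

```fortran
program ieee_support_test
  use, intrinsic :: ieee_arithmetic
  implicit none
  integer, parameter :: sp = selected_real_kind(6, 37)
  integer, parameter :: dp = selected_real_kind(15, 307)
  real (sp) :: x
  real (dp) :: y

  ! Inquiry functions only look at the kind of their argument,
  ! so x and y do not need to be given values
  print *, ' datatype supported, sp : ', ieee_support_datatype(x)
  print *, ' datatype supported, dp : ', ieee_support_datatype(y)
  print *, ' infinity supported, sp : ', ieee_support_inf(x)
  print *, ' infinity supported, dp : ', ieee_support_inf(y)
  print *, ' NaN supported,      sp : ', ieee_support_nan(x)
  print *, ' NaN supported,      dp : ', ieee_support_nan(y)
  print *, ' full standard,      sp : ', ieee_support_standard(x)
  print *, ' full standard,      dp : ', ieee_support_standard(y)
end program ieee_support_test
```

A further kind can be tested in the same way, provided the compiler actually offers it.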

Table 36.6 summarises the support for a number of compilers.

Table 36.6 Compiler IEEE support for various precisions

36.11 Example 2: Testing What Flags Are Supported

Here is a program to illustrate the above.

figure h

Here is the output from the Intel compiler.

figure i
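The array named constant ieee_all makes it easy to loop over the five flags. The following is a minimal sketch, assuming the order of the elements of ieee_all given in the Fortran standard: overflow, divide by zero, invalid, underflow, inexact. The program name and the name array are our own.

```fortran
program ieee_flag_support
  use, intrinsic :: ieee_exceptions
  implicit none
  integer :: i
  ! Labels in the same order as the elements of ieee_all
  character (len=14), dimension (5) :: name = &
    [ character (len=14) :: 'overflow', 'divide by zero', 'invalid', &
      'underflow', 'inexact' ]

  do i = 1, size(ieee_all)
    print *, name(i), ' flag    : ', ieee_support_flag(ieee_all(i))
    print *, name(i), ' halting : ', ieee_support_halting(ieee_all(i))
  end do
end program ieee_flag_support
```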

36.12 Example 3: Overflow

Here is a program to illustrate the above.

figure j
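Overflow can be provoked by doubling the largest finite value. The following is a minimal sketch, assuming that the overflow flag is supported for default real and that halting on overflow can be switched off. The names are our own. The volatile attribute is our own precaution against the multiplication being evaluated at compile time.

```fortran
program overflow_flag
  use, intrinsic :: ieee_arithmetic
  implicit none
  real, volatile :: x
  logical :: raised

  if (.not. ieee_support_flag(ieee_overflow, x)) then
    print *, ' overflow flag not supported for default real'
    stop
  end if
  if (ieee_support_halting(ieee_overflow)) then
    call ieee_set_halting_mode(ieee_overflow, .false.)
  end if

  call ieee_set_flag(ieee_overflow, .false.)   ! start with the flag quiet
  x = huge(x)
  x = x*2.0                                    ! beyond the largest finite value
  call ieee_get_flag(ieee_overflow, raised)

  print *, ' x               : ', x
  print *, ' overflow raised : ', raised
end program overflow_flag
```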

36.13 Example 4: Underflow

Here is a program to illustrate the above.

figure k
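Underflow can be provoked by dividing the smallest normal number by three, as the result is both below the normal range and inexact. The following is a minimal sketch, assuming that the underflow flag is supported for default real and that halting on underflow can be switched off. The names and the use of volatile are our own.

```fortran
program underflow_flag
  use, intrinsic :: ieee_arithmetic
  implicit none
  real, volatile :: x
  logical :: raised

  if (.not. ieee_support_flag(ieee_underflow, x)) then
    print *, ' underflow flag not supported for default real'
    stop
  end if
  if (ieee_support_halting(ieee_underflow)) then
    call ieee_set_halting_mode(ieee_underflow, .false.)
  end if

  call ieee_set_flag(ieee_underflow, .false.)  ! start with the flag quiet
  x = tiny(x)
  x = x/3.0                                    ! below the smallest normal number
  call ieee_get_flag(ieee_underflow, raised)

  print *, ' x                : ', x
  print *, ' underflow raised : ', raised
end program underflow_flag
```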

36.14 Example 5: Inexact Summation

Here is a program to illustrate the above.

figure l

Here is the output from several compilers.

figure m
figure n
figure o
figure p

What do you notice about the value of the computed sum?
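The same effect can be reproduced by adding 0.1, which has no exact binary representation, a large number of times. The following is a minimal sketch, assuming that the inexact flag is supported for default real. The names, the count of 1000 and the use of volatile are our own choices.

```fortran
program inexact_sum
  use, intrinsic :: ieee_arithmetic
  implicit none
  integer :: i
  real :: step
  real, volatile :: total
  logical :: raised

  if (.not. ieee_support_flag(ieee_inexact, total)) then
    print *, ' inexact flag not supported for default real'
    stop
  end if
  if (ieee_support_halting(ieee_inexact)) then
    call ieee_set_halting_mode(ieee_inexact, .false.)
  end if

  step = 0.1
  total = 0.0
  call ieee_set_flag(ieee_inexact, .false.)    ! start with the flag quiet
  do i = 1, 1000
    total = total + step                       ! each addition may round
  end do
  call ieee_get_flag(ieee_inexact, raised)

  print *, ' computed sum    : ', total
  print *, ' sum - 100       : ', total - 100.0
  print *, ' inexact raised  : ', raised
end program inexact_sum
```

The flag only says that at least one operation rounded. It does not say how large the accumulated error is, which is why the difference from 100 is printed as well.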

36.15 Example 6: NAN and Other Specials

Here is a program to illustrate some additional IEEE functionality.

figure q
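Special values can be generated with ieee_value and examined with the inquiry and classification functions. The following is a minimal sketch, assuming that infinities and NaNs are supported for default real. The program and variable names are our own.

```fortran
program special_values
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: x, pinf, ninf, qnan, above_one

  if (.not. (ieee_support_inf(x) .and. ieee_support_nan(x))) then
    print *, ' infinities or NaNs not supported for default real'
    stop
  end if

  pinf = ieee_value(x, ieee_positive_inf)
  ninf = ieee_value(x, ieee_negative_inf)
  qnan = ieee_value(x, ieee_quiet_nan)
  above_one = ieee_next_after(1.0, pinf)   ! next representable value above 1

  print *, ' +inf is finite      : ', ieee_is_finite(pinf)
  print *, ' -inf is negative    : ', ieee_is_negative(ninf)
  print *, ' NaN is NaN          : ', ieee_is_nan(qnan)
  print *, ' NaN, 1.0 unordered  : ', ieee_unordered(qnan, 1.0)
  print *, ' 1.0 is normal       : ', ieee_is_normal(1.0)
  print *, ' next after 1.0 - 1  : ', above_one - 1.0
  print *, ' epsilon             : ', epsilon(x)
end program special_values
```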

36.16 Summary

Compiler support in this area is now quite widespread as the above examples have shown.

36.17 Bibliography

Hauser J.R., Handling Floating Point Exceptions in Numeric Programs, ACM Transactions on Programming Languages and Systems, Vol. 18, No. 2, March 1996, pp. 139–174.

  • The paper looks at a number of techniques for handling floating point exceptions in numeric code. One of the conclusions is a call for better structured support for floating point exception handling in new programming languages, or of course better standards for existing languages.

IEEE, IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, Institute of Electrical and Electronics Engineers Inc.

  • The formal definition of IEEE 754. This is available for purchase as both a pdf and printed version - see the address below.

figure r

This standard specifies formats and methods for floating-point arithmetic in computer systems: standard and extended functions with single, double, extended, and extendable precision, and recommends formats for data interchange. Exception conditions are defined and standard handling of these conditions is specified. Keywords: 754-2008, arithmetic, binary, computer, decimal, exponent, floating-point, format, interchange, NaN, number, rounding, significand, subnormal. Product Code(s): STDPD95802,STD95802

Knuth D., Seminumerical Algorithms, Addison-Wesley, 1969.

  • There is a coverage of floating point arithmetic, multiple precision arithmetic, radix conversion and rational arithmetic.

Sun, Numerical Computation Guide, SunPro.

  • Very good coverage of the numeric formats for IEEE Standard 754 for Binary Floating-Point Arithmetic. All SunPro compiler products support the features of the IEEE 754 standard.

36.17.1 Web-Based Sources

  • Differences Among IEEE 754 Implementations. The material in this paper will eventually be included in the Sun Numerical Computation Guide as an addendum to Appendix C, David Goldberg’s What Every Computer Scientist Should Know about Floating Point Arithmetic.

figure s
  • The Numerical Computation Guide can be browsed on-line or downloaded as a pdf file. The last time we checked it was 294 pages. Good source of information if you have Sun equipment.

figure t
  • The Explosion of the Ariane 5: A 64-bit floating point number relating to the horizontal velocity of the rocket with respect to the platform was converted to a 16-bit signed integer. The number was larger than 32,767, the largest integer storeable in a 16-bit signed integer, and thus the conversion failed.

36.17.2 Hardware Sources

AMD - Visit

figure u

for details of the AMD manuals. The following five manuals are available for download as PDFs from the above site.

  • AMD64 Architecture Programmer’s Manual Volume 1: Application Programming

  • AMD64 Architecture Programmer’s Manual Volume 2: System Programming

  • AMD64 Architecture Programmer’s Manual Volume 3: General Purpose and System Instructions

  • AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit and 256-Bit Media Instructions

  • AMD64 Architecture Programmer’s Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions

Intel - Visit

figure v

for a list of manuals. The following three manuals are available for download as PDFs from the above site.

  • Intel 64 and IA-32 Architectures Software Developer’s Manual. Volume 1: Basic Architecture

  • Intel 64 and IA-32 Architectures Software Developer’s Manual. Combined Volumes 2A and 2B: Instruction Set Reference, A-Z.

  • Intel 64 and IA-32 Architectures Software Developer’s Manual. Combined Volumes 3A and 3B: System Programming Guide, Parts 1 and 2

Osborne A., Kane G., 4-bit and 8-bit Microprocessor Handbook, Osborne/McGraw-Hill, 1981.

  • Good source of information on 4-bit and 8-bit microprocessors.

Osborne A., Kane G., 16-Bit Microprocessor Handbook, Osborne/McGraw-Hill, 1981.

  • Ditto 16-bit microprocessors.

Bhandarkar D.P., Alpha Implementations and Architecture: Complete Reference and Guide, Digital Press, 1996.

  • Looks at some of the trade-offs and design philosophy behind the Alpha chip. The author worked with VAX, MicroVAX and VAX vectors as well as the Prism. Also looks at the GEM compiler technology that DEC/Compaq use.

Various companies' home pages.

figure w

36.17.3 Operating Systems

Deitel H.M., An Introduction to Operating Systems, Addison-Wesley, 1990.

  • The revised first edition includes case studies of UNIX, VMS, CP/M, MVS and VM. The second edition adds OS/2 and the Macintosh operating systems. There is coverage of hardware, software, firmware, process management, process concepts, asynchronous concurrent processes, concurrent programming, deadlock and indefinite postponement, storage management, real storage, virtual storage, processor management, distributed computing, disk performance optimisation, file and database systems, performance, coprocessors, RISC, data flow, analytic modelling, networks and security, and it concludes with case studies of these operating systems. The book is well written and an easy read.

36.18 Problem

36.1

Compile and run each of the examples in this chapter with your compiler(s). If you have access to more than one compiler do the compilers behave in the same way?