IEEE 754-1985

The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely-used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including negative zero and denormal numbers) and special values (infinities and NaNs) together with a set of "floating-point operations" that operate on these values. It also specifies four rounding modes and five exceptions (including when the exceptions occur, and what happens when they do occur).

Summary

IEEE 754 specifies four formats for representing floating-point values: single precision (32-bit), double precision (64-bit), single-extended precision (≥ 43 bits, not commonly used) and double-extended precision (≥ 79 bits, usually implemented with 80 bits). Only the 32-bit format is required by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic be implemented, although sometimes this is optional. For example, the C programming language, which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (C's float type is typically used for IEEE single precision and double for IEEE double precision).
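Where a C implementation does follow the standard, the mapping of float and double onto the IEEE formats can be checked directly; the following is a minimal sketch, relying only on the C99 macro __STDC_IEC_559__ (defined by implementations that claim IEC 60559 conformance) and the <float.h> limits:

 #include <stdio.h>
 #include <float.h>

 int main(void)
 {
 #ifdef __STDC_IEC_559__
     puts("This implementation advertises IEC 60559 (IEEE 754) arithmetic.");
 #endif
     /* For the IEEE formats these print 24 and 53 significand bits. */
     printf("float:  %d significand bits, %zu bytes\n", FLT_MANT_DIG, sizeof(float));
     printf("double: %d significand bits, %zu bytes\n", DBL_MANT_DIG, sizeof(double));
     return 0;
 }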

The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), and it is also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally the reference number was IEC 559:1989). [IEC 60559:1989, Binary Floating-Point Arithmetic for Microprocessor Systems (previously designated IEC 559:1989); see http://www.opengroup.org/onlinepubs/009695399/frontmatter/refdocs.html] It was later complemented by IEEE 854-1987, which covers radix-independent floating point for radices 2 and 10. In June 2008, a major revision of IEEE 754 and IEEE 854 was approved by the IEEE; see IEEE 754r.

Structure of a floating-point number

Following is a description of the standard's format for floating-point numbers.

Bit conventions used in this article

Bits within a word of width W are indexed by integers in the range 0 to W−1 inclusive. The bit with index 0 is drawn on the right. Unless otherwise stated, the lowest indexed bit is the lsb (Least Significant Bit, the one that if changed would cause the smallest variation of the represented value).

General layout

Binary floating-point numbers are stored in sign-magnitude form, where the most significant bit is the sign bit, "exponent" is the biased exponent, and "fraction" is the significand without its most significant bit. For single precision the sign occupies 1 bit (bit 31), the exponent 8 bits (bits 30–23), and the fraction 23 bits (bits 22–0).
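The three fields can be extracted by viewing the bit pattern as an unsigned integer and masking; the following is a minimal C sketch, assuming that float is the IEEE 754 single-precision format (true on most platforms, though not guaranteed by C):

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>

 /* Split a single-precision value into its sign, exponent and fraction fields. */
 static void split_float(float f)
 {
     uint32_t bits;
     memcpy(&bits, &f, sizeof bits);            /* reinterpret the bit pattern */
     unsigned sign     = (bits >> 31) & 0x1;    /* bit 31                      */
     unsigned exponent = (bits >> 23) & 0xFF;   /* bits 30..23, biased by 127  */
     unsigned fraction = bits & 0x7FFFFF;       /* bits 22..0                  */
     printf("%g -> sign=%u exponent=%u fraction=0x%06X\n",
            (double)f, sign, exponent, fraction);
 }

 int main(void)
 {
     split_float(1.0f);   /* sign=0 exponent=127 fraction=0x000000 */
     split_float(-2.0f);  /* sign=1 exponent=128 fraction=0x000000 */
     return 0;
 }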

Exponent biasing

The exponent is biased by (2^{e-1})-1, where e is the number of bits used for the exponent field (e.g. if e=8, then (2^{8-1})-1 = 128 - 1 = 127 ). See also Excess-"N". Biasing is done because exponents have to be signed values in order to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored by adjusting its value to put it within an unsigned range suitable for comparison.

For example, to represent a number whose exponent is 17, using an exponent field 8 bits wide:
exponent field = 17 + (2^{8-1}) - 1 = 17 + 128 - 1 = 144.
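The same computation in a minimal C sketch (e and the exponent are the values from the example above):

 #include <stdio.h>

 int main(void)
 {
     int e = 8;                        /* width of the exponent field in bits      */
     int bias = (1 << (e - 1)) - 1;    /* (2^(e-1)) - 1 = 127 for single precision */
     int exponent = 17;
     printf("bias = %d, stored field = %d\n", bias, exponent + bias);  /* 127, 144 */
     return 0;
 }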

Cases

The most significant bit of the significand (not stored) is determined by the value of "exponent". If 0 < "exponent" < 2^{e} - 1, the most significant bit of the "significand" is 1, and the number is said to be "normalized". If "exponent" is 0, the most significant bit of the "significand" is 0 and the number is said to be "de-normalized". Three special cases arise:
# if "exponent" is 0 and "fraction" is 0, the number is ±0 (depending on the sign bit)
# if "exponent" = 2^{e} - 1 and "fraction" is 0, the number is ±infinity (again depending on the sign bit), and
# if "exponent" = 2^{e} - 1 and "fraction" is not 0, the number being represented is not a number (NaN).

This can be summarized as:

 exponent          fraction    value represented
 0                 0           ±0
 0                 non-zero    denormalized number
 1 to 2^{e} - 2    any         normalized number
 2^{e} - 1         0           ±infinity
 2^{e} - 1         non-zero    NaN (not a number)

As an example, 16,777,217 (2^{24} + 1) cannot be encoded as a 32-bit float: it is rounded to 16,777,216. This kind of silent rounding is one reason floating-point arithmetic is often considered unsuitable for accounting software. However, every integer with magnitude up to 2^{24}, as well as every power of 2 within the representable range, can be stored in a 32-bit float without rounding.
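A minimal C sketch of the rounding of 16,777,217, assuming float is the IEEE single-precision format:

 #include <stdio.h>

 int main(void)
 {
     /* 2^24 + 1 needs 25 significant bits, but single precision has only 24
        (1 implicit + 23 stored), so the value is rounded to the nearest even. */
     float f = 16777217.0f;
     printf("%.1f\n", f);                         /* prints 16777216.0          */
     printf("%d\n", 16777217.0f == 16777216.0f);  /* prints 1: the values agree */
     return 0;
 }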

A more complex example

The decimal number −118.625 is encoded using the IEEE 754 system as follows:

# The sign, the exponent, and the fraction are extracted from the original number. First, because the number is negative, the sign bit is "1".
# Next, the number (without the sign; i.e., unsigned, no two's complement) is converted to binary notation, giving 1110110.101. The 101 after the binary point has the value 0.625 because it is the sum of:
## 2^{-1} × 1, from the first digit after the binary point
## 2^{-2} × 0, from the second digit
## 2^{-3} × 1, from the third digit.
# That binary number is then "normalized"; that is, the binary point is moved left, leaving only a 1 to its left. The number of places it is moved gives the (power of two) exponent: 1110110.101 becomes 1.110110101 × 2^{6}. After this process, the first binary digit is always a 1, so it need not be included in the encoding. The rest is the part to the right of the binary point, which is then padded with zeros on the right to make 23 bits in all; this becomes the significand field of the encoding: 11011010100000000000000.
# The exponent is 6. This is encoded by converting it to binary and biasing it (so the most negative encodable exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is +127 and so 6 + 127 = 133. In binary, this is encoded as 10000101. (The resulting bit pattern is checked in the sketch below.)
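Putting the three fields together gives the 32-bit pattern 1 10000101 11011010100000000000000, i.e. 0xC2ED4000. A minimal C check, again assuming float is the IEEE single-precision format:

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>

 int main(void)
 {
     float x = -118.625f;
     uint32_t bits;
     memcpy(&bits, &x, sizeof bits);   /* view the encoding as raw bits */
     /* Expected: 0xC2ED4000 = 1 | 10000101 | 11011010100000000000000 */
     printf("0x%08X\n", (unsigned)bits);
     return 0;
 }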

Double-precision 64 bit

Double precision is essentially the same except that the fields are wider: the sign occupies 1 bit (bit 63), the exponent 11 bits (bits 62–52), and the fraction 52 bits (bits 51–0).

The fraction part is much larger, while the exponent is only slightly larger. NaNs and infinities are represented with the exponent field all 1s (2047): if the fraction is all zeros the value is an infinity, otherwise it is a NaN.

For normalized numbers the exponent bias is +1023 (so the true exponent is the stored exponent minus 1023). For denormalized numbers the exponent is −1022 (the minimum exponent for a normalized number; it is not −1023 because normalized numbers have a leading 1 digit before the binary point and denormalized numbers do not). As before, both infinity and zero are signed.

Notes:
# The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the exponent field and the binary value 1 in the fraction field) are
#: ±2^{-1074} ≈ ±5 × 10^{-324}
# The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the exponent field and 0 in the fraction field) are
#: ±2^{-1022} ≈ ±2.2250738585072014 × 10^{-308}
# The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the exponent field and all 1s in the fraction field) are
#: ±(1 - 2^{-53}) × 2^{1024} ≈ ±1.7976931348623157 × 10^{308}
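These three limits can be printed directly with C99 facilities; a minimal sketch, assuming double is the IEEE 754 64-bit format (link with -lm on POSIX systems):

 #include <stdio.h>
 #include <float.h>
 #include <math.h>

 int main(void)
 {
     printf("smallest denormal: %g\n", nextafter(0.0, 1.0)); /* ~4.94e-324  = 2^-1074  */
     printf("smallest normal:   %g\n", DBL_MIN);             /* ~2.2251e-308 = 2^-1022 */
     printf("largest finite:    %g\n", DBL_MAX);             /* ~1.7977e+308           */
     return 0;
 }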

Comparing floating-point numbers

Every possible bit combination is either a NaN or a number with a unique value in the affinely extended real number system, with its associated order, except for negative zero and positive zero, which sometimes require special attention (see below). The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign-and-magnitude integers (although with modern processors this is no longer directly applicable): if the sign bits differ, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal); otherwise, the relative order is the same as the lexicographical order of the remaining bits, inverted for two negative numbers. Endianness issues apply.
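A minimal C sketch of this property maps each (non-NaN) bit pattern to an unsigned key whose integer order matches the numeric order; the trick of flipping either the sign bit or all bits is only an illustration, not part of the standard:

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>

 /* Map a float's bits to a key whose unsigned order matches numeric order.
    NaNs are excluded; -0 and +0 map to adjacent but distinct keys. */
 static uint32_t order_key(float f)
 {
     uint32_t u;
     memcpy(&u, &f, sizeof u);
     /* Positives: set the top bit so they sort above all negatives.
        Negatives: invert every bit so larger magnitude sorts lower. */
     return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
 }

 int main(void)
 {
     printf("%d\n", order_key(-2.5f) < order_key(1.0f));  /* 1                       */
     printf("%d\n", order_key(-0.0f) < order_key(0.0f));  /* 1: -0 precedes +0 here  */
     return 0;
 }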

Floating-point arithmetic is subject to rounding that may affect the outcome of comparisons on the results of the computations.

Although negative zero and positive zero are generally considered equal for comparison purposes, some programming-language relational operators and library constructs treat them as distinct. According to the Java Language Specification [The Java Language Specification, http://java.sun.com/docs/books/jls/], comparison and equality operators treat them as equal, but Math.min() and Math.max() distinguish them (officially starting with Java version 1.1, but actually with 1.1.1), as do the comparison methods equals(), compareTo() and even compare() of the classes Float and Double. For C++, the standard says nothing on the subject, so it is important to verify the behaviour (one environment tested treated them as equal when comparing floating-point variables, but as distinct, with negative zero preceding positive zero, when comparing floating-point literals).
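A minimal C sketch of the underlying distinction: the comparison operator treats the two zeros as equal, yet the sign remains observable:

 #include <stdio.h>
 #include <math.h>

 int main(void)
 {
     double pz = 0.0, nz = -0.0;
     printf("%d\n", pz == nz);                        /* 1: == treats them as equal */
     printf("%g %g\n", 1.0 / pz, 1.0 / nz);           /* inf -inf: the signs differ */
     printf("%d %d\n", !!signbit(pz), !!signbit(nz)); /* 0 1: signbit sees the sign */
     return 0;
 }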

Rounding floating-point numbers

The IEEE standard has four rounding modes; the first is the default, and the others are called "directed roundings" (a C sketch after the list shows how the mode can be changed at run time).

* Round to Nearest – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even (zero) least significant bit, so that ties are broken upward and downward about equally often (in IEEE 754r this mode is called "roundTiesToEven" to distinguish it from another round-to-nearest mode)
* Round toward 0 – directed rounding towards zero
* Round toward +∞ – directed rounding towards positive infinity
* Round toward −∞ – directed rounding towards negative infinity.
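The mode can be selected at run time through the C99 <fenv.h> interface; this is a minimal sketch (the FE_* macros are only defined where the platform supports the corresponding mode, and the operands are kept volatile so the division is not folded at compile time under the default mode):

 #include <stdio.h>
 #include <fenv.h>

 int main(void)
 {
     /* Strictly, #pragma STDC FENV_ACCESS ON should be in effect here.       */
     /* 1/3 is not exactly representable, so the rounding mode is observable. */
     volatile double one = 1.0, three = 3.0;

     fesetround(FE_DOWNWARD);          /* round toward negative infinity */
     printf("%.20f\n", one / three);

     fesetround(FE_UPWARD);            /* round toward positive infinity */
     printf("%.20f\n", one / three);   /* one ulp larger than before     */

     fesetround(FE_TONEAREST);         /* restore the default mode       */
     return 0;
 }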

Extending the real numbers

The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, however, the projective mode was dropped. The Intel 8087 and Intel 80287 floating-point co-processors both support this projective mode. [John R. Hauser, "Handling Floating-Point Exceptions in Numeric Programs", ACM Transactions on Programming Languages and Systems 18(2), March 1996, http://www.jhauser.us/publications/1996_Hauser_FloatingPointExceptions.html] [David Stevenson, "IEEE Task P754: A proposed standard for binary floating-point arithmetic", IEEE Computer 14(3), March 1981, pp. 51–62] [William Kahan and John Palmer, "On a proposed floating-point standard", SIGNUM Newsletter 14 (Special issue), 1979, pp. 13–21]

Recommended functions and predicates

* Under some C compilers, copysign(x, y) returns x with the sign of y, so fabs(x) equals copysign(x, 1.0). This is one of the few operations which operates on a NaN in a way resembling arithmetic. The function copysign is new in the C99 standard.
* −x returns x with the sign reversed. This is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.
* scalb (y, N) returns y × 2^{N} (scaling by a power of two)
* logb (x) returns the unbiased exponent of x
* finite (x) a predicate for "x is a finite value", equivalent to −Inf < x < Inf
* isnan (x) a predicate for "x is a nan", equivalent to "x ≠ x"
* x <> y which turns out to have different exception behavior than NOT(x = y).
* unordered (x, y) is true when "x is unordered with y", i.e., either x or y is a NaN.
* class (x) returns the class of x (NaN, infinity, normalized, denormalized or zero, with sign)
* nextafter(x,y) returns the next representable value from x in the direction towards y
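Several of these appear directly in C99's <math.h>; a minimal sketch follows (scalbn and logb are the C names corresponding to scalb and logb above):

 #include <stdio.h>
 #include <math.h>

 int main(void)
 {
     printf("%g\n", copysign(3.0, -1.0));        /* -3: magnitude of x, sign of y */
     printf("%g\n", nextafter(1.0, 2.0) - 1.0);  /* 2^-52, one ulp above 1.0      */
     printf("%g\n", scalbn(1.5, 4));             /* 24: 1.5 * 2^4                 */
     printf("%g\n", logb(24.0));                 /* 4: unbiased exponent of 24    */

     double nan_value = NAN;                     /* quiet NaN from <math.h>       */
     printf("%d %d\n", isnan(nan_value) != 0, nan_value != nan_value);  /* 1 1 */
     return 0;
 }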

See also

* IEEE 754-2008
* −0 (negative zero)
* Intel 8087 (early implementation effort)
* minifloat for simple examples of properties of IEEE 754 floating point numbers
* Q (number format), for constant resolution

References

Further reading

* Charles Severance, "IEEE 754: An Interview with William Kahan", IEEE Computer, vol. 31, no. 3, March 1998, pp. 114–115. doi:10.1109/MC.1998.660194. http://www.freecollab.com/dr-chuck/papers/columns/r3114.pdf (accessed 2008-04-28)
* David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys (CSUR), vol. 23, no. 1, March 1991, pp. 5–48. doi:10.1145/103162.103163. http://www.validlab.com/goldberg/paper.pdf (accessed 2008-04-28)
* Chris Hecker, "Let's Get To The (Floating) Point", Game Developer Magazine, February 1996, pp. 19–24. ISSN 1073-922X. http://www.d6.com/users/checker/pdfs/gdmfp.pdf

External links

* [http://hal.archives-ouvertes.fr/hal-00128124 A compendium of non-intuitive behaviours of floating-point on popular architectures], with implications for program verification and testing
* [http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm Comparing floats]
* [http://www.coprocessor.info/ Coprocessor.info: x87 FPU pictures, development and manufacturer information]
* [http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html IEEE 754 references]
* [http://www2.hursley.ibm.com/decimal/854mins.html IEEE 854-1987] — History and minutes
* [http://babbage.cs.qc.edu/IEEE-754/ Online IEEE 754 Calculators]

