Floating-Point Format

1/25

CS220

April 11, 2007


2/25

Floating-Point Format

Scientific Notation

Coefficient/mantissa, exponent

Decimal

Example: 2.429843 x 10^5, 7.3434 x 10^-3

Binary

Example: 1.0111 x 2^2 => 101.11 (in binary)

101.11 = 2^2 x 1 + 2^1 x 0 + 2^0 x 1 + 2^-1 x 1 + 2^-2 x 1 = 5.75 (in decimal)
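The positional expansion of the binary example can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `binary_to_decimal` is made up for this example and is not part of the original slides.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a binary string such as '101.11' by summing digit x 2^position."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Integer digits carry the weights 2^0, 2^1, ... counted from the right.
    for position, digit in enumerate(reversed(whole)):
        value += int(digit) * 2.0 ** position
    # Fraction digits carry the weights 2^-1, 2^-2, ... counted from the left.
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2.0 ** -position
    return value


print(binary_to_decimal("101.11"))  # 5.75
```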


3/25

IEEE 754 Floating-Point Format

Components

Sign (1 negative, 0 positive)

Significand/Coefficient/Mantissa/Fraction

Normalized or Denormalized

Exponent (stored as an unsigned, biased value)


4/25

Exponent Bias

The stored value of the exponent is offset from the actual value.

Two's complement would make comparison harder, so the exponent is adjusted to put it within an unsigned range suitable for comparison. It is biased by 2^(e-1) - 1 (here e is the size of the exponent part in bits).

For single precision, an exponent in the range -126 to +127 is biased by adding 127 to get a value in the range 1 to 254.

0 is reserved for denormalized numbers or zero.

255 is reserved for infinity or NaN.
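The bias rule can be sketched as two small functions. The names `exponent_bias` and `stored_exponent` are hypothetical, chosen for this illustration only.

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for an exponent field that is exponent_bits wide: 2^(e-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


def stored_exponent(actual: int, exponent_bits: int) -> int:
    """Biased field value for a normalized number with the given actual exponent."""
    stored = actual + exponent_bias(exponent_bits)
    # All 0s and all 1s are reserved (zero/denormalized and infinity/NaN).
    largest_normal = 2 ** exponent_bits - 2
    if not 1 <= stored <= largest_normal:
        raise ValueError("exponent out of range for a normalized number")
    return stored


print(exponent_bias(8))          # 127
print(stored_exponent(-126, 8))  # 1
print(stored_exponent(127, 8))   # 254
```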


5/25

Comparison

Bit pattern   Unsigned   Ones complement   Twos complement   Biased
00000000      0          +0                0                 -127
11111111      255        -0                -1                128
10000000      128        -127              -128              1
01111111      127        127               127               0
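To make the comparison concrete, the sketch below reads the same four 8-bit patterns under each convention, assuming a bias of 127 as in single precision. The function `interpret` is a made-up helper. It also shows why the biased form suits comparison: sorting by biased value gives the same order as sorting the raw patterns as unsigned integers, which is not true for two's complement.

```python
def interpret(bits: str, bias: int = 127) -> dict:
    """Read a bit string under four integer conventions."""
    width = len(bits)
    raw = int(bits, 2)
    negative = bits[0] == "1"
    return {
        "unsigned": raw,
        # Ones' complement: a negative value is the bitwise inverse of its magnitude.
        # Python integers have no -0, so 11111111 comes out as plain 0 here.
        "ones": raw - (2 ** width - 1) if negative else raw,
        "twos": raw - 2 ** width if negative else raw,
        "biased": raw - bias,
    }


patterns = ["00000000", "11111111", "10000000", "01111111"]
for p in patterns:
    print(p, interpret(p))

by_unsigned = sorted(patterns, key=lambda p: interpret(p)["unsigned"])
by_biased = sorted(patterns, key=lambda p: interpret(p)["biased"])
by_twos = sorted(patterns, key=lambda p: interpret(p)["twos"])
print(by_unsigned == by_biased)  # True
print(by_unsigned == by_twos)    # False
```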


6/25

Precision

32 bits, Single-Precision (1,8,23)

(1.18 x 10^-38 to 3.40 x 10^38)

64 bits, Double-Precision (1,11,52)

(2.23 x 10^-308 to 1.79 x 10^308)

80 bits, Double-Extended-Precision (1,15,64)

Intel format, not IEEE standard

(3.37 x 10^-4932 to 1.18 x 10^4932)


7/25

Single Precision

Exponent is biased by 2^(8-1) - 1 = 127

Represents -126 to 127

In the example (bit pattern 0 01111100 01000000000000000000000), the sign is zero, the exponent is -3 (stored as 124), and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 x 2^-3, which is +0.15625.
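The decoding of that example can be sketched in Python. The bit pattern 0x3E200000 is an assumption reconstructed from the values quoted in the text (sign 0, exponent -3, significand 1.01), and `decode_single` is a hypothetical helper that handles normalized numbers only.

```python
import struct


def decode_single(bits: int):
    """Split a 32-bit pattern into sign, actual exponent, significand and value."""
    sign = bits >> 31
    stored_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    # Normalized numbers have an implied leading 1 before the binary point.
    significand = 1 + fraction / 2 ** 23
    exponent = stored_exp - 127
    value = (-1) ** sign * significand * 2.0 ** exponent
    return sign, exponent, significand, value


bits = 0x3E200000  # 0 01111100 01000000000000000000000
print(decode_single(bits))  # (0, -3, 1.25, 0.15625)

# Cross-check against the interpretation done by the struct module.
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])  # 0.15625
```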


8/25

Single Precision Number Ranges

The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

±2^-149 ≈ ±1.4012985 x 10^-45

The smallest non-zero positive and largest non-zero negative normalized numbers (represented with the binary value 1 in the Exp field and 0 in the Fraction field) are

±2^-126 ≈ ±1.175494351 x 10^-38

The largest finite positive and smallest finite negative numbers (represented by the value with 254 in the Exp field and all 1s in the Fraction field) are

±(2^128 - 2^104) ≈ ±3.4028235 x 10^38
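These three limits can be checked by building the bit patterns described in the text and letting Python's `struct` module interpret them. A minimal sketch; `single_from_bits` is a made-up helper name.

```python
import struct


def single_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


smallest_denormal = single_from_bits(0x00000001)  # Exp all 0s, Fraction = 1
smallest_normal = single_from_bits(0x00800000)    # Exp = 1, Fraction = 0
largest_finite = single_from_bits(0x7F7FFFFF)     # Exp = 254, Fraction all 1s

print(smallest_denormal == 2.0 ** -149)           # True
print(smallest_normal == 2.0 ** -126)             # True
print(largest_finite == 2.0 ** 128 - 2.0 ** 104)  # True
```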


9/25

Example

Encode the decimal number -118.625 using the IEEE 754 system.

First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".

Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.

Next, let's move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 x 2^6. This is a normalized floating-point number. The mantissa is the part at the right of the radix point, filled with 0s on the right until we get all 23 bits. That is 11011010100000000000000.

The exponent is 6, but we need to convert it to binary and bias it (so the most negative exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.

Putting the three fields together gives 1 10000101 11011010100000000000000.
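The hand encoding can be cross-checked by asking Python to pack -118.625 as a single-precision value and then splitting the resulting 32 bits into fields. This is a sketch only; `encode_single` is a hypothetical helper name.

```python
import struct


def encode_single(x: float):
    """Return the sign, exponent and fraction fields of x as bit strings."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return f"{sign:01b}", f"{exponent:08b}", f"{fraction:023b}"


print(encode_single(-118.625))
# ('1', '10000101', '11011010100000000000000')
```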


10/25

Double Precision

Exponent is biased by 2^(11-1) - 1 = 1023

Represents -1022 to 1023


11/25

Double Precision Number Ranges

The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are

±2^-1074 ≈ ±5 x 10^-324

The smallest non-zero positive and largest non-zero negative normalized numbers (represented by the value with the binary value 1 in the Exp field and 0 in the Fraction field) are

±2^-1022 ≈ ±2.2250738585072020 x 10^-308

The largest finite positive and smallest finite negative numbers (represented by the value with 2046 in the Exp field and all 1s in the Fraction field) are

±(2^1024 - 2^971) ≈ ±1.7976931348623157 x 10^308
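The double-precision limits can be reproduced in Python, assuming (as on essentially all current platforms) that Python's `float` is an IEEE 754 double. Because 2^1024 itself overflows a double, the largest value is written here in the equivalent form (2 - 2^-52) x 2^1023.

```python
import math
import sys

smallest_denormal = math.ldexp(1.0, -1074)  # 2^-1074
smallest_normal = math.ldexp(1.0, -1022)    # 2^-1022
# (2 - 2^-52) x 2^1023 equals 2^1024 - 2^971 without overflowing.
largest_finite = (2.0 - math.ldexp(1.0, -52)) * math.ldexp(1.0, 1023)

print(smallest_denormal)                      # 5e-324
print(smallest_normal == sys.float_info.min)  # True
print(largest_finite == sys.float_info.max)   # True
```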


12/25

Special Cases

Zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero.

If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (-1)^s x 0.f x 2^-126, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form (-1)^s x 0.f x 2^-1022. From this you can interpret zero as a special type of denormalized number.
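The denormalized formula for single precision can be sketched directly. The helper `decode_denormal_single` is a made-up name; it accepts only patterns whose exponent field is all 0s, and it treats zero as the denormalized value with a zero fraction, as the text suggests.

```python
import math
import struct


def decode_denormal_single(bits: int) -> float:
    """Evaluate (-1)^s x 0.f x 2^-126 for a pattern with an all-zero exponent."""
    sign = bits >> 31
    stored_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if stored_exp != 0:
        raise ValueError("exponent field is not all 0s")
    # No implied leading 1: the significand is 0.f
    return (-1.0) ** sign * (fraction / 2 ** 23) * 2.0 ** -126


print(decode_denormal_single(0x00000001) == 2.0 ** -149)  # True
print(decode_denormal_single(0x00000000))                 # 0.0
print(decode_denormal_single(0x80000000))                 # -0.0
```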


13/25

Special Cases cont

The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations.

The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaNs are represented by a bit pattern with an exponent of all 1s and a non-zero fraction.


14/25

Special Cases Summary

Type           Exponent       Mantissa
Zeroes         0              0          (Positive/Negative Zero depends on sign)
Denormalized   0              non zero
Normalized     1 to 2^e - 2   any
Infinities     2^e - 1        0          (Positive/Negative Infinity depends on sign)
NaNs           2^e - 1        non zero

(here e is size of exponent)
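The summary translates almost line for line into a classifier over the raw exponent and mantissa fields. A minimal sketch; the function name `classify` and its default of e = 8 (single precision) are choices made for this illustration.

```python
def classify(exponent: int, mantissa: int, e: int = 8) -> str:
    """Classify a value from its raw exponent and mantissa field contents."""
    all_ones = 2 ** e - 1
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if mantissa == 0 else "NaN"
    # Everything from 1 to 2^e - 2 is an ordinary normalized number.
    return "normalized"


print(classify(0, 0))       # zero
print(classify(0, 1))       # denormalized
print(classify(133, 0x6D4000))  # normalized
print(classify(255, 0))     # infinity
print(classify(255, 1))     # NaN
```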


15/25

Special Operations

Overflow: exponent too large, producing an infinity.

Underflow: exponent too small, producing a denorm or zero.

Zerodivide: nonzero number is divided by zero, producing an infinity of the appropriate sign.

Operand Error: such as division of zero by zero, or taking the square root of -1, producing a NaN.


16/25

Special Operations

Operation                 Result
n / ±Infinity             0
±Infinity x ±Infinity     ±Infinity
±nonzero / 0              ±Infinity
Infinity + Infinity       Infinity
±0 / ±0                   NaN
Infinity - Infinity       NaN
±Infinity / ±Infinity     NaN
±Infinity x 0             NaN
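Most of these rules can be observed from Python, whose `float` follows IEEE 754 arithmetic for infinities and NaNs. One caveat, noted in the comments: Python itself raises `ZeroDivisionError` for division by zero instead of returning an infinity or NaN, so those rows are not demonstrated here.

```python
import math

inf = math.inf

print(5.0 / inf)               # 0.0
print(inf * -inf)              # -inf
print(inf + inf)               # inf
print(math.isnan(inf - inf))   # True
print(math.isnan(inf / inf))   # True
print(math.isnan(inf * 0.0))   # True

# Note: 1.0 / 0.0 and 0.0 / 0.0 raise ZeroDivisionError in Python,
# a language-level choice rather than IEEE 754 behaviour.
```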


17/25

FPU

Coprocessor: supplements the functions of the primary processor.

Coprocessor examples: floating point arithmetic, graphics, signal processing, string processing, or encryption.

FPU registers: eight 80-bit data registers, three 16-bit registers (control, status, and tag)


18/25

FPU Register Stack

Circular, top is defined in the status register (bits 11-13)

%st(n): the register n positions from the top of the stack; %st(0) is the top


19/25


20/25

Preset Values

FLD1     Push 1.0

FLDL2T   Push log2(10)

FLDL2E   Push log2(e)

FLDPI    Push pi

FLDLG2   Push log10(2)

FLDLN2   Push ln(2) (loge(2))

FLDZ     Push 0.0
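For reference, the values these instructions push can be approximated with Python's `math` module. This is only a double-precision sketch of the constants; the FPU itself holds them in double-extended precision.

```python
import math

presets = {
    "FLD1": 1.0,
    "FLDL2T": math.log2(10),
    "FLDL2E": math.log2(math.e),
    "FLDPI": math.pi,
    "FLDLG2": math.log10(2),
    "FLDLN2": math.log(2),
    "FLDZ": 0.0,
}

for mnemonic, value in presets.items():
    print(f"{mnemonic:7s} {value:.15f}")
```

Two of the pairs are reciprocals of each other, which is a handy sanity check: log2(10) x log10(2) = 1 and log2(e) x ln(2) = 1.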


21/25


22/25

Status Register

Indicates the operating condition of the FPU

Status Bit   Description
0            Invalid operation exception flag
1            Denormalized operand exception flag
2            Zero divide exception flag
3            Overflow exception flag
4            Underflow exception flag
5            Precision exception flag
6            Stack fault
7            Error summary status
8            Condition code bit 0 (C0)
9            Condition code bit 1 (C1)
10           Condition code bit 2 (C2)
11-13        Top of stack pointer
14           Condition code bit 3 (C3)
15           FPU busy flag

fstsw oldvalue

(There is no matching load instruction for the status word; it is restored only as part of the FPU environment, with fldenv or frstor.)
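Once the status word has been stored with fstsw, its fields can be pulled apart with shifts and masks. The sketch below follows the bit layout listed above; the function name `decode_status_word` and the sample value 0x3804 are made up for illustration.

```python
def decode_status_word(word: int) -> dict:
    """Split a 16-bit x87 status word into its named fields."""
    flag_names = [
        "invalid_operation", "denormalized_operand", "zero_divide",
        "overflow", "underflow", "precision", "stack_fault", "error_summary",
    ]
    # Bits 0-7 are single-bit flags, in the order listed above.
    status = {name: bool((word >> bit) & 1) for bit, name in enumerate(flag_names)}
    status["c0"] = (word >> 8) & 1
    status["c1"] = (word >> 9) & 1
    status["c2"] = (word >> 10) & 1
    status["top"] = (word >> 11) & 0b111  # 3-bit top-of-stack pointer
    status["c3"] = (word >> 14) & 1
    status["busy"] = bool((word >> 15) & 1)
    return status


sample = decode_status_word(0x3804)  # hypothetical value
print(sample["zero_divide"], sample["top"])  # True 7
```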


23/25

Control Register

Controls the FPU functions, such as calculation precision and rounding method

Control Bit   Description
0             Invalid operation exception mask
1             Denormal operand exception mask
2             Zero divide exception mask
3             Overflow exception mask
4             Underflow exception mask
5             Precision exception mask
6-7           Reserved
8-9           Precision control
10-11         Rounding control
12            Infinity control
13-15         Reserved

fstcw oldvalue

fldcw newvalue


24/25

Control Register

Precision Control

00 -- single-precision (24-bit significand)

01 -- not used

10 -- double-precision (53-bit significand)

11 -- double-extended-precision (64-bit significand)

Rounding Control

00 -- round to nearest

01 -- round down (toward negative infinity)

10 -- round up (toward positive infinity)

11 -- round toward zero
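The precision and rounding fields can be read from, and written into, a control word with simple masks. In this sketch the helper names are hypothetical, and 0x037F is used as the sample because it is the value usually documented as the x87 default after finit (all exceptions masked, double-extended precision, round to nearest); treat that as an assumption rather than something stated in these slides.

```python
PRECISION = {
    0b00: "single-precision (24-bit significand)",
    0b01: "not used",
    0b10: "double-precision (53-bit significand)",
    0b11: "double-extended-precision (64-bit significand)",
}
ROUNDING = {
    0b00: "round to nearest",
    0b01: "round down (toward negative infinity)",
    0b10: "round up (toward positive infinity)",
    0b11: "round toward zero",
}


def decode_control_word(word: int) -> dict:
    """Extract the exception masks, precision control and rounding control."""
    return {
        "exception_masks": word & 0x3F,             # bits 0-5
        "precision": PRECISION[(word >> 8) & 0b11],  # bits 8-9
        "rounding": ROUNDING[(word >> 10) & 0b11],   # bits 10-11
    }


def with_rounding(word: int, mode: int) -> int:
    """Return a copy of word with the rounding-control field set to mode."""
    return (word & ~(0b11 << 10)) | (mode << 10)


print(decode_control_word(0x037F))
print(hex(with_rounding(0x037F, 0b11)))  # 0xf7f
```

The value returned by `with_rounding` is what would be written to memory and then loaded with fldcw.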


25/25

Tag Register

Identifies the values within the eight 80-bit FPU data registers (2 bits per register):

A valid double-extended-precision value (code 00)

A zero value (code 01)

A special floating-point value (code 10)

Nothing (empty) (code 11)

The tag word has no dedicated store or load instruction; it is saved and restored as part of the FPU environment (fstenv, fldenv).
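A 16-bit tag word can be decoded two bits at a time. This sketch assumes that register i occupies bits 2i and 2i+1 of the word, which is not stated in the slides; `decode_tag_word` is a made-up helper name.

```python
TAGS = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}


def decode_tag_word(word: int) -> list:
    """Return the tag of each of the eight data registers, register 0 first."""
    # Assumption: register i is described by bits 2i and 2i+1.
    return [TAGS[(word >> (2 * i)) & 0b11] for i in range(8)]


print(decode_tag_word(0xFFFF))  # all eight registers empty
print(decode_tag_word(0x3FFF))  # register 7 valid, the rest empty
```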
