Floating-Point Format


  • 8/14/2019 Floating-Point Format


    CS220

    April 11, 2007


    Floating-Point Format

    Scientific notation: a coefficient (mantissa) and an exponent

    Decimal
    Example: 2.429843 x 10^5, 7.3434 x 10^-3

    Binary
    Example: 1.0111 x 2^2 => 101.11 (in binary)
    101.11 = 1 x 2^2 + 0 x 2^1 + 1 x 2^0 + 1 x 2^-1 + 1 x 2^-2 = 5.75 (in decimal)
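The positional expansion of a binary numeral can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `binary_to_decimal` is made up for this example and is not part of the slides.

```python
def binary_to_decimal(text: str) -> float:
    """Evaluate a binary numeral such as '101.11' digit by digit."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Integer digits carry the weights 2^0, 2^1, ... from right to left.
    for position, digit in enumerate(reversed(whole)):
        value += int(digit) * 2.0 ** position
    # Fraction digits carry the weights 2^-1, 2^-2, ... from left to right.
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2.0 ** -position
    return value


print(binary_to_decimal("101.11"))           # 5.75
print(binary_to_decimal("1.0111") * 2 ** 2)  # 5.75
```

Both lines print the same value because multiplying by 2^2 only moves the binary point two places to the right.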


    IEEE 754 Floating-Point Format

    Components
    Sign (1 negative, 0 positive)
    Significand (also called coefficient, mantissa, or fraction): normalized or denormalized
    Exponent (stored as a non-negative, biased value)


    Exponent Bias

    The stored value of the exponent is offset from the actual value. A two's complement exponent would make comparison harder, so the value is adjusted into an unsigned range suitable for comparison: it is biased by 2^(e-1) - 1, where e is the size of the exponent field in bits.

    For single precision, an exponent in the range -126 to +127 is biased by adding 127 to get a value in the range 1 to 254.
    0 is reserved for denormalized numbers and zero.
    255 is reserved for infinity and NaN.
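As a minimal sketch of the biasing rule, assuming the bias is 2^(e-1) - 1 and that the all-zeros and all-ones field values are reserved, the conversion could look like this. The function names `exponent_bias` and `to_stored` are hypothetical.

```python
def exponent_bias(exponent_bits: int) -> int:
    """Bias for an exponent field that is exponent_bits wide."""
    return 2 ** (exponent_bits - 1) - 1


def to_stored(actual_exponent: int, exponent_bits: int) -> int:
    """Return the biased field value for a normalized number."""
    stored = actual_exponent + exponent_bias(exponent_bits)
    largest_normal = 2 ** exponent_bits - 2  # the all-ones value is reserved
    if not 1 <= stored <= largest_normal:    # the all-zeros value is reserved
        raise ValueError("exponent out of range for a normalized number")
    return stored


print(exponent_bias(8))    # 127
print(to_stored(-126, 8))  # 1
print(to_stored(127, 8))   # 254
```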


    Comparison

    Bit pattern   Unsigned   Ones complement   Twos complement   Biased
    11111111      255        -0                -1                128
    10000000      128        -127              -128              1
    01111111      127        127               127               0
    00000000      0          +0                0                 -127
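The four readings of an 8-bit pattern can be reproduced with a short sketch. It assumes a bias of 127, as used for single precision; the function name `interpretations` and the dictionary keys are invented for this illustration.

```python
def interpretations(bits: str, bias: int = 127) -> dict:
    """Read one bit pattern under the four schemes compared above."""
    width = len(bits)
    raw = int(bits, 2)         # plain unsigned value
    negative = bits[0] == "1"  # leading bit set
    # Ones' complement: a negative value is the bitwise inverse of its magnitude.
    # Python integers have no negative zero, so 11111111 reads as 0 here.
    ones = -((2 ** width - 1) - raw) if negative else raw
    # Two's complement: a negative value is the unsigned value minus 2^width.
    twos = raw - 2 ** width if negative else raw
    return {"unsigned": raw, "ones": ones, "twos": twos, "biased": raw - bias}


patterns = ["11111111", "10000000", "01111111", "00000000"]
for pattern in patterns:
    print(pattern, interpretations(pattern))

# Sorting the raw bit patterns orders the biased values correctly,
# but not the two's complement values.
by_bits = sorted(patterns)
print([interpretations(p)["biased"] for p in by_bits])  # [-127, 0, 1, 128]
print([interpretations(p)["twos"] for p in by_bits])    # [0, 127, -128, -1]
```

The last two lines show why a biased exponent is convenient: comparing the stored bits as unsigned integers already compares the exponents.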


    Precision

    Field widths are given as (sign, exponent, significand) bits.

    32 bits, Single-Precision (1,8,23)
    (1.18 x 10^-38 to 3.40 x 10^38)

    64 bits, Double-Precision (1,11,52)
    (2.23 x 10^-308 to 1.79 x 10^308)

    80 bits, Double-Extended-Precision (1,15,64)
    Intel format, not IEEE standard
    (3.37 x 10^-4932 to 1.18 x 10^4932)


    Single Precision

    The exponent is biased by 2^(8-1) - 1 = 127.
    It represents -126 to 127.

    In the slide's example, the sign is zero, the exponent is -3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 x 2^-3, which is +0.15625.
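The fields of this example can be inspected with Python's standard `struct` module. The sketch below packs 0.15625 as a big-endian 32-bit float and splits the bits; the helper name `float32_fields` is an assumption made for this illustration.

```python
import struct


def float32_fields(value: float):
    """Return (sign, exponent field, fraction field) of a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                     # 1 bit
    exponent_field = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF            # 23 bits, leading 1 not stored
    return sign, exponent_field, fraction


sign, exponent_field, fraction = float32_fields(0.15625)
print(sign, exponent_field - 127, format(fraction, "023b"))

# Rebuild the value from the fields: (-1)^s x 1.f x 2^(exponent - 127)
rebuilt = (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent_field - 127)
print(rebuilt)  # 0.15625
```

The stored exponent field is 124, which is -3 after removing the bias, and the fraction field begins with 01, matching the significand 1.01.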


    Single Precision Number Ranges

    The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
        ±2^-149 ≈ ±1.4012985 x 10^-45

    The smallest non-zero positive and largest non-zero negative normalized numbers (represented with the binary value 1 in the Exp field and 0 in the Fraction field) are
        ±2^-126 ≈ ±1.175494351 x 10^-38

    The largest finite positive and smallest finite negative numbers (represented by the value with 254 in the Exp field and all 1s in the Fraction field) are
        ±(2^128 - 2^104) ≈ ±3.4028235 x 10^38
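These three limits can be checked by building the bit patterns directly. The following sketch uses the standard `struct` module; the helper name `float32_from_bits` is hypothetical.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit integer as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


smallest_denormal = float32_from_bits(0x00000001)  # Exp = 0, Fraction = 1
smallest_normal = float32_from_bits(0x00800000)    # Exp = 1, Fraction = 0
largest_finite = float32_from_bits(0x7F7FFFFF)     # Exp = 254, Fraction all 1s

print(smallest_denormal == 2.0 ** -149)            # True
print(smallest_normal == 2.0 ** -126)              # True
print(largest_finite == 2.0 ** 128 - 2.0 ** 104)   # True
```

Setting the top bit of each pattern (for example 0x80000001) gives the corresponding negative limit.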


    Example

    Encode the decimal number -118.625 using the IEEE 754 system.

    First we need to get the sign, the exponent, and the fraction. Because it is a negative number, the sign is "1".

    Now we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101.

    Next, we move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 x 2^6. This is a normalized floating-point number. The mantissa is the part at the right of the radix point, filled with 0s on the right until we get all 23 bits. That is 11011010100000000000000.

    The exponent is 6, but we need to convert it to binary and bias it (so that every stored exponent is a non-negative binary number). For the 32-bit IEEE 754 format, the bias is 127, so 6 + 127 = 133. In binary, this is written as 10000101.

    Putting the three fields together gives 1 10000101 11011010100000000000000.
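The steps of this example can be followed mechanically. The sketch below is a teaching aid, not a general encoder: it handles only finite, non-zero values in the normalized range and truncates instead of rounding, which is exact for -118.625. The function name `encode_float32_by_hand` is invented; the result is compared against Python's `struct` module.

```python
import math
import struct


def encode_float32_by_hand(value: float) -> str:
    """Return 'sign exponent fraction' bit strings for a normalized value."""
    if value == 0 or not math.isfinite(value):
        raise ValueError("only finite, non-zero values are handled")
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = 0
    # Move the radix point until exactly one 1 remains to its left.
    while magnitude >= 2.0:
        magnitude /= 2.0
        exponent += 1
    while magnitude < 1.0:
        magnitude *= 2.0
        exponent -= 1
    # Drop the leading 1 and keep 23 fraction bits (truncating).
    fraction = int((magnitude - 1.0) * 2 ** 23)
    return f"{sign:01b} {exponent + 127:08b} {fraction:023b}"


encoded = encode_float32_by_hand(-118.625)
print(encoded)  # 1 10000101 11011010100000000000000

# Cross-check against the machine encoding.
print(struct.pack(">f", -118.625).hex())  # c2ed4000
```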


    Double Precision

    The exponent is biased by 2^(11-1) - 1 = 1023.
    It represents -1022 to 1023.


    Double Precision Number Ranges

    The smallest non-zero positive and largest non-zero negative numbers (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
        ±2^-1074 ≈ ±5 x 10^-324

    The smallest non-zero positive and largest non-zero negative normalized numbers (represented by the value with the binary value 1 in the Exp field and 0 in the Fraction field) are
        ±2^-1022 ≈ ±2.2250738585072020 x 10^-308

    The largest finite positive and smallest finite negative numbers (represented by the value with 2046 in the Exp field and all 1s in the Fraction field) are
        ±(2^1024 - 2^971) ≈ ±1.7976931348623157 x 10^308
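On platforms where Python's `float` is an IEEE 754 double (an assumption that holds on mainstream systems), the same limits can be checked against `sys.float_info`. This is a small sketch using only the standard library.

```python
import math
import sys

smallest_denormal = math.ldexp(1.0, -1074)    # 1.0 x 2^-1074
smallest_normal = math.ldexp(1.0, -1022)      # 1.0 x 2^-1022
largest_finite = float(2 ** 1024 - 2 ** 971)  # exact integer, then converted

print(smallest_denormal)                      # 5e-324
print(smallest_normal == sys.float_info.min)  # True
print(largest_finite == sys.float_info.max)   # True
```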


    Special Cases

    Zero is not directly representable in the straight format because of the assumed leading 1 (we would need a true zero mantissa to yield a value of zero). Zero is therefore a special value, denoted by an exponent field of zero and a fraction field of zero.

    If the exponent is all 0s but the fraction is non-zero (otherwise it would be interpreted as zero), the value is a denormalized number, which does not have an assumed leading 1 before the binary point. It represents the number (-1)^s x 0.f x 2^-126, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form (-1)^s x 0.f x 2^-1022. From this you can interpret zero as a special type of denormalized number.
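The single-precision denormalized formula can be compared with what the hardware format actually decodes. In this sketch the names `denormal32_value` and `float32_from_bits` are hypothetical helpers; a fraction of 0 gives zero, which illustrates why zero can be read as a special denormalized number.

```python
import struct


def denormal32_value(sign: int, fraction: int) -> float:
    """(-1)^s x 0.f x 2^-126, valid when the exponent field is all 0s."""
    return (-1.0) ** sign * (fraction / 2 ** 23) * 2.0 ** -126


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit integer as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Exponent field 0, so the bit pattern is just sign and fraction.
for sign in (0, 1):
    for fraction in (0, 1, 0x400000, 0x7FFFFF):
        bits = (sign << 31) | fraction
        assert denormal32_value(sign, fraction) == float32_from_bits(bits)

print(denormal32_value(0, 0))         # 0.0
print(denormal32_value(0, 0x400000))  # 0.1 (binary) x 2^-126 = 2^-127
```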


    Special Cases (cont.)

    The values +infinity and -infinity are denoted by an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes negative infinity from positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations.

    The value NaN (Not a Number) is used to represent a result that is not a real number. NaNs are represented by a bit pattern with an exponent of all 1s and a non-zero fraction.


    Special Cases Summary

    Type           Exponent       Mantissa
    Zeroes         0              0          (positive/negative zero depends on sign)
    Denormalized   0              non-zero
    Normalized     1 to 2^e - 2   any
    Infinities     2^e - 1        0          (positive/negative infinity depends on sign)
    NaNs           2^e - 1        non-zero

    (here e is the size of the exponent field in bits)
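The summary table translates almost line for line into a classifier. The sketch below fixes e = 8 (single precision); the function names `classify_float32` and `bits_of` are invented for this illustration.

```python
import struct


def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern using the exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF  # e = 8, so 2^e - 1 = 255
    mantissa = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"             # exponent field 1 to 254


def bits_of(value: float) -> int:
    """Return the single-precision bit pattern of a Python float."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


for value in (0.0, -0.0, 1.0, float("inf"), float("-inf"), float("nan")):
    print(value, classify_float32(bits_of(value)))
print(classify_float32(0x00000001))  # denormalized
```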


    Special Operations

    Overflow: the exponent is too large, producing an infinity.
    Underflow: the exponent is too small, producing a denormalized number or zero.
    Zero divide: a nonzero number is divided by zero, producing an infinity of the appropriate sign.
    Operand error: an invalid operation, such as division of zero by zero or taking the square root of -1, producing a NaN.


    Special Operations

    Operation                 Result
    n / ±Infinity             0
    ±Infinity x ±Infinity     ±Infinity
    ±nonzero / 0              ±Infinity
    Infinity + Infinity       Infinity
    ±0 / ±0                   NaN
    Infinity - Infinity       NaN
    ±Infinity / ±Infinity     NaN
    ±Infinity x 0             NaN
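Most rows of this table can be observed directly in Python, whose floats follow IEEE arithmetic for infinities and NaNs. One caveat, stated here as a property of Python and not of the standard: Python raises `ZeroDivisionError` for floating-point division by zero instead of returning an infinity or NaN, so those rows are only reported. The helper `divide_or_report` is a made-up name.

```python
import math

inf = math.inf

print(5.0 / inf)              # 0.0
print(inf * inf)              # inf
print(inf + inf)              # inf
print(math.isnan(inf - inf))  # True
print(math.isnan(inf / inf))  # True
print(math.isnan(inf * 0.0))  # True


def divide_or_report(a: float, b: float):
    """Divide, or name the exception Python raises instead of a result."""
    try:
        return a / b
    except ZeroDivisionError:
        return "ZeroDivisionError"


print(divide_or_report(1.0, 0.0))  # ZeroDivisionError
print(divide_or_report(0.0, 0.0))  # ZeroDivisionError
```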


    FPU

    Coprocessor: supplements the functions of the primary processor.
    Coprocessor examples: floating-point arithmetic, graphics, signal processing, string processing, or encryption.
    FPU registers: eight 80-bit data registers and three 16-bit registers (control, status, and tag).


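    FPU Register Stack

    Circular; the top of the stack is defined in the status register (top of stack pointer, bits 11-13).
    Registers are addressed relative to the top of the stack as %st(n).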


    Preset Values

    FLD1     Push 1.0
    FLDL2T   Push log2(10)
    FLDL2E   Push log2(e)
    FLDPI    Push pi
    FLDLG2   Push log10(2)
    FLDLN2   Push ln(2) (loge(2))
    FLDZ     Push 0.0


    Status Register

    Indicates the operating condition of the FPU.

    Status Bit   Description
    0            Invalid operation exception flag
    1            Denormalized operand exception flag
    2            Zero divide exception flag
    3            Overflow exception flag
    4            Underflow exception flag
    5            Precision exception flag
    6            Stack fault
    7            Error summary status
    8            Condition code bit 0 (C0)
    9            Condition code bit 1 (C1)
    10           Condition code bit 2 (C2)
    11-13        Top of stack pointer
    14           Condition code bit 3 (C3)
    15           FPU busy flag

    fstsw oldvalue
    (The x87 instruction set has no fldsw; the status word is reloaded only as part of the FPU environment, with fldenv or frstor.)
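Given a 16-bit status word value, the bit layout listed above can be unpacked as follows. This is a sketch only: the function name `decode_status_word`, the dictionary keys, and the sample value 0x3804 are made up for illustration and do not come from a real FPU dump.

```python
STATUS_FLAGS = {
    0: "invalid operation",
    1: "denormalized operand",
    2: "zero divide",
    3: "overflow",
    4: "underflow",
    5: "precision",
    6: "stack fault",
    7: "error summary",
    15: "busy",
}


def decode_status_word(word: int) -> dict:
    """Split a 16-bit x87 status word into its fields."""
    return {
        "flags": [name for bit, name in STATUS_FLAGS.items() if (word >> bit) & 1],
        "top": (word >> 11) & 0b111,  # bits 11-13, top of stack pointer
        "c0": (word >> 8) & 1,
        "c1": (word >> 9) & 1,
        "c2": (word >> 10) & 1,
        "c3": (word >> 14) & 1,
    }


# Hypothetical value: zero divide flag set, top of stack pointer = 7.
print(decode_status_word(0x3804))
```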


    Control Register

    Controls the FPU functions, such as calculation precision and rounding method.

    Control Bit   Description
    0             Invalid operation exception mask
    1             Denormal operand exception mask
    2             Zero divide exception mask
    3             Overflow exception mask
    4             Underflow exception mask
    5             Precision exception mask
    6-7           Reserved
    8-9           Precision control
    10-11         Rounding control
    12            Infinity control
    13-15         Reserved

    fstcw oldvalue
    fldcw newvalue


    Control Register

    Precision Control
    00 -- single-precision (24-bit significand)
    01 -- not used
    10 -- double-precision (53-bit significand)
    11 -- double-extended-precision (64-bit significand)

    Rounding Control
    00 -- round to nearest
    01 -- round down (toward negative infinity)
    10 -- round up (toward positive infinity)
    11 -- round toward zero
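The precision and rounding fields can be read and changed with simple bit masks before the new value is loaded with fldcw. The sketch below models only the arithmetic on the 16-bit word; the function names are hypothetical, and 0x037F is used as the sample value because it is the control word the x87 FPU holds after initialization.

```python
PRECISION = {0b00: "single", 0b01: "not used", 0b10: "double", 0b11: "double-extended"}
ROUNDING = {0b00: "nearest", 0b01: "down", 0b10: "up", 0b11: "toward zero"}


def decode_control_word(word: int) -> dict:
    """Extract the exception masks, precision control and rounding control."""
    return {
        "masks": word & 0x3F,                    # bits 0-5
        "precision": PRECISION[(word >> 8) & 0b11],   # bits 8-9
        "rounding": ROUNDING[(word >> 10) & 0b11],    # bits 10-11
    }


def with_rounding(word: int, mode: int) -> int:
    """Return the control word with bits 10-11 replaced by mode."""
    return (word & ~0x0C00) | ((mode & 0b11) << 10)


default = 0x037F
print(decode_control_word(default))
print(hex(with_rounding(default, 0b11)))  # 0xf7f
```

Clearing the field before setting it matters: OR-ing a new mode into a word whose rounding bits are already non-zero would otherwise combine the two modes.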


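    Tag Register

    Identifies the values within the eight 80-bit FPU data registers (2 bits per register):
    A valid double-extended-precision value (code 00)
    A zero value (code 01)
    A special floating-point value (code 10)
    Nothing (empty) (code 11)

    The tag word has no dedicated load or store instruction (fsttw and fldtw do not exist in the x87 instruction set); it is saved and restored as part of the FPU environment with fstenv and fldenv.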
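A 16-bit tag word can be split into its eight 2-bit codes as shown below. This is an illustrative sketch: the function name `decode_tag_word` and the sample value 0xFFF4 are invented, and the code assumes that bits 0-1 describe data register 0, bits 2-3 describe register 1, and so on.

```python
TAG_MEANING = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}


def decode_tag_word(word: int) -> list:
    """Return the tag of each of the eight data registers, register 0 first."""
    return [TAG_MEANING[(word >> (2 * index)) & 0b11] for index in range(8)]


print(decode_tag_word(0xFFFF))  # all eight registers empty
print(decode_tag_word(0xFFF4))  # register 0 valid, register 1 zero, rest empty
```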