Floating Point Numbers
Muddsar Jamil, CS 147
Introduction & Representation
Provides the ability to represent very large numbers, as well as very small numbers.
Example: 1 trillion needs 40 bits to the left of the radix point.
Retaining only as much precision as needed increases calculation efficiency.
A great deal of extra hardware is required in order to store/manipulate numbers with 80 bits or more.
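As a quick check of the 1 trillion figure, the following minimal sketch counts the bits in its binary form. The class and method names (`BitsNeeded`, `bitsNeeded`) are illustrative and not from the slides.

```java
class BitsNeeded {
    // Number of bits needed to write a positive long in binary.
    static int bitsNeeded(long n) {
        return 64 - Long.numberOfLeadingZeros(n);
    }

    public static void main(String[] args) {
        long oneTrillion = 1_000_000_000_000L;
        System.out.println(bitsNeeded(oneTrillion)); // prints 40
    }
}
```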
Computer Representation of Floating Point Numbers
_ _ _ _ . _ _ _ _ _ _ _ _ _ _ _ _
Sign bit (0 = +, 1 = -), then a 3-bit exponent, then three base-16 digits (12 bits) of fraction
This format makes for easier comparison: =, ≠, ≤, ≥
Example: Convert $(358)_{10}$ into the above format to be used as a floating point number.
Java: Float x = 358.0f;
Example Continued
First step is to convert 358 from base 10 to base 16. Using Horner's method:
358 / 16 = 22 R 6
22 / 16 = 1 R 6
1 / 16 = 0 R 1
Reading the remainders from last to first: $(358)_{10} = (166)_{16}$
Next, convert to floating point and normalize:
$(166)_{16} = (166.)_{16} \times 16^{0}$
Normalize: $(.166)_{16} \times 16^{3}$
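The repeated-division step can be sketched in Java as follows. This is a minimal illustration, and the names `ToBase16` and `toBase16` are assumptions made for the example.

```java
class ToBase16 {
    // Convert a non-negative integer to base 16 by repeated division:
    // each remainder is the next digit, least significant first.
    static String toBase16(int n) {
        if (n == 0) {
            return "0";
        }
        StringBuilder digits = new StringBuilder();
        while (n > 0) {
            int remainder = n % 16;
            digits.append(Character.forDigit(remainder, 16));
            n /= 16;
        }
        // Remainders were collected in reverse order.
        return digits.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toBase16(358)); // prints 166
    }
}
```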
The exponent is 3, but we represent it in excess 4:
  0 1 1   $(+3)_{10}$
+ 1 0 0   $(+4)_{10}$, excess 4
= 1 1 1
Sign   Expon.     Fraction
0      1 1 1    . 0 0 0 1  0 1 1 0  0 1 1 0
+      3          1        6        6
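The whole encoding can be sketched as one small Java method. This is only an illustration under stated assumptions: it handles nonzero integers with at most three hex digits, and the names `ToyFloat` and `encode` are hypothetical, not part of the slides.

```java
class ToyFloat {
    // Encode a nonzero integer with at most three hex digits into the
    // 16-bit format: 1 sign bit, 3-bit excess-4 exponent, 12-bit fraction.
    static int encode(int value) {
        int magnitude = Math.abs(value);
        if (magnitude == 0 || magnitude > 0xFFF) {
            throw new IllegalArgumentException("out of range for this sketch: " + value);
        }
        int sign = value < 0 ? 1 : 0;

        // The exponent equals the number of hex digits, because normalizing
        // moves the radix point to the left of the leading digit.
        int exponent = 0;
        for (int m = magnitude; m > 0; m /= 16) {
            exponent++;
        }

        // Left-justify the digits in the three-digit fraction field.
        int fraction = magnitude << (4 * (3 - exponent));

        int biased = exponent + 4; // excess 4
        return (sign << 15) | (biased << 12) | fraction;
    }

    public static void main(String[] args) {
        int bits = encode(358);
        System.out.println(Integer.toHexString(bits)); // prints 7166
    }
}
```

The hex result 7166 is the bit pattern 0111 0001 0110 0110: sign 0, exponent 111, fraction digits 1, 6, 6.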
Fractional -> Fixed Point Conversion
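Convert $(XYZ.375)_{10}$ to binary.
First, convert the integer part XYZ using Horner's method.
Next, convert $(.375)_{10}$ as follows: multiply by 2 repeatedly, taking the integer part of each product as the next bit.
.375 x 2 = 0.75   -> 0   (most significant bit)
.75 x 2 = 1.5     -> 1
.5 x 2 = 1.0      -> 1   (least significant bit)
So $(.375)_{10} = (.011)_{2}$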
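The multiply-by-2 procedure can be sketched as a short loop. This is a minimal version: the `maxBits` parameter is an added assumption that stops the loop for fractions with no finite binary expansion (such as 0.1), and the names are illustrative.

```java
class FractionToBinary {
    // Convert a fraction in [0, 1) to binary by repeated doubling:
    // the integer part of each product is the next bit, most significant first.
    static String toBinary(double fraction, int maxBits) {
        StringBuilder bits = new StringBuilder(".");
        for (int i = 0; i < maxBits && fraction != 0.0; i++) {
            fraction *= 2;
            if (fraction >= 1.0) {
                bits.append('1');
                fraction -= 1.0;
            } else {
                bits.append('0');
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(0.375, 16)); // prints .011
    }
}
```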
IEEE 754 Floating Point Standard
Created in 1985 to ensure standard representation among different systems.
Most new architectures support IEEE 754.
Two formats:
                   Sign   Expon.   Fraction   Total
Single Precision   1      8        23         32 bits
Double Precision   1      11       52         64 bits
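To see the single precision layout in practice, the sketch below splits a Java `float` into its three fields using the standard `Float.floatToIntBits` method. The class and helper names are illustrative. Note that the 8-bit field holds the stored (biased) exponent, which for single precision is the true exponent plus 127.

```java
class FloatFields {
    // 1 sign bit: bit 31.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // 8 exponent bits: bits 30..23, as stored (biased by 127).
    static int exponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // 23 fraction bits: bits 22..0.
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float x = 358.0f; // 358 = 1.0110011 (binary) x 2^8
        System.out.println(sign(x));                          // prints 0
        System.out.println(exponent(x));                      // prints 135
        System.out.println(Integer.toHexString(fraction(x))); // prints 330000
    }
}
```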
IEEE 754 Representations
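Can represent (among others):
Non-zero, normalized numbers
Clean zero: all 0s in exponent and fraction. Sign bit can be 0 or 1, to represent +0 or -0.
Infinity / overflow: exponent contains all 1s, fraction is all 0s. Sign bit can be 0 or 1, to represent +infinity or -infinity.
NaN (not a number): exponent contains all 1s, fraction is non-zero. Produced by invalid operations such as 0 / 0 or Sqrt(-1).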
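These special values can be inspected from Java. The sketch below prints single precision bit patterns in hex, using an illustrative helper named `bits`. Infinity has an all-1s exponent with a zero fraction, while NaN has an all-1s exponent with a non-zero fraction. `Float.floatToIntBits` reports every NaN as the canonical pattern 7fc00000.

```java
class SpecialValues {
    // Bit pattern of a float as eight hex digits.
    static String bits(float f) {
        return String.format("%08x", Float.floatToIntBits(f));
    }

    public static void main(String[] args) {
        System.out.println(bits(0.0f));                    // prints 00000000
        System.out.println(bits(-0.0f));                   // prints 80000000
        System.out.println(bits(Float.POSITIVE_INFINITY)); // prints 7f800000
        System.out.println(bits(Float.NEGATIVE_INFINITY)); // prints ff800000
        System.out.println(bits(0.0f / 0.0f));             // prints 7fc00000
        System.out.println(bits((float) Math.sqrt(-1)));   // prints 7fc00000
    }
}
```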