
Data Representation: Floating Point for Real Numbers

Computer Organization and Assembly Language: Module 11

Floating Point Representation

The IEEE-754 Floating Point Standard is a widely used floating point representation, chosen from among many alternative formats.

The representation of a floating point number contains:

- a mantissa (a variant of a scaled, sign-magnitude integer)
- an exponent (in single precision, an 8-bit, biased-127 integer)

In this way floating point representation resembles scientific notation. Any positive number N can be represented as M * 10^e, where

e = floor(log10(N))
M = N / 10^e
1 ≤ M < 10
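The decimal scientific-notation relations can be checked with a minimal Python sketch. The function name `decimal_scientific` is hypothetical, and the sketch assumes N > 0 (the logarithm is undefined otherwise).

```python
import math

def decimal_scientific(n):
    """Return (M, e) such that n is approximately M * 10**e with 1 <= M < 10.

    Assumes n > 0; this is an illustration, not a robust formatter.
    """
    e = math.floor(math.log10(n))   # e = floor(log10(N))
    m = n / 10 ** e                 # M = N / 10^e
    return m, e

m, e = decimal_scientific(22.625)
print(m, e)
```

Because the arithmetic is itself done in binary floating point, M may differ from the exact decimal value in its last digits.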

Floating Point Representation

A number N represented in floating point is determined by its mantissa m, its exponent e, and its sign s:

N = (-1)^s * m * 2^e

If the sign is negative, s = 1. If the sign is positive, s = 0.

The mantissa is normalized, i.e., 1 ≤ m < 2.

In the IEEE-754 single precision format, the mantissa is represented with 23 bits (only the fractional part is stored):

m = 1.f22 f21 … f1 f0

Double precision floating point works the same way, but the bit fields are larger: a 1-bit sign, an 11-bit exponent, and 52 bits for the fractional part of the mantissa.
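As an illustration of N = (-1)^s * m * 2^e with a normalized mantissa, the following Python sketch splits a number into s, m, and e. The name `decompose` is hypothetical. It relies on the standard `math.frexp`, which returns a fraction in [0.5, 1), so the result is rescaled to the range 1 ≤ m < 2 used here. It assumes a nonzero, finite input.

```python
import math

def decompose(n):
    """Return (s, m, e) with n == (-1)**s * m * 2**e and 1 <= m < 2.

    Assumes n is nonzero and finite.
    """
    s = 1 if n < 0 else 0
    frac, exp = math.frexp(abs(n))   # abs(n) == frac * 2**exp, 0.5 <= frac < 1
    return s, frac * 2, exp - 1      # rescale so that 1 <= m < 2

print(decompose(22.625))   # (0, 1.4140625, 4)
print(decompose(-83.7))
```

Here 1.4140625 is the decimal value of the binary mantissa 1.0110101.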

Conversion to base-2

1. Break the decimal number into two parts: an integer and a fraction.
2. Convert the integer into binary and place it to the left of the binary point.
3. Convert the fraction into binary and place it to the right of the binary point.
4. Write it in base-2 scientific notation and normalize.
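Steps 1 to 3 can be sketched in Python as follows. The function name `to_binary` and its `max_frac_bits` parameter are hypothetical; the fraction is converted by repeated doubling, and the sketch assumes a non-negative input. Fractions that repeat forever in binary are simply cut off after `max_frac_bits` bits.

```python
def to_binary(value, max_frac_bits=23):
    """Return a binary string such as '10110.101' for a non-negative number."""
    int_part = int(value)            # step 1: split into integer and fraction
    frac = value - int_part
    int_str = bin(int_part)[2:]      # step 2: integer part, left of the point

    bits = []                        # step 3: fraction by repeated doubling
    while frac > 0 and len(bits) < max_frac_bits:
        frac *= 2
        bit = int(frac)              # the integer part of 2*frac is the next bit
        bits.append(str(bit))
        frac -= bit

    if bits:
        return int_str + "." + "".join(bits)
    return int_str

print(to_binary(22.625))      # 10110.101
print(to_binary(17.15, 19))   # 10001.0010011001100110011
```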

Example

Convert 22.625 to floating point representation.

1. Convert 22 to binary: 22_10 = 10110_2
2. Convert .625 to binary:
   2 * .625 = 1 + .25
   2 * .25 = 0 + .5
   2 * .5 = 1 + 0
   so .625_10 = .101_2
3. Thus 22.625_10 = 10110.101_2
4. In base-2 scientific notation: 10110.101 * 2^0
   Normalized form: 1.0110101 * 2^4
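The normalization in step 4 amounts to moving the binary point until exactly one 1 remains to its left, and counting the moves. A minimal Python sketch on binary strings follows; the name `normalize` is hypothetical, and the input is assumed to be a positive binary string containing at least one 1.

```python
def normalize(binary):
    """Normalize a positive binary string.

    '10110.101' -> ('1.0110101', 4), meaning 1.0110101 * 2**4.
    """
    int_part, _, frac_part = binary.partition(".")
    digits = int_part + frac_part
    first_one = digits.index("1")               # position of the leading 1
    exponent = len(int_part) - 1 - first_one    # how far the point moved
    fraction = digits[first_one + 1:].rstrip("0")
    return "1." + (fraction or "0"), exponent

print(normalize("10110.101"))   # ('1.0110101', 4)
print(normalize("0.101"))       # ('1.01', -1)
```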

IEEE-754 SPFP Representation

Given the floating point representation N = (-1)^s * m * 2^e, where m = 1.f22 f21 … f1 f0, we can convert it to the IEEE-754 single precision floating point (SPFP) format using the relations:

S = s
E = e + 127
F = (m - 1) * 2^23 (hence F is an integer)

The fields are stored in the order S, E, F.
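The three relations translate directly into code. In this Python sketch the name `spfp_fields` is hypothetical, and the mantissa is assumed to be already normalized (1 ≤ m < 2). Any mantissa bits beyond the 23 that fit are truncated, which is a simplifying assumption; real hardware normally rounds to nearest.

```python
def spfp_fields(s, m, e):
    """Return the integer fields (S, E, F) for (-1)**s * m * 2**e, 1 <= m < 2."""
    S = s
    E = e + 127                  # biased exponent
    F = int((m - 1) * 2 ** 23)   # fractional part scaled to a 23-bit integer (truncated)
    return S, E, F

# 22.625 = 1.0110101 (binary) * 2**4, and 1.0110101 (binary) = 1.4140625
S, E, F = spfp_fields(0, 1.4140625, 4)
print(S, format(E, "08b"), format(F, "023b"))
# 0 10000011 01101010000000000000000
```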

Single-Precision Floating Point

The IEEE-754 single precision format has 32 bits distributed as

Field:   S   E   F
Bits:    1   8   23

0 ≤ E ≤ 255, thus the actual exponent e (interpreted as biased-127) is restricted so that -127 ≤ e ≤ 128. But e = -127 and e = 128 have special meaning.
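The 1/8/23 bit layout can be made concrete with shifts and masks. The following Python sketch uses the hypothetical names `pack_fields` and `unpack_fields`, and treats the 32-bit word as a plain unsigned integer.

```python
def pack_fields(S, E, F):
    """Combine a 1-bit S, an 8-bit E and a 23-bit F into one 32-bit word."""
    return (S << 31) | (E << 23) | F

def unpack_fields(word):
    """Split a 32-bit word into its (S, E, F) fields."""
    S = (word >> 31) & 0x1        # top bit
    E = (word >> 23) & 0xFF       # next 8 bits
    F = word & 0x7FFFFF           # low 23 bits
    return S, E, F

word = pack_fields(0, 0b10000011, 0b01101010000000000000000)
print(format(word, "032b"))   # 01000001101101010000000000000000
print(unpack_fields(word))    # (0, 131, 3473408)
```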

Special values and the hidden bit

In IEEE-754, zero is represented by setting E = F = 0 regardless of the sign bit, thus there are two representations for zero: +0 and -0.

The largest exponent field, E = 255, is reserved for:

- +∞: S = 0, E = 255, F = 0
- -∞: S = 1, E = 255, F = 0
- NaN (Not-a-Number): E = 255, F ≠ 0 (may result from 0 divided by 0)

The leading 1 of the mantissa is not represented. It is the hidden bit.
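The special-value rules can be written as a small classifier. In this Python sketch `classify` and `word_to_float` are hypothetical helper names. The cross-check uses the standard `struct` module to reinterpret a 32-bit pattern as a single precision value.

```python
import math
import struct

def classify(S, E, F):
    """Name the special value encoded by the fields, if any."""
    sign = "-" if S else "+"
    if E == 255:
        return (sign + "infinity") if F == 0 else "NaN"
    if E == 0 and F == 0:
        return sign + "0"
    return "not a special value"

def word_to_float(word):
    """Reinterpret a 32-bit unsigned integer as a single precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", word))
    return value

print(classify(0, 255, 0), word_to_float(0x7F800000))   # +infinity inf
print(classify(1, 0, 0), word_to_float(0x80000000))     # -0 -0.0
print(classify(0, 255, 1))                              # NaN
```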

Converting to IEEE-754 SPFP

1. Convert into a normalized base-2 representation.
2. Bias the exponent. The result will be E.
3. Put the values into the correct fields. Note that only the fractional part of the mantissa is stored in F.

Example

Convert 22.625 to IEEE-754 SPFP format.

1. In scientific notation: 10110.101 * 2^0
   Normalized form: 1.0110101 * 2^4
2. Bias the exponent: 4 + 127 = 131
   131_10 = 10000011_2
3. Place into the correct fields.
   S = 0
   E = 10000011
   F = 011 0101 0000 0000 0000 0000

S   E          F
0   10000011   01101010000000000000000
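The hand conversion of 22.625 can be cross-checked against the machine's own single precision encoding. This Python sketch uses the standard `struct` module; the helper name `float_to_fields` is hypothetical. Note that `struct.pack` rounds to the nearest representable value, which makes no difference for 22.625 because it is exactly representable.

```python
import struct

def float_to_fields(x):
    """Return (S, E, F) of x's single precision encoding, E and F as bit strings."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    S = word >> 31
    E = (word >> 23) & 0xFF
    F = word & 0x7FFFFF
    return S, format(E, "08b"), format(F, "023b")

print(float_to_fields(22.625))
# (0, '10000011', '01101010000000000000000')
```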

Example

Convert 17.15 to IEEE-754 SPFP format.

17.15_10 = 10001.0010 0110 0110 0110 011_2 * 2^0 (the binary fraction repeats indefinitely, so it is cut off once 23 bits follow the leading 1)

1. Normalized form: 1.0001 0010 0110 0110 0110 011 * 2^4
2. Bias the exponent: 4 + 127 = 131
   131_10 = 10000011_2
3. Place into the correct fields.
   S = 0
   E = 10000011
   F = 000 1001 0011 0011 0011 0011

S   E          F
0   10000011   00010010011001100110011
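Because .15 has no finite binary expansion, the stored pattern represents a value slightly different from 17.15. The Python sketch below decodes the fields from this example and compares the result with the machine's single precision encoding. The helper name `stored_value` is hypothetical, and the comparison assumes the platform's `struct` module rounds to nearest, which here gives the same bits as cutting the fraction off.

```python
import struct

def stored_value(E, F):
    """Decode a positive, normalized single precision value from its E and F fields."""
    return (1 + F / 2 ** 23) * 2.0 ** (E - 127)

# Fields from the worked example: E = 10000011, F = 000 1001 0011 0011 0011 0011
stored = stored_value(0b10000011, 0b00010010011001100110011)

# The machine's own single precision version of 17.15, widened back to a Python float
(machine,) = struct.unpack(">f", struct.pack(">f", 17.15))

print(stored)              # close to, but not exactly, 17.15
print(stored == machine)   # True
print(17.15 - stored)      # the representation error
```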

Example

Convert -83.7 to IEEE-754 SPFP format (single precision).

2 * .7 = 1 + .4
2 * .4 = 0 + .8
2 * .8 = 1 + .6
2 * .6 = 1 + .2
2 * .2 = 0 + .4
2 * .4 = 0 + .8
2 * .8 = 1 + .6
2 * .6 = 1 + .2
2 * .2 = 0 + .4
. . .

-83.7_10 = -1010011.101100110..._2

1. In binary scientific notation: -1010011.10110011001100110 * 2^0
   Normalized: -1.01001110110011001100110 * 2^6
2. Bias the exponent: 6 + 127 = 133
   133_10 = 10000101_2
3. Place into the correct fields.
   S = 1
   E = 10000101
   F = 01001110110011001100110

S   E          F
1   10000101   01001110110011001100110

Representing as hexadecimal

It is difficult for people to read binary: one bit pattern looks much like another.

Raw data, which is not being interpreted as representing a particular data type, is often displayed using hexadecimal instead of binary.

The final step in many IEEE-754 SPFP problems will be to convert the result to hexadecimal:

1100 0010 1010 0111 0110 0110 0110 0110 → C2A76666
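The binary-to-hexadecimal step is mechanical: every group of four bits becomes one hex digit. A short Python sketch follows; the helper names `bits_to_hex` and `float_to_hex` are hypothetical. The second helper asks the machine for the single precision encoding directly, as a cross-check on the -83.7 result.

```python
import struct

def bits_to_hex(bits):
    """Convert a 32-character string of 0s and 1s (spaces allowed) to 8 hex digits."""
    return format(int(bits.replace(" ", ""), 2), "08X")

def float_to_hex(x):
    """Hex form of x's single precision encoding, most significant byte first."""
    return struct.pack(">f", x).hex().upper()

print(bits_to_hex("11000010101001110110011001100110"))   # C2A76666
print(float_to_hex(-83.7))                               # C2A76666
```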

Graceful underflow

Given a single precision floating point number with bit fields S, E, and F (interpreted as unsigned integers), the value of the number is normally calculated as

N = (-1)^S * (1 + F/2^23) * 2^(E-127)

This interpretation is not used when

- E = 255 (+∞, -∞, or NaN)
- E = 0, F = 0 (+0 or -0)

What about E = 0, F ≠ 0?
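The normal interpretation, N = (-1)^S * (1 + F/2^23) * 2^(E-127), can be evaluated directly. In this Python sketch the name `decode_normal` is hypothetical, and the formula is only meaningful for 1 ≤ E ≤ 254, so other values of E are rejected.

```python
def decode_normal(S, E, F):
    """Value of a normalized single precision number from its integer fields."""
    if not 1 <= E <= 254:
        raise ValueError("E = 0 and E = 255 use a different interpretation")
    return (-1) ** S * (1 + F / 2 ** 23) * 2.0 ** (E - 127)

print(decode_normal(0, 0b10000011, 0b01101010000000000000000))   # 22.625
print(decode_normal(1, 0b10000101, 0b01001110110011001100110))   # close to -83.7
```

The second result is not exactly -83.7, because the fraction was cut off at 23 bits.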

Graceful underflow

Given a single precision floating point number with bit fields S, E = 0, and F ≠ 0 (interpreted as unsigned integers), the value of the number is calculated as

N = (-1)^S * (0 + F/2^23) * 2^(-126)

There is no hidden 1 in this case. This allows representation of numbers as small as 2^-149, though each factor of 2 below 2^-126 results in the loss of one bit of precision.
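The E = 0 interpretation differs from the normal one only in dropping the hidden 1 and fixing the exponent at -126. A minimal Python sketch, with the hypothetical name `decode_subnormal`:

```python
def decode_subnormal(S, F):
    """Value of a single precision number whose exponent field E is 0."""
    return (-1) ** S * (0 + F / 2 ** 23) * 2.0 ** -126

smallest = decode_subnormal(0, 1)           # only the last fraction bit set
print(smallest == 2.0 ** -149)              # True
print(decode_subnormal(0, 1 << 22) == 2.0 ** -127)   # True
```

With F = 1 only a single significant bit remains, which is the extreme case of the precision loss described above.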

Graceful underflow

0 00000001 00000000000000000000000
Normal interpretation: N = 2^(1 - 127) = 2^-126
24 bits of precision (counting the hidden bit)

0 00000000 10000000000000000000000
E = 0 interpretation: N = 2^-126 * (.1_2) = 2^-126 * (.5) = 2^-127
Only 23 bits of precision

0 00000000 00010000000000000000000
E = 0 interpretation: N = 2^-126 * (.0001_2) = 2^-126 * (.0625) = 2^-130
Only 20 bits of precision
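The three bit patterns can be fed to the machine's own decoder to confirm the values 2^-126, 2^-127 and 2^-130. This Python sketch uses the standard `struct` module; the helper name `bits_to_float` is hypothetical.

```python
import struct

def bits_to_float(pattern):
    """Interpret a string of 32 bits (spaces allowed) as a single precision float."""
    word = int(pattern.replace(" ", ""), 2)
    (value,) = struct.unpack(">f", struct.pack(">I", word))
    return value

patterns = [
    ("0 00000001 00000000000000000000000", -126),
    ("0 00000000 10000000000000000000000", -127),
    ("0 00000000 00010000000000000000000", -130),
]
for pattern, power in patterns:
    print(pattern, bits_to_float(pattern) == 2.0 ** power)   # True for each pattern
```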