Ellen Spertus MCS 111 October 11, 2001 Floating Point Arithmetic.

Ellen Spertus

MCS 111

October 11, 2001

Floating Point Arithmetic

2

Decimal addition (1)

• Problem: 9.999×10¹ + 1.610×10⁻¹

• Estimate answer:

3


• Problem: 9.999×10¹ + 1.610×10⁻¹

• Calculate answer:

  9.999×10¹

+ 1.610×10⁻¹

4


• Problem: 9.999×10¹ + 1.610×10⁻¹

• How should we add them?

5

Floating point addition

• Adjust numbers to have same exponent

• Add the significands

• Normalize the sum
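The three steps above can be sketched in Python for the decimal problem 9.999×10¹ + 1.610×10⁻¹. This is an illustrative sketch rather than the lecture's own code: the function name `add_decimal_fp`, the four-digit significand, and the final round-half-even step are all assumptions.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def add_decimal_fp(sig_a, exp_a, sig_b, exp_b, digits=4):
    """Return (significand, exponent) for sig_a*10**exp_a + sig_b*10**exp_b."""
    # Step 1: adjust both numbers to the larger exponent.
    exp = max(exp_a, exp_b)
    a = Decimal(sig_a).scaleb(exp_a - exp)
    b = Decimal(sig_b).scaleb(exp_b - exp)

    # Step 2: add the significands.
    total = a + b

    # Step 3: normalize so that 1 <= |significand| < 10.
    while abs(total) >= 10:
        total = total.scaleb(-1)
        exp += 1
    while total != 0 and abs(total) < 1:
        total = total.scaleb(1)
        exp -= 1

    # Assumed final step: round to `digits` significant digits.
    quantum = Decimal(1).scaleb(1 - digits)
    total = total.quantize(quantum, rounding=ROUND_HALF_EVEN)
    if abs(total) >= 10:  # rounding can carry out, e.g. 9.9996 -> 10.000
        total = total.scaleb(-1).quantize(quantum, rounding=ROUND_HALF_EVEN)
        exp += 1
    return total, exp

print(add_decimal_fp("9.999", 1, "1.610", -1))
```

Under these assumptions the aligned sum is 10.01510×10¹, which normalizes to 1.001510×10² and rounds to 1.002×10².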

6

Binary addition (1)

• Problem: 1.01×2² + 1.101×2⁻¹

• Adjust numbers to have same exponent:



7

Binary addition (2)

• Problem: 1.11×2¹ + 1.01×2³

• Adjust numbers to have same exponent:
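Both binary problems can be checked with exact arithmetic. The sketch below is one possible way to do it, not the method shown in class; the helper names `parse_binary`, `format_binary`, and `add_binary` are hypothetical, and no rounding is applied, so the sum keeps every bit.

```python
from fractions import Fraction

def parse_binary(sig):
    """Convert a binary significand string such as '1.101' to an exact Fraction."""
    whole, _, frac = sig.partition(".")
    return Fraction(int(whole + frac, 2), 2 ** len(frac))

def format_binary(value):
    """Format a non-negative Fraction whose denominator is a power of two."""
    frac_bits = value.denominator.bit_length() - 1
    digits = format(value.numerator, "b").rjust(frac_bits + 1, "0")
    if frac_bits == 0:
        return digits
    return digits[:-frac_bits] + "." + digits[-frac_bits:]

def add_binary(sig_a, exp_a, sig_b, exp_b):
    """Return (aligned a, aligned b, normalized sum, exponent of the sum)."""
    # Adjust both significands to the larger exponent.
    exp = max(exp_a, exp_b)
    a = parse_binary(sig_a) / 2 ** (exp - exp_a)
    b = parse_binary(sig_b) / 2 ** (exp - exp_b)
    # Add the significands, then normalize to the form 1.xxx.
    total = a + b
    while total >= 2:
        total /= 2
        exp += 1
    while 0 < total < 1:
        total *= 2
        exp -= 1
    return format_binary(a), format_binary(b), format_binary(total), exp

print(add_binary("1.01", 2, "1.101", -1))
print(add_binary("1.11", 1, "1.01", 3))
```

For the first problem the smaller operand becomes 0.001101×2² and the sum is 1.011101×2²; for the second it becomes 0.0111×2³ and the sum is 1.1011×2³.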



8

8-bit floating-point format (2)

• Exponent (3 bits) is biased by 3

• The leading one of significand is implicit

• Zero is represented by all zeros

sign (1 bit) | exponent (3 bits) | significand (4 bits) | number base 2 | number base 10
0 | 100 | 0000 | |
0 | 000 | 1000 | |
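A minimal decoder for this 8-bit format, written as a sketch under the rules stated on the slide (bias of 3, implicit leading one, all zeros means zero). The function name `decode` is hypothetical, and the sketch assumes that every other bit pattern is an ordinary normalized number, with no patterns reserved for infinity or NaN.

```python
def decode(bits):
    """Decode a string such as '0 100 0000' (sign, exponent, significand)."""
    bits = bits.replace(" ", "")
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected 8 binary digits")
    if bits == "00000000":
        return 0.0                      # zero is represented by all zeros
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:4], 2) - 3    # remove the bias of 3
    fraction = int(bits[4:], 2)         # the 4 stored significand bits
    significand = 1 + fraction / 16     # restore the implicit leading one
    return sign * significand * 2.0 ** exponent

print(decode("0 100 0000"))
print(decode("0 000 1000"))
```

Read this way, 0 100 0000 is 1.0000×2¹ and 0 000 1000 is 1.1000×2⁻³.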

9

Practice

Add two numbers from previous slide

sign (1 bit) | exponent (3 bits) | significand (4 bits) | number base 2 | number base 10
0 | 100 | 0000 | |
0 | 000 | 1000 | |
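The practice sum can be checked exactly. The sketch below assumes the same reading of the format as the previous slide (bias of 3, implicit leading one); `field_value` is a hypothetical helper, and the final check asks whether the normalized sum fits in 4 stored significand bits.

```python
from fractions import Fraction

def field_value(exponent_bits, fraction_bits):
    """Exact value of a positive number given its exponent and significand fields."""
    significand = 1 + Fraction(int(fraction_bits, 2), 16)
    return significand * Fraction(2) ** (int(exponent_bits, 2) - 3)

a = field_value("100", "0000")   # 1.0000 x 2^1
b = field_value("000", "1000")   # 1.1000 x 2^-3
total = a + b

# Normalize the sum to the form 1.xxx x 2^exponent.
significand, exponent = total, 0
while significand >= 2:
    significand /= 2
    exponent += 1
while significand < 1:
    significand *= 2
    exponent -= 1

# The sum fits only if the significand has at most 4 fraction bits.
fits = (significand * 16).denominator == 1
print(total, significand, exponent, fits)
```

The exact sum is 35/16, or 1.00011×2¹ in binary. That significand needs 5 fraction bits, so it does not fit in the 4-bit field without rounding.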

10

Problem

11

Rounding (1)

• Round 1.00011 to have one fewer digit

• Modes
  – Always round up (IRS)
  – Always round down
  – Truncate
  – Round to nearest even

12

Rounding (2)

• Round -1.00011 to have one fewer digit

• Modes
  – Always round up (IRS)
  – Always round down
  – Truncate
  – Round to nearest even
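The four modes can be compared on both 1.00011 and -1.00011. This sketch assumes that "round up" means toward positive infinity, "round down" means toward negative infinity, and "one fewer digit" means keeping 4 fraction bits; the function name `round_to_bits` and the mode strings are hypothetical.

```python
import math
from fractions import Fraction

def round_to_bits(value, frac_bits, mode):
    """Round an exact Fraction to `frac_bits` binary fraction bits."""
    scaled = value * 2 ** frac_bits          # still exact
    if mode == "up":
        n = math.ceil(scaled)                # toward +infinity
    elif mode == "down":
        n = math.floor(scaled)               # toward -infinity
    elif mode == "truncate":
        n = math.trunc(scaled)               # toward zero
    elif mode == "nearest_even":
        n = round(scaled)                    # Fraction rounds ties to even
    else:
        raise ValueError(mode)
    return Fraction(n, 2 ** frac_bits)

x = Fraction(0b100011, 32)                   # 1.00011 in binary
for mode in ("up", "down", "truncate", "nearest_even"):
    print(mode, round_to_bits(x, 4, mode), round_to_bits(-x, 4, mode))
```

Because the dropped bit is exactly half of the last kept place, both values are ties. For 1.00011, up and nearest even give 1.0010 while down and truncate give 1.0001. For -1.00011, up and truncate give -1.0001 while down and nearest even give -1.0010.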

13

Ensuring accurate results

• Our significands are 4 bits wide.

• We use 6 bits when adding two significands.
  – Guard bit
  – Round bit

• Purpose: Accurate rounding
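A small sketch of why the two extra bits help. It assumes that the 4 bits are fraction bits, that the guard and round bits are two extra low-order positions kept while aligning and adding, and that the result is rounded to nearest even. The function name `add_aligned` and the chosen operands (1.0000×2⁰ + 1.1111×2⁻²) are illustrative, not from the slides, and the sketch does not renormalize the sum.

```python
def add_aligned(sig_a, sig_b, shift, extra_bits):
    """Add two significands held as integers with 4 fraction bits.

    sig_b is shifted right by `shift` places to align it with sig_a.
    `extra_bits` low-order bits are kept during the addition and then
    rounded away (nearest, ties to even).
    """
    a = sig_a << extra_bits
    b = (sig_b << extra_bits) >> shift       # bits shifted further out are lost
    total = a + b
    if extra_bits == 0:
        return total
    kept = total >> extra_bits
    dropped = total & ((1 << extra_bits) - 1)
    half = 1 << (extra_bits - 1)
    if dropped > half or (dropped == half and kept & 1):
        kept += 1
    return kept

no_extra = add_aligned(0b10000, 0b11111, shift=2, extra_bits=0)
with_extra = add_aligned(0b10000, 0b11111, shift=2, extra_bits=2)
print(format(no_extra, "05b"), format(with_extra, "05b"))
```

Without the extra bits the result is 1.0111; with guard and round bits it is 1.1000, which is closer to the exact sum 1.0111111.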

14

Adding large numbers

• What if we add 1.1111×2⁴ + 1.1111×2⁴?
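The sum can be worked out exactly. This sketch assumes the 3-bit exponent with bias 3 from the 8-bit format, with all eight exponent patterns usable, so the largest exponent is 4; the constant `MAX_EXPONENT` is a name introduced here for illustration.

```python
from fractions import Fraction

MAX_EXPONENT = 0b111 - 3             # largest 3-bit field minus the bias

significand = Fraction(0b11111, 16)  # 1.1111 in binary
exponent = 4

# The exponents already match, so add the significands directly.
total = significand + significand

# Normalize the sum to the form 1.xxx.
while total >= 2:
    total /= 2
    exponent += 1

overflow = exponent > MAX_EXPONENT
print(total, exponent, overflow)
```

The normalized sum is 1.1111×2⁵. Under the stated assumption an exponent of 5 cannot be stored in 3 bits, so the addition overflows.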

15

How can we get underflow?
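One possible answer, offered as an assumption rather than the lecture's intended one: subtract two nearly equal small numbers. The sketch uses the 8-bit format's smallest exponent, 0 - 3 = -3, and the hypothetical helper `value`.

```python
from fractions import Fraction

MIN_EXPONENT = 0b000 - 3             # smallest 3-bit field minus the bias

def value(fraction_bits, exponent):
    """Exact value of 1.ffff x 2^exponent for a 4-bit fraction field."""
    return (1 + Fraction(fraction_bits, 16)) * Fraction(2) ** exponent

# 1.0010 x 2^-3 minus 1.0001 x 2^-3
difference = value(0b0010, -3) - value(0b0001, -3)

# Normalize the (positive) difference to the form 1.xxx.
significand, exponent = difference, 0
while significand < 1:
    significand *= 2
    exponent -= 1

underflow = exponent < MIN_EXPONENT
print(difference, exponent, underflow)
```

The difference is 1.0000×2⁻⁷, which is nonzero but smaller than anything the format can hold with a minimum exponent of -3.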

16

Associativity of arithmetic

• (x+y)+z = x+(y+z)

• When is this true?

17

Breakdown of associativity

• Values
  – x = 1.0000
  – y = 0.00001
  – z = 0.00001

Assume rounding by truncation.

• (x+y)+z

• x+(y+z)
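The two groupings can be evaluated with a simulated adder that truncates every result to a normalized significand with 4 fraction bits. This is a sketch under that assumption; it ignores exponent range limits, and the names `truncate` and `fp_add` are hypothetical.

```python
import math
from fractions import Fraction

def truncate(value, frac_bits=4):
    """Truncate a non-negative value to a normalized significand with frac_bits bits."""
    if value == 0:
        return value
    significand, exponent = value, 0
    while significand >= 2:
        significand /= 2
        exponent += 1
    while significand < 1:
        significand *= 2
        exponent -= 1
    kept = Fraction(math.trunc(significand * 2 ** frac_bits), 2 ** frac_bits)
    return kept * Fraction(2) ** exponent

def fp_add(a, b):
    """Add exactly, then truncate the result as the hardware would."""
    return truncate(a + b)

x = Fraction(1)          # 1.0000
y = Fraction(1, 32)      # 0.00001
z = Fraction(1, 32)      # 0.00001

left = fp_add(fp_add(x, y), z)    # (x+y)+z
right = fp_add(x, fp_add(y, z))   # x+(y+z)
print(left, right, left == right)
```

In (x+y)+z each small addend is truncated away in turn, leaving 1.0000. In x+(y+z) the two small values first combine to 0.0001, which survives the final addition and gives 1.0001.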

18

MIPS floating point

• 32 floating-point registers (32 bits each)

• Instructions
  – Addition: add.s, add.d
  – Subtraction: sub.s, sub.d
  – Multiplication: mul.s, mul.d
  – Division: div.s, div.d
  – Comparison: c.x.s and c.x.d, where x is one of eq, neq, lt, le, gt, ge
  – Conditional branch: bc1t, bc1f

19

Summary

• Computers aren’t limited to integers

• Floating-point arithmetic is quirky
  – Loss of precision due to rounding
  – Underflow
  – Overflow

• Big picture: Floating point arithmetic can be implemented with enough ______________________.
