LECTURE 6 WELCOME - Utrecht University

Post on 17-Jul-2020

LECTURE 6

WELCOME

PART 1

THE CACHE

cache

Why is RAM slow?

Runs at a lower clock speed

Too far from the CPU

c = 300,000 km/s

at 4 GHz: 7.5 cm per cycle

signal speed in copper is lower

actually 5 cm per cycle

2.5 cm there and back
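The arithmetic behind these numbers can be checked with a small sketch. The function names are made up for illustration, and the 200,000 km/s figure for copper is an assumption chosen to match the slide's "5 cm per cycle", not a measured value.

```cpp
#include <cassert>

// Distance a signal covers during one clock cycle, in centimetres.
// speed_km_per_s: signal speed in km/s; clock_hz: clock frequency in Hz.
double distance_per_cycle_cm(double speed_km_per_s, double clock_hz) {
    assert(clock_hz > 0.0);
    const double cm_per_km = 100000.0;
    return speed_km_per_s * cm_per_km / clock_hz;
}

// Farthest a component can sit from the CPU if the signal must travel
// there and back within a single cycle: half the one-way distance.
double round_trip_reach_cm(double speed_km_per_s, double clock_hz) {
    return distance_per_cycle_cm(speed_km_per_s, clock_hz) / 2.0;
}
```

Calling `distance_per_cycle_cm(300000.0, 4e9)` reproduces the 7.5 cm figure; with the assumed 200,000 km/s in copper, the round trip leaves 2.5 cm.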

Level 1 cache

Level 2 cache

Registers: 0 cycles

L1: 2 cycles

L2: 15 cycles

RAM: 80 cycles

Registers: 0 cycles

Level 1 cache (32 KB): 4 cycles

Level 2 cache (256 KB): 11 cycles

Level 3 cache (6 MB): 39 cycles

RAM: 107 cycles
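To get a feeling for what these latencies mean in practice, here is a rough sketch of an average access cost. The hit fractions are hypothetical inputs (the slides give none), and each listed latency is assumed to be the total cost of an access served by that level.

```cpp
#include <cassert>

// Latencies in cycles, taken from the slide.
struct Latencies {
    double l1 = 4.0;
    double l2 = 11.0;
    double l3 = 39.0;
    double ram = 107.0;
};

// Average cost of a memory access, given the fraction of all accesses
// served by L1, L2 and L3 (hypothetical inputs); RAM serves the rest.
double average_latency_cycles(double l1_frac, double l2_frac, double l3_frac,
                              const Latencies& lat = Latencies{}) {
    const double ram_frac = 1.0 - l1_frac - l2_frac - l3_frac;
    assert(l1_frac >= 0.0 && l2_frac >= 0.0 && l3_frac >= 0.0);
    assert(ram_frac > -1e-9);  // fractions must not add up to more than 1
    return l1_frac * lat.l1 + l2_frac * lat.l2 + l3_frac * lat.l3 + ram_frac * lat.ram;
}
```

If every access hits L1 the result is 4 cycles; if every access goes to RAM it is 107 cycles, so even a small fraction of RAM accesses dominates the average.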

CACHE

slot  address  data
0     0050     411CBB372B37
1     0000     0A3246F3762B
2     0030     8910EE24BACF
3     0080     2AB348FE376C

RAM

address  data
0000     0A3246F3762B
0010     64000101EA67
0020     2BD634633642
0030     8910EE24BACF
0040     374C34648232
0050     411CBB372B37
0060     283E34A8623A
0070     A83829200176
0080     2AB348FE376C

Fully associative cache

Retrieving data:

CPU wants to read from RAM

Cache searches for address

If found, data is returned

Otherwise, RAM is used

Obtained data is stored in cache

Writing data:

CPU wants to write to RAM

Cache searches for address

If found, data is written

Otherwise, new entry is created

Data to be written is stored in cache

Stored data is written to RAM ‘later’
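The read and write steps above can be sketched as a toy simulation. Everything here is illustrative: RAM is modelled as a map, the class and member names are invented, and the round-robin replacement policy and the `flush()` function (standing in for "later") are assumptions, since the slides do not say which entry gets replaced or when the write-back happens.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy RAM: address -> value.
using Ram = std::unordered_map<std::uint32_t, std::uint64_t>;

struct FaEntry {
    bool valid = false;
    bool dirty = false;  // written in the cache, not yet written to RAM
    std::uint32_t address = 0;
    std::uint64_t data = 0;
};

class FullyAssociativeCache {
public:
    int hits = 0;
    int misses = 0;

    FullyAssociativeCache(std::size_t slots, Ram& ram) : entries_(slots), ram_(ram) {
        assert(slots > 0);
    }

    // Read: search every entry; on a miss, fetch from RAM and keep a copy.
    std::uint64_t read(std::uint32_t address) {
        if (FaEntry* e = find(address)) {
            ++hits;
            return e->data;
        }
        ++misses;
        FaEntry& slot = evict_next();
        slot.valid = true;
        slot.address = address;
        slot.data = ram_[address];
        return slot.data;
    }

    // Write: update the entry if present, otherwise create one; RAM is updated later.
    void write(std::uint32_t address, std::uint64_t value) {
        FaEntry* e = find(address);
        if (e == nullptr) {
            e = &evict_next();
            e->valid = true;
            e->address = address;
        }
        e->data = value;
        e->dirty = true;
    }

    // 'Later': write all modified entries back to RAM.
    void flush() {
        for (FaEntry& e : entries_) {
            if (e.valid && e.dirty) {
                ram_[e.address] = e.data;
                e.dirty = false;
            }
        }
    }

private:
    FaEntry* find(std::uint32_t address) {
        for (FaEntry& e : entries_) {
            if (e.valid && e.address == address) return &e;
        }
        return nullptr;
    }

    // Round-robin replacement (an assumption); a modified victim is written back first.
    FaEntry& evict_next() {
        FaEntry& e = entries_[next_];
        next_ = (next_ + 1) % entries_.size();
        if (e.valid && e.dirty) ram_[e.address] = e.data;
        e = FaEntry{};
        return e;
    }

    std::vector<FaEntry> entries_;
    Ram& ram_;
    std::size_t next_ = 0;
};
```

Note that `find` compares against every entry, which is the reason this kind of cache has to stay small.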

CACHE

line  tag   data
0000  0000  000000000000
0001  0000  000000000000
0002  1A50  8910EE24BACF
0003  0B70  2AB348FE376C
0004  0000  000000000000
0005  0000  000000000000
0006  0000  000000000000
0007  0000  000000000000

Set associative cache

Address: 0B700003 (line 0003, tag 0B70)

Steps:

Split address into ‘line’ and ‘tag’

At cache line ‘line’, verify ‘tag’

If tag matches, return data

Otherwise, get data from RAM
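The steps above can be sketched as follows, using the slide's example address 0B700003. The 16/16-bit split follows that example; the `valid` flag, the type names and the function names are assumptions added for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct SplitAddress {
    std::uint16_t tag;   // upper 16 bits of the address
    std::uint16_t line;  // lower 16 bits of the address
};

// Step 1: split the address into 'line' and 'tag'.
SplitAddress split_address(std::uint32_t address) {
    return {static_cast<std::uint16_t>(address >> 16),
            static_cast<std::uint16_t>(address & 0xFFFF)};
}

struct DmLine {
    bool valid = false;  // assumption: marks lines that hold real data
    std::uint16_t tag = 0;
    std::uint64_t data = 0;
};

constexpr std::size_t kDmLines = 8;  // the slide's cache has lines 0000..0007
using DmCache = std::array<DmLine, kDmLines>;

// Steps 2-4: look only at cache line 'line' and verify its tag.
// Returns true and fills `out` on a hit; on false the caller goes to RAM.
bool dm_lookup(const DmCache& cache, std::uint32_t address, std::uint64_t& out) {
    const SplitAddress a = split_address(address);
    const DmLine& line = cache[a.line % kDmLines];
    if (line.valid && line.tag == a.tag) {
        out = line.data;
        return true;
    }
    return false;
}
```

Only one tag comparison is needed per access, regardless of how many lines the cache has.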

Address: 0CA00006 (line 0006, tag 0CA0)

Address: 098A0006 (line 0006, tag 098A)

Both addresses map to line 0006, so each one replaces the other in the cache.
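A small sketch makes the conflict visible by counting hits for a sequence of accesses. The function name is invented, and the 8-line cache with a 16/16-bit address split mirrors the slide's example rather than any real hardware.

```cpp
#include <cstdint>
#include <vector>

// Counts cache hits for a sequence of addresses in a cache that stores
// exactly one tag per line (8 lines, tag = upper 16 bits, line = lower 16 bits).
int count_hits_one_tag_per_line(const std::vector<std::uint32_t>& accesses) {
    constexpr std::uint32_t kLines = 8;
    std::uint16_t tags[kLines] = {};
    bool valid[kLines] = {};
    int hits = 0;
    for (std::uint32_t address : accesses) {
        const std::uint32_t line = (address & 0xFFFF) % kLines;
        const std::uint16_t tag = static_cast<std::uint16_t>(address >> 16);
        if (valid[line] && tags[line] == tag) {
            ++hits;
        } else {
            // Miss: the new tag overwrites whatever was in this line.
            tags[line] = tag;
            valid[line] = true;
        }
    }
    return hits;
}
```

Alternating between 0CA00006 and 098A0006 never produces a hit, while the same pattern spread over two different lines hits on every access after the first two.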

N-way set associative cache

CACHE

line  tag 1  data 1        tag 2  data 2
0000  0000   000000000000  0000   000000000000
0001  0000   000000000000  0000   000000000000
0002  1A50   8910EE24BACF  0000   000000000000
0003  0B70   2AB348FE376C  0FC0   1056BBA001FF
0004  0000   000000000000  0000   000000000000
0005  0000   000000000000  0000   000000000000
0006  0000   000000000000  0000   000000000000
0007  0000   000000000000  0000   000000000000

Caching – Summary

Fully associative cache:

Based on an address, we search through all cache lines to see if the requested data is available. This kind of cache must be small, or the number of tag comparisons becomes huge.

Set associative cache:

Based on the address, we determine the single cache line where our data could be, and check only that line to see whether the data is available. Addresses that map to the same cache line keep evicting each other, which renders the cache useless for them. (With one entry per line, this scheme is usually called a direct-mapped cache.)

N-way set associative cache:

Every cache line can now hold N addresses. We need to check all N tags, so N must be small. However, several addresses sharing the same cache line can now be cached at the same time.
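As a rough sketch of the N-way variant: each line holds N (tag, data) pairs, and a lookup checks all N tags of one line. The class name, the `valid` flag and the round-robin replacement within a line are assumptions; the 16/16-bit address split follows the earlier slide example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Way {
    bool valid = false;
    std::uint16_t tag = 0;
    std::uint64_t data = 0;
};

class NWayCache {
public:
    NWayCache(std::size_t lines, std::size_t ways)
        : lines_(lines), ways_(ways), slots_(lines * ways), next_victim_(lines, 0) {
        assert(lines > 0 && ways > 0);
    }

    // Checks all N tags of the selected line; returns true and fills `out` on a hit.
    bool lookup(std::uint32_t address, std::uint64_t& out) const {
        const std::size_t line = (address & 0xFFFF) % lines_;
        const std::uint16_t tag = static_cast<std::uint16_t>(address >> 16);
        for (std::size_t w = 0; w < ways_; ++w) {
            const Way& s = slots_[line * ways_ + w];
            if (s.valid && s.tag == tag) {
                out = s.data;
                return true;
            }
        }
        return false;
    }

    // Stores data for `address`: reuses a way with a matching tag,
    // otherwise replaces the ways of that line in round-robin order.
    void store(std::uint32_t address, std::uint64_t data) {
        const std::size_t line = (address & 0xFFFF) % lines_;
        const std::uint16_t tag = static_cast<std::uint16_t>(address >> 16);
        Way* target = nullptr;
        for (std::size_t w = 0; w < ways_; ++w) {
            Way& s = slots_[line * ways_ + w];
            if (s.valid && s.tag == tag) target = &s;
        }
        if (target == nullptr) {
            target = &slots_[line * ways_ + next_victim_[line]];
            next_victim_[line] = (next_victim_[line] + 1) % ways_;
        }
        target->valid = true;
        target->tag = tag;
        target->data = data;
    }

private:
    std::size_t lines_;
    std::size_t ways_;
    std::vector<Way> slots_;               // lines_ * ways_ entries, grouped per line
    std::vector<std::size_t> next_victim_; // next way to replace, per line
};
```

With two ways, the addresses 0B700003 and 0FC00003 can both live in line 0003; a third address mapping to that line evicts one of them.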

So… How does this affect your program?

1. 64 bytes per cache line:

2. 32 KB L1 cache, 8-way set associative:

3. Memory latency of 107 cycles:

4. Prefetching:

5. L1 instruction cache:
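One consequence of points 1 and 2 can be worked out from the numbers alone. The sketch below assumes that the set is selected by the address bits directly above the 64-byte line offset, which is a common arrangement but is not stated on the slide; the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Number of sets in a cache: total size divided by (line size * ways).
std::uint64_t set_count(std::uint64_t cache_bytes, std::uint64_t line_bytes, std::uint64_t ways) {
    assert(line_bytes > 0 && ways > 0);
    return cache_bytes / (line_bytes * ways);
}

// Set that a byte address falls into, assuming the set index is taken
// from the address bits just above the line offset.
std::uint64_t set_index(std::uint64_t address, std::uint64_t line_bytes, std::uint64_t sets) {
    assert(line_bytes > 0 && sets > 0);
    return (address / line_bytes) % sets;
}
```

For a 32 KB, 8-way cache with 64-byte lines this gives 64 sets, so under the stated assumption addresses that are 64 * 64 = 4096 bytes apart land in the same set, and only 8 of them fit at once.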

PART 2

TOTAL RECAP

“Dear Charles,

In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selection amongst them (...).

One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation.

Therefore, one should attend PR3 and learn from it.

Love, Ada.”

10 TIPS straight from Ada Lovelace & Charles Babbage!

“HOW TO PASS PR3”

(0. Read the slides once more.)

1. Choose your tools. (timer, compiler, SVN, Excel, etc.)

2. Measure & note. (original performance, scalability, time for various parts of the app)

3. Take a step back. (think, don’t type: what could be done smarter? – then research)

4. Resist the urge. (don’t touch that sqrtf yet. Improve algorithms instead)

5. Measure & note. (things changed radically, so measure again, and write down things)

6. Now give in to the urge. (go wild: Cache. Low level. Multithread.)

7. Measure. Note. (don’t forget! More results means a better report and a higher grade.)

8. Goto 6. (there’s always more to tweak. Mind diminishing returns though.)

9. Add some SIMD. It’s mandatory. (really. Don’t forget.)

10. Add polish. Hand in. (at least make it *look* professional, it really helps)
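For the "Measure & note" steps, a minimal timer could look like the sketch below. It is one possible tool, not the one prescribed for PR3; the function name is invented, and taking the best of several runs is a design choice made here to reduce noise from other processes.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `repeats` times and returns the lowest wall-clock time
// of a single run, in milliseconds.
template <typename F>
double best_time_ms(F&& work, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}
```

Wrap the part of the application to be measured in a lambda, call `best_time_ms` before and after each optimization, and write both numbers down.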

Wednesday in the exam week – By MAIL!

FRIDAY

THE END