The Storage Hierarchy
Transcript of The Storage Hierarchy
From fastest to slowest:

- Registers (fast, expensive, few)
- Cache memory
- Main memory (RAM)
- Hard disk
- Removable media (CD, DVD, etc.)
- Internet (slow, cheap, plentiful)
What is Main Memory?
Main memory is where data reside for a program that is currently running. It is often called RAM (Random Access Memory) because you can load from or store into any main memory location at any time. It is much slower than cache, and therefore much cheaper and much bigger.
What Main Memory Looks Like
You can think of main memory as one big, long 1D array of bytes, addressed 0, 1, 2, 3, … up to, for example, 536,870,911 (the last byte of a 512 MB machine).
Cache
Caching is a technique built into the memory subsystem of your computer. Its main purpose is to speed the computer up while keeping its price low: caching lets the computer get through your tasks more rapidly.
![Page 6: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/6.jpg)
Librarian Example

A librarian sits behind a desk and fetches the books you ask for from a set of stacks in the storeroom. You ask for the book Moby Dick. The librarian goes into the storeroom, gets the book, returns to the counter, and hands you the book. Later, you come back to return the book; the librarian takes it, carries it back to the storeroom, and then returns to the counter to wait for another customer.

Now suppose the next customer also asks for Moby Dick. The librarian has to go back to the storeroom to fetch the very book he just put away and hand it to the client.
![Page 7: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/7.jpg)
Librarian Example Cont.

Under this model, the librarian has to make a complete round trip to fetch every book, even very popular ones that are requested frequently. Is there a way to improve the performance of the librarian?
![Page 8: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/8.jpg)
Librarian Example with Cache

Let's give the librarian a backpack that can hold 10 books (in computer terms, the librarian now has a 10-book cache). Into this backpack he will put the books the clients return to him, up to a maximum of 10.

The day starts, and the librarian's backpack is empty. Our first client arrives and asks for Moby Dick. The librarian has to go to the storeroom to get the book, and he gives it to the client.
![Page 9: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/9.jpg)
Librarian Example with Cache (cont.)

Later, the client returns the book to the librarian. Instead of walking back to the storeroom, the librarian puts the book in his backpack and stays at the counter (checking first that the bag isn't full).

Another client arrives and asks for Moby Dick. Before going to the storeroom, the librarian checks whether the title is in his backpack. He finds it! All he has to do is take the book out of the backpack and hand it over. There is no journey to the storeroom, so the client is served far more quickly.
![Page 10: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/10.jpg)
Not in Backpack

What if the client asks for a title that is not in the cache (the backpack)? In that case, the librarian is actually less efficient with a cache than without one, because he takes the time to search his backpack first.

Even in this simple example, though, the latency (the waiting time) of searching the cache is negligible: the cache is small (10 books), so the time it takes to notice a miss is only a tiny fraction of the time a journey to the storeroom takes.
![Page 11: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/11.jpg)
What is Cache?

- A special kind of memory where data reside that are about to be used or have just been used.
- Very fast, therefore very expensive and very small (typically 100 to 10,000 times as expensive as RAM per byte).
- Data in cache can be loaded into or stored from registers at speeds comparable to the speed of performing computations.
- Data that are not in cache (but are in main memory) take much longer to load or store.
- Cache sits near the CPU: either inside the CPU or on the motherboard that the CPU sits on.
![Page 12: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/12.jpg)
The Relationship Between Main Memory & Cache
![Page 13: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/13.jpg)
13
RAM is Slow

The speed of data transfer between main memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.
![Page 14: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/14.jpg)
From Cache to the CPU
Typically, data move between cache and the CPU at speeds relatively near to that of the CPU performing calculations.
Why Have Cache?

Cache is much closer to the speed of the CPU, so the CPU doesn't have to wait nearly as long for stuff that's already in cache: it can do more operations per second!
![Page 16: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/16.jpg)
Important Facts about Cache

- Cache technology is the use of a faster but smaller memory type to accelerate a slower but larger memory type.
- When using a cache, you must check the cache to see if an item is in there. If it is, it's called a cache hit. If not, it's called a cache miss, and the computer must wait for a round trip to the larger, slower memory area.
- A cache has some maximum size that is much smaller than the larger storage area.
![Page 17: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/17.jpg)
Facts about Cache (cont.)

- It is possible to have multiple layers of cache.
- In our librarian example, the smaller but faster memory type is the backpack, and the storeroom represents the larger and slower memory type. This is a one-level cache.
- There might be another layer of cache: a shelf behind the counter that can hold 100 books. The librarian can check the backpack, then the shelf, and then the storeroom. This would be a two-level cache.
![Page 18: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/18.jpg)
How Cache Works

When you request data from a particular address in main memory, here's what happens:

1. The hardware checks whether the data for that address are already in cache. If so, it uses them.
2. Otherwise, it loads from main memory the entire cache line that contains the address.
![Page 19: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/19.jpg)
If It's in Cache, It's Also in RAM

If a particular memory address is currently in cache, then it's also in main memory (RAM). That is, all of a program's data are in main memory, but some are also in cache.
![Page 20: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/20.jpg)
Cache Use Jargon

- Cache hit: the data that the CPU needs right now are already in cache (remember the librarian example).
- Cache miss: the data that the CPU needs right now are not currently in cache.

If all of your data are small enough to fit in cache, then when you run your program you'll get almost all cache hits (except at the very beginning), which means your performance could be excellent!

Sadly, this rarely happens in real life: most problems of scientific or engineering interest are bigger than just a few MB.
![Page 21: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/21.jpg)
Cache Lines

- A cache line is a small, contiguous region in cache, corresponding to a contiguous region in RAM of the same size, that is loaded all at once. Typical cache line sizes: 32 to 1024 bytes.
- Cache reuse: a program is much more efficient if all of its data and instructions fit in cache; try to use what's in cache a lot before touching anything that isn't (e.g., tiling).
![Page 22: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/22.jpg)
Improving Your Cache Hit Rate
Many scientific codes use a lot more data than can fit in cache all at once.
Therefore, you need to ensure a high cache hit rate even though you’ve got much more data than cache.
So, how can you improve your cache hit rate? Use the same solution as in Real Estate:
Location, Location, Location!
![Page 23: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/23.jpg)
Cache Example 1

Assume: cache line size (CLS) = 32 bytes, cache size (CS) = 64 lines, direct mapping, and each element is 4 bytes long.

```fortran
real a(1024, 100)
...
do i = 1, 1024
  do j = 1, 100
    a(i,j) = a(i,j) + 1
  end do
end do
```

Fortran arrays are column-major, so consecutive values of the inner index j are 1024 elements (4096 bytes) apart in memory: every access touches a different cache line. Cache misses = 1024 * 100 = 102,400.
![Page 24: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/24.jpg)
Example 1 Using Loop Interchange

Interchange i and j:

```fortran
real a(1024, 100)
...
do j = 1, 100
  do i = 1, 1024
    a(i,j) = a(i,j) + 1
  end do
end do
```

Now the inner loop runs down a column, which is contiguous in memory, so each 32-byte cache line serves 8 consecutive elements. This takes advantage of spatial locality: cache misses = 102,400 / 8 = 12,800, the number of cache lines occupied by the array a.
![Page 25: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/25.jpg)
Tiling
![Page 26: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/26.jpg)
26
- Tile: a small rectangular subdomain of a problem domain. Sometimes called a block or a chunk.
- Tiling: breaking the domain into tiles.
- Tiling strategy: operate on each tile to completion, then move to the next tile.
- Tile size can be set at runtime, according to what's best for the machine that you're running on.
![Page 27: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/27.jpg)
Example 2

```fortran
do j = 1, 100
  do i = 1, 4096
    a(i) = a(i)*a(i)
  end do
end do
```

Cache misses = (4096/8) * 100 = 51,200. The cache holds 512 elements (64 lines of 8 elements each), so after accessing a(1..512) the cache is full, and a cache line has to be written back to memory before each new line can be brought in; by the end of a pass, the start of the array has long since been evicted. Each element a(i) is reused 100 times, so we can apply tiling to reduce the number of cache misses.
![Page 28: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/28.jpg)
Example 2 Using Tiling

```fortran
do ii = 1, 4096, B
  do j = 1, 100
    do i = ii, min(ii+B-1, 4096)
      a(i) = a(i)*a(i)
    end do
  end do
end do
```

The cache can accommodate 512 elements of a. Thus, if we choose B = 512, each tile is loaded into cache once and then reused for all 100 passes, so cache misses = 4096/8 = 512, one per cache line of the array.
![Page 29: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/29.jpg)
29
A Sample Application: Matrix-Matrix Multiply

Let A, B and C be matrices of sizes nr × nc, nr × nk and nk × nc, respectively:

$$A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,nc} \\
a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,nc} \\
a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,nc} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{nr,1} & a_{nr,2} & a_{nr,3} & \cdots & a_{nr,nc}
\end{pmatrix}$$

$$B = \begin{pmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \cdots & b_{1,nk} \\
b_{2,1} & b_{2,2} & b_{2,3} & \cdots & b_{2,nk} \\
b_{3,1} & b_{3,2} & b_{3,3} & \cdots & b_{3,nk} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{nr,1} & b_{nr,2} & b_{nr,3} & \cdots & b_{nr,nk}
\end{pmatrix}$$

$$C = \begin{pmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,nc} \\
c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,nc} \\
c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,nc} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{nk,1} & c_{nk,2} & c_{nk,3} & \cdots & c_{nk,nc}
\end{pmatrix}$$
![Page 30: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/30.jpg)
30
Matrix Multiply: Naïve Version

```fortran
SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
 &                                   nr, nc, nq)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2

  INTEGER :: r, c, q

  DO c = 1, nc
    DO r = 1, nr
      dst(r,c) = 0.0
      DO q = 1, nq
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_naive
```
![Page 31: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/31.jpg)
31
Multiplying Within a Tile

```fortran
SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
 &                  rstart, rend, cstart, cend, qstart, qend)
  IMPLICIT NONE
  INTEGER,INTENT(IN) :: nr, nc, nq
  REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
  REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
  REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
  INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend

  INTEGER :: r, c, q

  DO c = cstart, cend
    DO r = rstart, rend
      IF (qstart == 1) dst(r,c) = 0.0
      DO q = qstart, qend
        dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
      END DO
    END DO
  END DO
END SUBROUTINE matrix_matrix_mult_tile
```
![Page 32: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/32.jpg)
32
The Advantages of Tiling

- It lets your code exploit data locality better, to get much more cache reuse: your code runs faster!
- It's a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
- If you don't need tiling (because of the hardware, the compiler or the problem size), you can turn it off by simply setting the tile size equal to the problem size.
![Page 33: The Storage Hierarchy](https://reader035.fdocuments.in/reader035/viewer/2022062301/56815a92550346895dc80a96/html5/thumbnails/33.jpg)
33
Will Tiling Always Work?

Tiling WON'T always work. Why? Well, tiling works well when:

- the order in which the calculations occur doesn't matter much, AND
- there are lots and lots of calculations to do for each memory movement.

If either condition is absent, then tiling may not help.