Transcript of: Chapter Seven - Sistemas de Memória (Memory Systems)
1998 Morgan Kaufmann Publishers; Mario Côrtes - MO401 - IC/Unicamp - 2002s1

Page 1: Chapter Seven - Memory Systems

Page 2:

• SRAM:

– value is stored on a pair of inverting gates

– very fast but takes up more space than DRAM (4 to 6 transistors)

• DRAM:

– value is stored as a charge on capacitor (must be refreshed)

– very small but slower than SRAM (factor of 5 to 10)

Memories: Review

[Figure: memory cell diagrams - SRAM cell (data/data lines and select signal, a pair of inverting gates) and DRAM cell (capacitor, pass transistor, word line, bit line)]

See the separate file: review of memory concepts

Page 3:

• Users want large and fast memories!

SRAM access times are 2 - 25ns at a cost of $100 to $250 per Mbyte.
DRAM access times are 60 - 120ns at a cost of $5 to $10 per Mbyte.
Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per Mbyte.

• Try and give it to them anyway

– build a memory hierarchy

Exploiting Memory Hierarchy

[Figure (1997): levels in the memory hierarchy - CPU at the top, then Level 1, Level 2, ..., Level n; access time increases with distance from the CPU, and the size of the memory grows at each level]

Page 4:

Memory Hierarchy

[Figure: the CPU connected to a chain of memory levels, with block transfers b1, b2, b3 between adjacent levels; cost per bit Ci ($/bit) decreases with distance from the CPU, speed goes from fast to slow, and size Si grows from smaller to larger]

Page 5:

Memory hierarchy (cost and speed)

• Average system cost ($/bit)

c_avg = (S1*C1 + S2*C2 + ... + Sn*Cn) / (S1 + S2 + ... + Sn)

• System goals

– average cost close to the cost of the cheapest level (disk)

– system speed close to the speed of the fastest level (cache)

• Today, assuming a 40 GB disk and 256 MB of main memory

– compute the average cost per bit (see the sketch below)
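A minimal sketch of this calculation for a two-level system (256 MB of DRAM plus a 40 GB disk), applying the formula above; the $/MB prices are mid-range figures from the Page 3 slide and are only illustrative assumptions:

    # Average cost per bit of a two-level hierarchy:
    # c_avg = (S1*C1 + ... + Sn*Cn) / (S1 + ... + Sn)
    BITS_PER_MB = 8 * 2**20

    levels = [
        # (size Si in bits,         cost Ci in $/bit)
        (256 * BITS_PER_MB,         7.50 / BITS_PER_MB),   # DRAM, ~$7.50/MB (assumed)
        (40 * 1024 * BITS_PER_MB,   0.15 / BITS_PER_MB),   # disk, ~$0.15/MB (assumed)
    ]

    c_avg = sum(s * c for s, c in levels) / sum(s for s, _ in levels)
    print(f"average cost: {c_avg:.2e} $/bit")   # stays close to the disk's $/bit, as intended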

Page 6:

Locality

• A principle that makes having a memory hierarchy a good idea

• If an item is referenced,

temporal locality: it will tend to be referenced again soon

spatial locality: nearby items will tend to be referenced soon.

Why does code have locality?

• Our initial focus: two levels (upper, lower)
  – block: minimum unit of data
  – hit: data requested is in the upper level (hit ratio - hit time)
  – miss: data requested is not in the upper level (miss ratio - miss penalty)
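As a concrete illustration (not from the slides), a simple loop already exhibits both kinds of locality described above:

    # The access pattern of this loop shows both kinds of locality.
    def sum_rows(matrix):
        total = 0                       # "total" and the loop variables are reused on
        for row in matrix:              # every iteration -> temporal locality
            for x in row:               # consecutive elements of "row" sit next to each
                total += x              # other in memory -> spatial locality
        return total

    print(sum_rows([[1, 2, 3], [4, 5, 6]]))   # 21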

Page 7:

Principle of locality

[Figure: two plots - the addresses referenced in the address space as a function of time (temporal locality), and the access frequency within a time window T across the address space (spatial locality)]

Page 8:

Two-level view

[Figure: processor, upper level and lower level connected by block-sized data transfers]

• Temporal locality: keep the most recently used items in the upper level
• Spatial locality: transfer whole blocks instead of individual words

Page 9:

Cache

[Figure: cache contents before the reference to Xn (holding X1, X2, X3, X4, Xn-1, Xn-2) and after the reference to Xn, when Xn has been brought into the cache]

Reference to location Xn

Page 10:

• Two issues:
  – How do we know if a data item is in the cache?
  – If it is, how do we find it?

• Our first example:
  – block size is one word of data
  – "direct mapped"

Policies:
• mapping: how addresses are mapped between cache and memory
• write: how data consistency between cache and memory is maintained
• replacement: which block to evict from the cache

For each item of data at the lower level, there is exactly one location in the cache where it might be.

e.g., lots of items at the lower level share locations in the upper level

Cache

Page 11:

• Mapping: address is modulo the number of blocks in the cache

Direct Mapped Cache

[Figure: 32-word memory and 8-entry cache - memory locations 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 map to the cache entries given by their three low-order bits (001 or 101); cache entries are indexed 000 through 111]

cache: 8 entries, 3 address bits
memory: 32 locations, 5 address bits
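A minimal sketch (not from the book) of the modulo mapping just described, using this slide's 32-word memory and 8-entry cache:

    # Direct-mapped placement: index = address mod (number of blocks).
    NUM_BLOCKS = 8                      # cache entries (3 index bits)

    def cache_index(address):
        return address % NUM_BLOCKS     # low-order address bits select the entry

    def cache_tag(address):
        return address // NUM_BLOCKS    # remaining high-order bits are stored as the tag

    for addr in (0b00001, 0b01001, 0b10101):
        print(f"address {addr:05b} -> index {cache_index(addr):03b}, tag {cache_tag(addr):02b}")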

Page 12:

Filling the cache on each miss

Reference trace (address in decimal | address in binary, split as tag + index | cache index | hit?):

  22 | 10 110 | 110 | miss
  26 | 11 010 | 010 | miss
  22 | 10 110 | 110 | hit
  26 | 11 010 | 010 | hit
  16 | 10 000 | 000 | miss
   3 | 00 011 | 011 | miss
  16 | 10 000 | 000 | hit
  18 | 10 010 | 010 | miss

Cache contents (Index | V | Tag | Data) after each miss; entries not listed remain invalid (V = N):

  Initially: all eight entries (000 through 111) are invalid.

  After the miss on address 10110:
    110 | Y | 10 | M(10110)

  After the miss on address 11010:
    010 | Y | 11 | M(11010)
    110 | Y | 10 | M(10110)

  After the miss on address 10000:
    000 | Y | 10 | M(10000)
    010 | Y | 11 | M(11010)
    110 | Y | 10 | M(10110)

  After the miss on address 00011:
    000 | Y | 10 | M(10000)
    010 | Y | 11 | M(11010)
    011 | Y | 00 | M(00011)
    110 | Y | 10 | M(10110)
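The trace above can be replayed with a small direct-mapped cache model; this is an illustrative sketch, not code from the book:

    # Direct-mapped cache with 8 one-word blocks, replaying the reference trace.
    NUM_BLOCKS = 8

    cache = [None] * NUM_BLOCKS          # each entry holds a tag, or None when invalid

    def access(address):
        index = address % NUM_BLOCKS
        tag = address // NUM_BLOCKS
        hit = cache[index] == tag
        if not hit:
            cache[index] = tag           # on a miss, fetch the block and overwrite the entry
        return hit

    for addr in (22, 26, 22, 26, 16, 3, 16, 18):
        print(f"{addr:2d} ({addr:05b}) -> {'hit' if access(addr) else 'miss'}")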

Page 13:

Direct Mapped Cache

[Figure: the 32-bit address (bit positions 31-0) split into a 20-bit tag, a 10-bit index and a 2-bit byte offset; a 1024-entry array of Valid/Tag/Data; the stored tag is compared with the address tag to produce Hit, and the 32-bit Data word is read out]

• direct mapping

• byte offset: only needed for byte accesses

• cache width: valid + tag + data

• a cache with 2^n lines has an n-bit index

• cache line: 1 + (30 - n) + 32 bits (valid + tag + data)

• total cache size = 2^n * (63 - n) bits
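A one-line check of the size formula above, assuming this slide's 32-bit addresses and one-word blocks:

    # Total bits of a direct-mapped cache with 2^n one-word lines:
    # each line holds 1 valid bit, a (30 - n)-bit tag and a 32-bit word.
    def cache_bits(n):
        line_bits = 1 + (30 - n) + 32          # = 63 - n bits per line
        return 2**n * line_bits

    print(cache_bits(10))    # 1024 lines -> 54,272 bits (about 53 Kbit)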

Page 14:

Pipelined datapath

• Data memory = data cache
• Instruction memory = instruction cache
• Architecture:
  – Harvard
  – or modified Harvard

• A miss is handled much like a stall:
  – data miss: freeze the pipeline
  – instruction miss:
    • instructions already in the pipeline proceed
    • insert "bubbles" into the following stages
    • wait for the hit
    • while the instruction has not been read, keep the original address (PC - 4)

[Figure: Harvard organization (instruction memory IM and data memory DM connected directly to the CPU) versus modified Harvard organization (IM and DM caches connected to the CPU and backed by a single main memory)]

Page 15:

The caches in the DECStation 3100

[Figure: address (bit positions 31-0) split into a 16-bit tag, a 14-bit index and a 2-bit byte offset; 16K entries of Valid (1 bit), Tag (16 bits) and Data (32 bits); the tag comparison produces Hit and the 32-bit Data word]

Page 16:

Spatial locality: increasing the block size

[Figure: 64 KB cache with 4-word (16-byte) blocks - address split into a 16-bit tag, a 12-bit index, a 2-bit block offset and a byte offset; 4K entries of V, Tag (16 bits) and Data (128 bits); a multiplexor selects one of the four 32-bit words using the block offset, producing Hit and Data]

Page 17:

• Read hits
  – this is what we want!

• Read misses
  – stall the CPU, fetch the block from memory, deliver it to the cache, restart

• Write hits:
  – can replace the data in cache and memory (write-through)
  – write the data only into the cache, and write it back to memory later (write-back)
    • also known as copy-back
    • requires a dirty bit

• Write misses:
  – read the entire block into the cache, then write the word

• Comparison
  – performance: write-back
  – reliability: write-through
  – parallel processing: write-through

Hits vs. Misses (update/write policy)

Page 18:

Width of the Memory - Cache - CPU communication

[Figure: three memory organizations - (a) one-word-wide memory organization (CPU, cache, bus, memory); (b) wide memory organization (CPU, multiplexor, cache, bus, wide memory); (c) interleaved memory organization (CPU, cache, bus, memory banks 0 to 3)]

• Assume:

  • 1 clock cycle to send the address

  • 15 clock cycles for each DRAM read

  • 1 clock cycle to send one word back

  • a cache line of 4 words

Page 19:

Miss penalty vs. communication width

• One-word-wide memory:
  – 1 + 4*15 + 4*1 = 65 cycles (miss penalty)
  – bytes per cycle on a miss: 4 * 4 / 65 = 0.25 B/clock

• Two-word-wide memory:
  – 1 + 2*15 + 2*1 = 33 cycles
  – bytes per cycle on a miss: 4 * 4 / 33 = 0.48 B/clock

• Four-word-wide memory:
  – 1 + 1*15 + 1*1 = 17 cycles
  – bytes per cycle on a miss: 4 * 4 / 17 = 0.94 B/clock
  – cost: a 128-bit-wide multiplexor and its delay

• One word wide, but with 4 interleaved memory banks:
  – the DRAM read times are parallelized (overlapped)
  – most common: interleaving on the most-significant address bits
  – 1 + 1*15 + 4*1 = 20 cycles
  – bytes per cycle on a miss: 4 * 4 / 20 = 0.8 B/clock
  – also works well for writes (4 simultaneous writes):
    • well suited to caches with write-through

(These calculations are reproduced in the sketch below.)
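A short script to reproduce the four cases; the cycle counts are the slide's assumptions (1 cycle for the address, 15 per DRAM access, 1 per word returned, 4-word line):

    # Miss penalty and transfer bandwidth for the four memory organizations.
    LINE_WORDS, WORD_BYTES = 4, 4
    ADDR, DRAM, XFER = 1, 15, 1

    organizations = {
        "1-word-wide":         ADDR + LINE_WORDS * DRAM + LINE_WORDS * XFER,   # 65
        "2-word-wide":         ADDR + 2 * DRAM + 2 * XFER,                     # 33
        "4-word-wide":         ADDR + 1 * DRAM + 1 * XFER,                     # 17
        "4 interleaved banks": ADDR + 1 * DRAM + LINE_WORDS * XFER,            # 20
    }

    for name, penalty in organizations.items():
        bandwidth = LINE_WORDS * WORD_BYTES / penalty
        print(f"{name:20s}: {penalty:2d} cycles, {bandwidth:.2f} bytes/cycle")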

Page 20:

Approximate calculation of system efficiency

• goal:
  – average access time close to that of the fastest level

• assume two levels:
  – tA1 = access time of M1
  – tA2 = access time of M2 (M2 access + miss penalty)
  – tA = average access time of the system
  – r = tA2 / tA1
  – e = system efficiency = tA1 / tA

tA = H * tA1 + (1-H) * tA2

tA / tA1 = H + (1-H) * r = 1/e

e = 1 / [ r + H * (1-r) ]

[Figure: two-level system M1 (tA1) backed by M2 (tA2); plot of efficiency e (0% - 100%) versus hit rate H (0% - 100%) for r = 2, 10 and 100]
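A small sketch of the efficiency formula above, with r = tA2/tA1; the hit rates chosen are arbitrary illustration points:

    # e = 1 / (r + H*(1 - r)), reproducing the kind of curves shown in the plot.
    def efficiency(hit_rate, r):
        return 1.0 / (r + hit_rate * (1.0 - r))

    for r in (2, 10, 100):
        row = ", ".join(f"H={h:.2f}: e={efficiency(h, r):.2f}" for h in (0.80, 0.95, 0.99))
        print(f"r={r:3d} -> {row}")
    # Only with hit rates very close to 1 does the system approach the speed of M1.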

Page 21:

Use split caches because there is more spatial locality in code:

Miss rate vs block size

[Figure: miss rate (0% - 40%) versus block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB]

Program | Block size in words | Instruction miss rate | Data miss rate | Effective combined miss rate
gcc     | 1                   | 6.1%                  | 2.1%           | 5.4%
gcc     | 4                   | 2.0%                  | 1.7%           | 1.9%
spice   | 1                   | 1.2%                  | 1.3%           | 1.2%
spice   | 4                   | 0.3%                  | 0.6%           | 0.4%

Annotations on the plot: very small blocks are worse (less spatial locality is exploited); very large blocks are also worse (internal fragmentation, fewer blocks in the cache, larger miss penalty).

Page 22:

Performance

• Simplified model:

execution time = (execution cycles + stall cycles) * cycle time
stall cycles = RD stalls + WR stalls

RD stall cycles = # of RDs * RD miss ratio * RD miss penalty
WR stall cycles = # of WRs * WR miss ratio * WR miss penalty

(in reality it is more complicated than this)

• Two ways of improving performance:– decreasing the miss ratio– decreasing the miss penalty

What happens if we increase block size?

Page 23:

Example, pages 565 - 566

• gcc: instruction miss rate = 2%; data cache miss rate = 4%
• CPI = 2 (without memory stalls); miss penalty = 40 cycles
• Instruction miss cycles = I * 2% * 40 = 0.8 I
• Knowing that lw + sw = 36% of the instructions:
  – data miss cycles = I * 36% * 4% * 40 = 0.58 I
• Memory stall cycles = 0.8 I + 0.58 I = 1.38 I
  – total CPI = 2 + 1.38 = 3.38
• Speed ratio with vs. without memory stalls = ratio of the CPIs
  – 3.38 / 2 = 1.69
• If we improved the architecture (CPI) without touching the memory system:
  – CPI = 1
  – ratio = 2.38 / 1 = 2.38
  – the relative impact of the memory system grows (Amdahl's Law)
• See the example on page 567: raising the clock rate has a similar effect
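The CPI arithmetic of this example can be checked with a short sketch (parameter names are mine; the values are the slide's gcc numbers):

    # CPI including memory stall cycles per instruction.
    def cpi_with_stalls(base_cpi, i_miss=0.02, d_miss=0.04, mem_frac=0.36, penalty=40):
        stalls = i_miss * penalty + mem_frac * d_miss * penalty   # stalls per instruction
        return base_cpi + stalls

    print(cpi_with_stalls(2.0))   # ~3.38 (the slide rounds 0.576 to 0.58) -> slowdown 1.69
    print(cpi_with_stalls(1.0))   # ~2.38 -> slowdown 2.38: memory matters more as CPI drops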

Page 24:

Decreasing miss ratio with associativity

[Figure: searching for a block - direct mapped (blocks numbered 0-7, search exactly one entry), set associative (sets numbered 0-3, search the entries of one set), fully associative (search all entries); each entry holds a Tag and Data]

Page 25:

Decreasing miss ratio with associativity

[Figure: an eight-block cache organized as one-way set associative (direct mapped, blocks 0-7), two-way set associative (sets 0-3), four-way set associative (sets 0-1), and eight-way set associative (fully associative); each entry holds Tag and Data]

Page 26:

An implementation

[Figure: four-way set associative cache - address (bit positions 31-0) split into a 22-bit tag and an 8-bit index; 256 sets (0-255), each with four Valid/Tag/Data entries; four comparators and a 4-to-1 multiplexor produce Hit and the 32-bit Data]

Page 27:

Performance

[Figure: miss rate (0% - 15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB]

Page 28:

Replacement policy

• Which block should be discarded?

  – FIFO

  – LRU

  – Random

• see section 7.5
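For illustration only, a minimal LRU policy for a single cache set can be sketched as follows (one way to express the idea, not the book's implementation):

    # LRU replacement within one set, using an ordered dict as the recency list.
    from collections import OrderedDict

    class LRUSet:
        def __init__(self, ways):
            self.ways = ways
            self.blocks = OrderedDict()          # tag -> data, least recently used first

        def access(self, tag):
            if tag in self.blocks:               # hit: move to most-recently-used position
                self.blocks.move_to_end(tag)
                return True
            if len(self.blocks) == self.ways:    # miss in a full set: evict the LRU block
                self.blocks.popitem(last=False)
            self.blocks[tag] = None              # bring the missing block in
            return False

    s = LRUSet(ways=2)
    print([s.access(t) for t in (1, 2, 1, 3, 2)])   # [False, False, True, False, False]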

Page 29:

Decreasing miss penalty with multilevel caches

• Add a second level cache:
  – often the primary cache is on the same chip as the processor
  – use SRAMs to add another cache above primary memory (DRAM)
  – the miss penalty goes down if the data is found in the 2nd level cache

• Example (page 576):
  – CPI of 1.0 on a 500 MHz machine with a 5% miss rate and 200ns DRAM access
  – add a 2nd level cache with 20ns access time that lowers the miss rate to main memory to 2%
  – miss penalty (L1 only) = 200ns / clock period = 100 cycles
  – CPI (L1 only) = base CPI + lost cycles = 1 + 5% * 100 = 6
  – miss penalty (to L2) = 20ns / clock period = 10 cycles
  – CPI (L1 and L2) = 1 + L1 stalls + L2 stalls = 1 + 5% * 10 + 2% * 100 = 3.5
  – system speedup with L2 = 6.0 / 3.5 = 1.7

• Using multilevel caches:
  – try to optimize the hit time on the 1st level cache
  – try to optimize the miss rate on the 2nd level cache
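A small sketch of the two-level CPI calculation from the example above (the values are the slide's: 500 MHz clock, 5% L1 miss rate, 200ns DRAM, 20ns L2, 2% global miss rate):

    # Two-level cache CPI: a 500 MHz clock gives a 2 ns period.
    CLOCK_NS = 2.0

    dram_penalty = 200 / CLOCK_NS        # 100 cycles
    l2_penalty = 20 / CLOCK_NS           # 10 cycles

    cpi_l1_only = 1.0 + 0.05 * dram_penalty                       # 6.0
    cpi_l1_l2  = 1.0 + 0.05 * l2_penalty + 0.02 * dram_penalty    # 3.5

    print(cpi_l1_only, cpi_l1_l2, round(cpi_l1_only / cpi_l1_l2, 2))   # 6.0 3.5 1.71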

Page 30:

Virtual Memory

• Main memory can act as a cache for the secondary storage (disk)

• Advantages:
  – illusion of having more physical memory
  – program relocation
  – protection

[Figure: virtual addresses mapped by address translation to physical addresses in main memory or to disk addresses]

Page 31:

Pages: virtual memory blocks

• Page faults: the data is not in memory, retrieve it from disk

– huge miss penalty, thus pages should be fairly large (e.g., 4KB)

– reducing page faults is important (LRU is worth the price)

– can handle the faults in software instead of hardware

– using write-through is too expensive so we use write-back

[Figure: a 32-bit virtual address (bits 31-12: virtual page number, bits 11-0: page offset) is translated to a physical address (bits 29-12: physical page number, bits 11-0: page offset)]
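A small sketch of the split in the figure above, assuming 4 KB pages and a plain dictionary standing in for the page table:

    # 4 KB pages -> 12-bit page offset; the remaining bits are the virtual page number.
    PAGE_BITS = 12
    PAGE_SIZE = 1 << PAGE_BITS

    def split(virtual_address):
        return virtual_address >> PAGE_BITS, virtual_address & (PAGE_SIZE - 1)

    def translate(virtual_address, page_table):
        vpn, offset = split(virtual_address)
        return (page_table[vpn] << PAGE_BITS) | offset   # physical page number + offset

    print(hex(translate(0x3ABC, {0x3: 0x12345})))        # 0x12345abc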

Page 32:

Page Tables

[Figure: page table indexed by the virtual page number; each entry has a valid bit and either a physical page number (valid = 1, page resident in physical memory) or a disk address (valid = 0, page held in disk storage)]

Page 33:

Page Tables

[Figure: the page table register points to the page table; the 20-bit virtual page number indexes the table, whose entries hold a valid bit and an 18-bit physical page number (if valid = 0 the page is not present in memory); the physical page number is concatenated with the 12-bit page offset to form the physical address]

Page 34:

Making Address Translation Fast

• A cache for address translations: translation lookaside buffer (TLB)

[Figure: the TLB caches page table entries - each TLB entry holds a valid bit, a tag (virtual page number) and the physical page address; on a TLB miss the full page table is consulted, whose entries point either to physical memory or to disk storage]

Typical values:
  – TLB size: 32 - 4,096 entries
  – Block size: 1 - 2 page table entries
  – Hit time: 0.5 - 1 clock cycle
  – Miss penalty: 10 - 30 clock cycles
  – Miss rate: 0.01% - 1%

Page 35:

TLBs and caches

[Figure: the 20-bit virtual page number is looked up in the TLB (valid, dirty, tag, physical page number); on a TLB hit, the 20-bit physical page number plus the 12-bit page offset form the physical address, which is split into a 16-bit physical address tag, a 14-bit cache index and a 2-bit byte offset for the cache lookup (Valid, Tag, Data), producing Cache hit and the 32-bit Data]

Page 36:

TLBs and caches

[Flowchart: processing a read or write through the TLB and the cache - starting from the virtual address, access the TLB; on a TLB miss, raise a TLB miss exception; on a TLB hit, form the physical address; for a read, try to read the data from the cache, delivering it to the CPU on a cache hit and stalling on a cache miss; for a write, check the write access bit, raising a write protection exception if it is off, otherwise write the data into the cache, update the tag, and put the data and the address into the write buffer]
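The flowchart can be read as the following rough sketch (a write-through cache with a write buffer; the dictionaries are simplified stand-ins for the real structures, not the DECStation 3100 hardware):

    # One memory access through the TLB and the cache, following the flowchart above.
    def memory_access(vaddr, tlb, cache, write_buffer, write=False, data=None):
        vpn, offset = vaddr >> 12, vaddr & 0xFFF
        if vpn not in tlb:
            raise LookupError("TLB miss exception")          # handled by the OS / hardware
        entry = tlb[vpn]
        paddr = (entry["ppn"] << 12) | offset                # physical address
        if not write:
            if paddr in cache:                               # cache hit: deliver the data
                return cache[paddr]
            raise LookupError("cache miss stall")            # fetch the block, then retry
        if not entry["write_access"]:
            raise PermissionError("write protection exception")
        cache[paddr] = data                                  # write into the cache,
        write_buffer.append((paddr, data))                   # and into the write buffer

    tlb = {0x3: {"ppn": 0x12345, "write_access": True}}
    print(memory_access(0x3ABC, tlb, {0x12345ABC: 42}, []))  # 42 (TLB hit, cache hit)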

Page 37:

TLB, Virtual memory and Cache

Cache | TLB  | Virtual memory | Possible? If so, under what circumstance?
Miss  | Hit  | Hit            | Possible, although the page table is never really checked if the TLB hits.
Hit   | Miss | Hit            | TLB misses, but the entry is found in the page table; after retry, the data is found in the cache.
Miss  | Miss | Hit            | TLB misses, but the entry is found in the page table; after retry, the data misses in the cache.
Miss  | Miss | Miss           | TLB misses and is followed by a page fault; after retry, the data must miss in the cache.
Miss  | Hit  | Miss           | Impossible: cannot have a translation in the TLB if the page is not present in memory.
Hit   | Hit  | Miss           | Impossible: cannot have a translation in the TLB if the page is not present in memory.
Hit   | Miss | Miss           | Impossible: data cannot be allowed in the cache if the page is not in memory.

Page 38:

Protection with Virtual Memory

• Support at least two modes

– user process

– operating system process (kernel, supervisor, executive)

• CPU state that a user process can read but not write: the page table and the TLB

– special instructions that are only available in supervisor mode

• Mechanisms whereby the CPU can go from user mode to supervisor mode, and vice versa

– user to supervisor: system call exception

– supervisor to user: return from exception (RFE)

• Note: the page tables reside in the operating system's address space

Page 39:

Handling Page Faults and TLB misses

• TLB miss (software or hardware).

– the page is present in memory, and we need only create the missing TLB entry.

– the page is not present in memory, and we need to transfer control to the operating system to deal with a page fault.

• Page fault (exception mechanism).

– The OS saves the entire state of the active process.

– EPC = virtual address of the faulting page.

– The OS must complete three steps:
  • look up the page table entry using the virtual address and find the location of the referenced page on disk
  • choose a physical page to replace; if the chosen page is dirty, it must be written out to disk before a new virtual page can be brought into this physical page
  • start a read to bring the referenced page from disk into the chosen physical page

Page 40:

Memory Hierarchies

• Where can a Block Be Placed?

Scheme name       | Number of sets                                   | Blocks per set
Direct mapped     | Number of blocks in the cache                    | 1
Set associative   | (Number of blocks in the cache) / Associativity  | Associativity (typically 2 - 8)
Fully associative | 1                                                | Number of blocks in the cache

Feature                 | Typical values for caches | Typical values for paged memory | Typical values for a TLB
Total size in blocks    | 1000 - 100,000            | 2000 - 250,000                  | 32 - 4,000
Total size in kilobytes | 8 - 8,000                 | 8,000 - 8,000,000               | 0.25 - 32
Block size in bytes     | 16 - 256                  | 4000 - 64,000                   | 4 - 32
Miss penalty in clocks  | 10 - 100                  | 1,000,000 - 10,000,000          | 10 - 100
Miss rate               | 0.1% - 10%                | 0.00001% - 0.0001%              | 0.01% - 2%

Page 41:

Memory Hierarchies

• How Is a Block Found?

• OBS.: In virtual memory systems

– Full associativity is beneficial, since misses are very expensive

– Full associativity allows software to use sophisticated replacement schemes that are designed to reduce the miss rate.

– The full map can be easily indexed with no extra hardware and no searching required

– The large page size means the page table size overhead is relatively small.

Associativity   | Location method                      | Comparisons required
Direct mapped   | Index                                | 1
Set associative | Index the set, search among elements | Degree of associativity
Full            | Search all cache entries             | Size of the cache
Full            | Separate lookup table                | 0

Page 42:

Memory Hierarchies

• Which Block Should Be Replaced on a Cache Miss?

– Random: candidate blocks are randomly selected, possibly using some hardware assistance.

– Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time.

Page 43:

Memory Hierarchies

• What Happens on a Write?

– Write-through
  • Misses are simpler and cheaper because they never require a block to be written back to the lower level.
  • It is easier to implement than write-back, although to be practical in a high-speed system, a write-through cache needs to use a write buffer.

– Write-back (copy-back)
  • Individual words can be written by the processor at the rate that the cache, rather than the memory, can accept them.
  • Multiple writes within a block require only one write to the lower level in the hierarchy.
  • When blocks are written back, the system can make effective use of a high-bandwidth transfer, since the entire block is written.

Page 44:

Modern Systems

• Very complicated memory systems:

Characteristic    | Intel Pentium Pro                         | PowerPC 604
Virtual address   | 32 bits                                   | 52 bits
Physical address  | 32 bits                                   | 32 bits
Page size         | 4 KB, 4 MB                                | 4 KB, selectable, and 256 MB
TLB organization  | A TLB for instructions and a TLB for data | A TLB for instructions and a TLB for data
                  | Both four-way set associative             | Both two-way set associative
                  | Pseudo-LRU replacement                    | LRU replacement
                  | Instruction TLB: 32 entries               | Instruction TLB: 128 entries
                  | Data TLB: 64 entries                      | Data TLB: 128 entries
                  | TLB misses handled in hardware            | TLB misses handled in hardware

Characteristic      | Intel Pentium Pro                 | PowerPC 604
Cache organization  | Split instruction and data caches | Split instruction and data caches
Cache size          | 8 KB each for instructions/data   | 16 KB each for instructions/data
Cache associativity | Four-way set associative          | Four-way set associative
Replacement         | Approximated LRU replacement      | LRU replacement
Block size          | 32 bytes                          | 32 bytes
Write policy        | Write-back                        | Write-back or write-through

Page 45:

• Processor speeds continue to increase very fast, much faster than either DRAM or disk access times

• Design challenge: dealing with this growing disparity

• Trends:

– synchronous SRAMs (provide a burst of data)

– redesign DRAM chips to provide higher bandwidth or processing

– restructure code to increase locality

– use prefetching (make cache visible to ISA)

Some Issues