
Chapter 5 Memory Hierarchy Design

5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance


5.9 Memory Technology

5.10 Virtual Memory

5.11 Protection and Examples of Virtual Memory

5.12 Crosscutting Issues: The Design of Memory Hierarchies

5.13 Putting It All Together: Alpha 21264 Memory Hierarchy

5.14 Another View: The Emotion Engine of the Sony PlayStation 2

5.15 Another View: The Sun Fire 6800 Server

5.16 Fallacies and Pitfalls

5.17 Concluding Remarks

5.18 Historical Perspective and References


The levels in a typical memory hierarchy in embedded, desktop, and server computers

An economical solution to the desire for unlimited fast memory is a memory hierarchy, which takes advantage of locality and of the cost-performance of memory technologies.


Starting with 1980 performance as a baseline, the gap in performance between memory and CPUs is plotted over time


5.2 Review of the ABCs of Caches

• Cache
• Virtual memory
• Memory stall cycles
• Direct mapped
• Valid bit
• Block address
• Write through
• Instruction cache
• Average memory access time
• Cache hit
• Page
• Miss penalty


Review of the ABCs of Caches

• Fully associative
• Dirty bit
• Block offset
• Write back
• Data cache
• Hit time
• Cache miss
• Page fault
• Miss rate
• N-way set associative
• Least-recently used
• Tag field


Review of the ABCs of Caches

• Write allocate

• Unified cache

• Misses per instruction

• Block

• Locality

• Set

• Random replacement

• Index field

• No-write allocate

• Write buffer

• Write stall


The typical levels in the hierarchy slow down and get larger as we move away from the CPU for a large workstation or small server


Cache performance review

Memory stall cycles = Number of misses × Miss penalty
                    = IC × (Misses / Instruction) × Miss penalty
                    = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty


Cache performance review

Memory stall clock cycles = IC × Reads per instruction × Read miss rate × Read miss penalty
                          + IC × Writes per instruction × Write miss rate × Write miss penalty

Simplified to a single miss rate and miss penalty:

Memory stall clock cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty


Cache miss rate and penalty


Cache miss rate and penalty


Measuring miss rate as misses per instruction

Misses / Instruction = (Miss rate × Memory accesses) / Instruction count
                     = Miss rate × (Memory accesses / Instruction)


Memory stall cycles


Four memory hierarchy questions

• Q1: Where can a block be placed in the upper level? (Block placement)

• Q2: How is a block found if it is in the upper level? (Block identification)

• Q3: Which block should be replaced on a miss? (Block replacement)

• Q4: What happens on a write? (Write strategy)


Q1: Where can a block be placed in the upper level? (Block placement)

• Direct mapped: only one place

• Fully associative: anywhere in the cache

• Set associative: a set is a group of blocks in the cache, and a block is first mapped onto a set:

(Block address) MOD (Number of sets in cache)
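A minimal sketch of this mapping in C (the helper name is ours, not from the text); direct mapped and fully associative fall out as the two extremes of the set count:

/* Set-associative placement: a block can go anywhere within its set. */
unsigned set_number(unsigned block_address, unsigned num_sets) {
    return block_address % num_sets;  /* (Block address) MOD (Number of sets) */
}

/* Direct mapped: num_sets == number of blocks (one block per set).
   Fully associative: num_sets == 1 (the whole cache is one set). */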


This example cache has eight block frames and memory has 32 blocks


The three portions of an address in a set-associative or direct-mapped cache

• The tag field is compared against it for a hit

• The index field selects the set

• The block offset selects the desired data from the block


Q3: Which block should be replaced on a cache miss?

• Random

• Least-recently used (LRU)

• First in, first out (FIFO)


Data cache misses per 1000 instructions comparing least-recently used, random, and first in, first out replacement for several sizes and associativities


Q4: What happens on a write?

• Two write policies:

  Write through (p. 401)

  Write back

• Two options on a write miss:

  Write allocate: the block is allocated on a write miss, followed by the write hit actions above.

  No-write allocate: this apparently unusual alternative leaves the cache unaffected by write misses. Instead, the block is modified only in the lower-level memory (main memory).
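A minimal sketch of how these policies combine on a write, assuming a hypothetical cache model (the stub functions merely stand in for real cache bookkeeping); write back is commonly paired with write allocate, and write through with no-write allocate:

#include <stdio.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy;
typedef enum { WRITE_ALLOCATE, NO_WRITE_ALLOCATE } miss_policy;

static void write_to_memory(void) { puts("write word to main memory"); }
static void fetch_block(void)     { puts("allocate block in the cache"); }
static void write_to_cache(void)  { puts("write word into the cache block"); }
static void set_dirty_bit(void)   { puts("mark the cache block dirty"); }

/* One write access under a given pair of policies. */
void cache_write(int hit, write_policy wp, miss_policy mp) {
    if (!hit && mp == NO_WRITE_ALLOCATE) {
        write_to_memory();          /* the miss does not affect the cache */
        return;
    }
    if (!hit)
        fetch_block();              /* write allocate: then act as on a hit */
    write_to_cache();
    if (wp == WRITE_THROUGH)
        write_to_memory();          /* keep the lower level up to date */
    else
        set_dirty_bit();            /* write back: defer the memory update */
}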


No-write allocate versus write allocate


The organization of the data cache in the Alpha 21264 microprocessor (two-way set associative)

48-bit virtual address, 41-bit physical address


Physical address

• Physical address = 38-bit block address + 6-bit block offset

• Example:

• 38-bit block address = 9-bit index + 29-bit tag

2^Index = Cache size / (Block size × Set associativity) = 65,536 / (64 × 2) = 512 = 2^9
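A small sketch of that index calculation for the 21264 data cache parameters (the helper function is ours):

#include <stdio.h>

/* 2^Index = Cache size / (Block size x Set associativity) */
static unsigned index_bits(unsigned cache_size, unsigned block_size,
                           unsigned assoc) {
    unsigned sets = cache_size / (block_size * assoc), bits = 0;
    while (sets > 1) { sets >>= 1; bits++; }  /* log2 of the set count */
    return bits;
}

int main(void) {
    /* 65,536 / (64 x 2) = 512 sets = 2^9, i.e., a 9-bit index. */
    printf("index bits = %u\n", index_bits(65536, 64, 2));
    return 0;
}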


Miss per 1000 instructions for instruction, data, and unified cache of different sizes


Cache performance: average memory access time

Average memory access time = Hit time + Miss rate × Miss penalty

Example: Hit time = 0.25–1.0 ns, Miss rate = 1%, Miss penalty = 75–100 clock cycles.
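A minimal sketch of the formula; the numbers below are illustrative, folding the slide's ballpark figures into one example measured in clock cycles:

#include <stdio.h>

/* Average memory access time = Hit time + Miss rate x Miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Illustrative: 1-cycle hit time, 1% miss rate, 100-cycle penalty. */
    printf("AMAT = %.2f clock cycles\n", amat(1.0, 0.01, 100.0));
    return 0;
}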


Unified cache versus separated cache


Unified cache or separated cache


Unified cache or separated cache

Instruction first


Average memory access time and processor performance

• Average memory access time depends on many factors:

  - Memory stalls

  - CPU (out-of-order execution), …

CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
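A minimal sketch combining this equation with the stall formula from the cache performance review (all parameter values below are hypothetical):

#include <stdio.h>

/* CPU time = (CPU execution clock cycles + Memory stall clock cycles)
              x Clock cycle time */
static double cpu_time(double ic, double cpi_execution,
                       double accesses_per_instr, double miss_rate,
                       double miss_penalty, double cycle_time) {
    double exec_cycles  = ic * cpi_execution;
    double stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty;
    return (exec_cycles + stall_cycles) * cycle_time;
}

int main(void) {
    /* Hypothetical: 1M instructions, CPI 1.0, 1.5 accesses/instruction,
       2% miss rate, 100-cycle miss penalty, 1 ns clock cycle. */
    printf("CPU time = %.3f ms\n",
           cpu_time(1e6, 1.0, 1.5, 0.02, 100.0, 1e-9) * 1e3);
    return 0;
}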


Performance impact by cache


Performance impact by cache

Without cache: CPI = 15. With cache: CPI = 4. With cache and no misses: CPI = 1.


Cache misses have a double-barreled impact on a CPU

• A lower CPI: the lower the CPI_execution, the higher the relative impact of a fixed number of cache miss clock cycles.

• A faster clock: when calculating CPI, the cache miss penalty is measured in CPU clock cycles per miss. Therefore, even if the memory hierarchies of two computers are identical, the CPU with the higher clock rate has a larger number of clock cycles per miss and hence a higher memory portion of CPI.


Direct mapped or two-way set associative



Miss penalty and out-of-order execution processors

• Redefine memory stalls to lead to a new definition of miss penalty as nonoverlapped latency.

• Miss latency:

  Length of memory latency

  Length of latency overlap

Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency − Overlapped miss latency)
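A minimal sketch of the redefined stall formula (the latency values are hypothetical):

#include <stdio.h>

/* Memory stall cycles per instruction =
   (Misses / Instruction) x (Total miss latency - Overlapped miss latency) */
static double stalls_per_instr(double misses_per_instr,
                               double total_latency,
                               double overlapped_latency) {
    return misses_per_instr * (total_latency - overlapped_latency);
}

int main(void) {
    /* Hypothetical: 0.02 misses/instruction, 100-cycle total miss latency,
       30 cycles hidden by out-of-order execution. */
    printf("%.2f stall cycles per instruction\n",
           stalls_per_instr(0.02, 100.0, 30.0));
    return 0;
}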


Out-of-order execution with a direct-mapped cache



Summary of performance equations in this chapter


We organize 17 cache optimizations into four categories

• Reducing the miss penalty (Section 5.4):multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches.

• Reducing the miss rate (Section 5.5): larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations.

• Reducing the miss penalty or miss rate via parallelism (Section 5.6): nonblocking caches, hardware prefetching, and compiler prefetching.

• Reducing the time to hit in the cache (Section 5.7): small and simple caches, avoiding address translation, pipelined cache access, and trace caches.


5.4 Reducing cache miss penalty

• First miss penalty reduction technique: multilevel caches

• Multiple miss rates:

  Local miss rate: Miss rate_L1 or Miss rate_L2

  Global miss rate: Miss rate_L1 for level 1, or Miss rate_L1 × Miss rate_L2 for level 2.

Average memory access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1

Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2

Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
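A small sketch of the combined two-level formula (the rates below are local miss rates, and the numbers are illustrative):

#include <stdio.h>

/* Average memory access time =
   Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2 x Miss penalty_L2) */
static double amat_two_level(double hit_l1, double mr_l1, double hit_l2,
                             double mr_l2, double penalty_l2) {
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2);
}

int main(void) {
    /* Illustrative: 1-cycle L1 hit, 4% local L1 miss rate, 10-cycle L2 hit,
       50% local L2 miss rate, 100-cycle L2 miss penalty.
       The global L2 miss rate here is 4% x 50% = 2%. */
    printf("AMAT = %.2f clock cycles\n",
           amat_two_level(1.0, 0.04, 10.0, 0.5, 100.0));
    return 0;
}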


The first-level and the second-level caches


The first-level and the second-level caches


Miss rates versus cache size for multilevel caches (second-level miss rate)


Relative execution time by second-level cache size


Two major questions for the design of the second-level cache

• Will it lower the average memory access time portion of the CPI, and how much does it cost?

• The second-level caches should be much bigger than the first.

• Multilevel inclusion is the natural policy for memory hierarchies: L1 data are always present in L2.

• Multilevel exclusion: L1 data are never found in an L2 cache (prevents wasting space in the L2 cache)


The impact of second-level cache associativity on its miss penalty


Second miss penalty reduction technique: critical word first and early restart

• This strategy is based on impatience: don't wait for the full block to be loaded before sending the requested word and restarting the CPU.

Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first.

Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.


The average miss penalty


Third miss penalty reduction technique: giving priority to read misses over writes

• Fourth miss penalty reduction technique: merging write buffer (Fig. 5.12)

• Fifth miss penalty reduction technique: victim caches (Fig. 5.13)

• Summary of miss penalty reduction techniques: the principle of locality, retrieving the requested word early, giving priority to read misses over writes, and victim caches


R2=R3?


To illustrate write merging, the write buffer on top does not use it, while the write buffer on the bottom does: the writes are merged into a single buffer entry.


Placement of victim cache in the memory hierarchy


5.5 Reducing miss rate

• Three simple categories of cache misses:

  Compulsory: cold-start misses or first-reference misses

  Capacity: the cache cannot contain all the blocks needed

  Conflict: collision misses or interference misses

• Miss rate generally falls as associativity rises: eight-way, four-way, two-way, one-way


Total miss rate for each size cache and percentage of each according to the three C's


Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Fig. 5.14


Miss rate versus block size for five different-sized caches


First miss rate reduction technique: Larger block size

• Larger blocks reduce compulsory misses, taking advantage of spatial locality.

• Larger blocks increase the miss penalty (a larger block takes longer to transfer).

• Larger blocks can increase the miss rate (they reduce the number of blocks in the cache).


Actual miss rate versus block size for five different-sized caches in Fig. 5.16


Average memory access time


Average memory access time versus block size for five different-sized caches in Fig. 5.16


Second miss rate reduction technique: larger caches

• The size of second- or third-level caches in 2001 equals the size of main memory in desktop computers in 1991.

Third miss rate reduction technique: higher associativity

Two general rules of thumb:

• Eight-way set associative is, for practical purposes, as effective as fully associative.

• A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.


1-way, 2-way, 4-way and 8-way


1-way, 2-way, 4-way and 8-way


Average memory access time using miss rate in Fig. 5.14 for parameters in the example


Fourth miss rate reduction technique: way prediction and pseudoassociative caches

• Way prediction saves pipeline stages in more than 85% of the instruction fetches.

• Pseudoassociative or column associative caches:

  On a miss, a second cache entry is checked before going to the next level.

  A simple way is to invert the most significant bit of the index field to find the other block in the "pseudoset".


Relationship between regular hit time, pseudohit, and miss penalty


Fifth miss rate reduction technique: compiler optimizations

• Reordering instructions can reduce miss rates by up to 50%.

• Long cache blocks

• Improve spatial and temporal locality: loop interchange, blocking

/* Before */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After (loops interchanged so x is swept in row-major order) */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];


Blocking

• Storing the array row by row (row-major order) or column by column (column-major order) does not solve the problem, because both rows and columns are used in every iteration of the loop. Such orthogonal accesses mean that transformations such as loop interchange are not helpful.

• Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks.

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

In the worst case, there would be 2N³ + N² memory words accessed for N³ operations.


Blocking factor

• The original code is changed to compute on a submatrix of size B by B.

/* After */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

In the worst case, there would be 2N³/B + N² memory words accessed for N³ operations.


A snapshot of the three arrays x, y, and z when i=1


The age of access to the arrays x, y, and z


5.6 Reducing cache miss penalty or miss rate via parallelism

• First miss penalty/rate reduction technique: nonblocking caches to reduce stalls on cache misses

  - The CPU can continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data: a nonblocking cache or lockup-free cache.

  - "Hit under 1 miss", or "hit under multiple misses", …


Two-way set associative


Ratio of the average memory stall time for a blocking cache to hit-under-miss schemes as the number of outstanding misses is varied for 18 SPEC92 programs


Second miss penalty/rate reduction technique: hardware prefetching of instructions and data

• Instruction prefetch is frequently done in hardware outside of the cache.

• The processor fetches two blocks on a miss: the requested block and the consecutive block (to the instruction stream buffer).

• A stream buffer would catch 15% to 25% of the misses from a 4KB direct-mapped instruction cache with 16-byte blocks.

• Eight stream buffers can capture 50% to 70% of all misses from a processor with two 64 KB four-way set-associative caches.


Average access time, cache size, miss rate


Average memory access time


Third miss penalty/rate reduction technique: compiler-controlled prefetching

• The compiler inserts prefetch instructions to request the data before they are needed.

Two flavors of prefetch:

* Register prefetch will load the value into a register.

* Cache prefetch loads data only into the cache and not the register.

• Prefetches are faulting or nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations.

• Non-faulting prefetches simply turn into no-ops if they would normally result in an exception.


Compiler-controlled prefetching

• The most effective prefetch is “semantically invisible” to a program.

It doesn’t change the contents of registers and memory.

It cannot cause virtual memory faults.

• The goal of prefetching is to overlap execution with the prefetching of data.


Data cache misses


Data cache misses


The time saved


5.7 Reducing Hit Time

• First hit time reduction technique: small and simple caches

  * Smaller hardware is faster, and a small cache certainly helps the hit time.

  * Keep the cache simple.


Access times (hit time) as size and associativity vary in a CMOS cache


Second hit time reduction technique: Avoiding address translation during indexing of the cache

• Physical address or virtual address?

• Two tasks: indexing the cache and comparing addresses.

• Process switching: the virtual addresses now refer to different physical addresses, so the cache must be flushed. One solution: use a process-identifier tag (PID) and recycle PIDs instead of flushing the cache.


Synonyms or aliases

• Synonyms or aliases: two different virtual addresses map to the same physical address.

  * Hardware solutions guarantee every cache block a unique physical address.

  * Software solution: page coloring is simply set-associative mapping applied to virtual memory. The 4 KB (2^12) pages are mapped using 64 (2^6) sets to ensure that the physical and virtual addresses match in the last 18 bits. This restriction means a direct-mapped cache that is 2^18 (256K) bytes or smaller can never have duplicate physical addresses for blocks.
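A minimal sketch of the page-coloring constraint just described (the function name is ours):

#include <stdint.h>

/* Page coloring: the OS picks page frames so that virtual and physical
   addresses agree in their low 18 bits (12-bit page offset + 6-bit color). */
#define COLOR_MASK ((1u << 18) - 1)

int coloring_ok(uint64_t virtual_addr, uint64_t physical_addr) {
    return (virtual_addr & COLOR_MASK) == (physical_addr & COLOR_MASK);
}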


Miss rate versus virtually addressed cache size of a program measured three ways: without process switches (uniprocess), with process switches using a process-identifier tag (PID), and with process switches but without PIDs (purge)


Third hit time reduction technique: pipelined cache access

• The pipeline for the Pentium takes 1 clock cycle to access the instruction cache.

Fourth hit time reduction technique: trace caches

• A trace cache finds a dynamic sequence of instructions including taken branches to load into a cache block.


Summary of cache optimizations showing impact on cache performance and complexity for the techniques in Sections 5.4–5.7


5.8 Main memory and organizations for improving performance

• Cache, main memory, I/O

• Memory bandwidth, memory speed, new organizations

• Assume the performance of the basic memory organization is

  4 clock cycles to send the address,

  56 clock cycles for the access time per word, and

  4 clock cycles to send a word of data.

Given a cache block of 4 words, and that a word is 8 bytes, the miss penalty is 4 × (4 + 56 + 4) = 256 clock cycles, with a memory bandwidth of 1/8 byte (32/256) per clock cycle.


First technique for higher bandwidth: wider main memory

• Doubling or quadrupling the width of the cache and memory doubles or quadruples the memory bandwidth.

• With a main memory width of 2 words, the miss penalty in our example would drop from 4 × 64 to 2 × 64 clock cycles.

• A wider cache and memory raise cost and complicate error correction.


Second technique for higher bandwidth: simple interleaved memory

• Number of memory banks enables parallelism.

• Miss penalty of Fig. 5.27(c): 4 + 56 + (4 × 4) = 76 clock cycles, for a bandwidth of 4 × 2 × 4 bytes / 76 CC ≈ 0.4 bytes/CC.

• The mapping of addresses to banks affects the behavior of the memory system.

• The addresses of the four banks are interleaved:

  Bank 0 has all words whose address modulo 4 is 0.

  Bank 1 has all words whose address modulo 4 is 1.
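A minimal sketch of this four-way interleaved mapping at word granularity (function names are ours):

/* Word address -> (bank, address within bank) under four-way interleaving. */
unsigned bank_of(unsigned word_addr)      { return word_addr % 4; }
unsigned addr_in_bank(unsigned word_addr) { return word_addr / 4; }

/* Sequential words 0, 1, 2, 3 land in banks 0, 1, 2, 3, so a four-word
   block can be read from all four banks in parallel. */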


Three examples of bus width, memory width, and memory interleaving to achieve higher memory bandwidth


Four-way interleaved memory


What can interleaving and wide memory buy?


What can interleaving and wide memory buy?


How many banks should be included?

• Number of banks >= Number of clock cycles to access word in bank

• As the capacity per memory chip increases, there are fewer chips in the same-sized memory system, making multiple banks much more expensive.

• A second disadvantage of memory banks is again the difficulty of main memory expansion.


Third technique for higher bandwidth: independent memory banks

• From interleaving for sequential accesses to multiple independent accesses.

• Cache read in bank i, cache write in bank j.

• Multiple cache accesses.

• Nonblocking caches allow the CPU to proceed beyond a cache miss.

• Multiprocessors sharing a common memory provide further motivation for memory banks.


5.9 Memory technology

• Memory latency: Access time and cycle time

Access time is the time between when a read is requested and when the desired word arrives, while cycle time is the minimum time between requests to memory.

• One reason that cycle time is greater than access time is that the memory needs the address lines to be stable between accesses.


DRAM Technology

• DRAM capacity outgrew the number of address pins, so the address lines are multiplexed: row access strobe (RAS) and column access strobe (CAS).

• DRAMs must refresh their contents, so they are occasionally unavailable.

• Refresh time ≤ 5% of the total time.

• Fourfold improvement in capacity every three years.

• About 5% improvement per year in row access time.


Internal organization of a 64 Mbit DRAM


Times of fast and slow DRAMs with each generation


SRAM technology

• SRAMs need no refresh.

• Six transistors per bit.

• SRAM designs are concerned with speed and capacity.

• The capacity of DRAMs is roughly 4 to 8 times that of SRAMs. The cycle time of SRAMs is 8 to 16 times faster than DRAMs, but they are also 8 to 16 times as expensive.


Embedded processor memory technology: ROM and flash

• ROM for the embedded program and the constants

• Flash memory allows the embedded device to alter nonvolatile memory after the system is manufactured.

• Flash memory allows reading at almost DRAM speed, but writing flash is 10 to 100 times slower.


Improving memory performance in a standard DRAM chip

• Improving the bandwidth of DRAM:

  1. Fast page mode: allows repeated accesses to the row buffer without another row access time.

  2. Synchronous DRAM (SDRAM).

  3. Transfer data on both the rising edge and falling edge of the DRAM clock signal, doubling the peak data rate.

These innovations add little cost to the system while achieving a significant improvement in bandwidth.


Improving memory performance via a new DRAM interface: RAMBUS

• A high-speed interface between DRAM and CPU.

• The first-generation RAMBUS interface (RDRAM) dropped RAS/CAS, replacing it with a bus that allows other accesses over the bus between the sending of the address and the return of the data. Such a bus is called a packet-switched bus or split-transaction bus.


The second-generation RAMBUS interface

• Direct RDRAM (DRDRAM): 1.6 GB/sec.

• This interface includes separate row and column command buses instead of the conventional multiplexing; an 18-bit data bus; an expansion from 4 to 16 internal banks per RDRAM to reduce bank conflicts; an increase in the number of row buffers from 4 to 8; a clock increase to 400 MHz; and a much more sophisticated controller on chip.

• Because of the separation of the data, row, and column buses, three transactions can be performed simultaneously.


5.10 Virtual Memory

• Virtual memory divides physical memory into blocks and allocates them to different processes.

• A protection scheme restricts a process to the blocks belonging only to that process.

• Virtual memory was invented to relieve programmers of program block allocation; it automatically manages the two levels of the memory hierarchy represented by main memory and secondary storage.


5.10 Virtual memory

• Relocation allows the same program to run in any location in physical memory.

• Page or segment is used for block, and page fault or address fault is used for miss

• The CPU produces virtual addresses that are translated by a combination of hardware and software into physical addresses, which access main memory.

• This process is called memory mapping or address translation.


The logical program in its contiguous virtual address space is shown on the left


There are further differences between caches and virtual memory beyond those quantitative ones mentioned in Fig. 5.32:

• Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the operating system.

• The size of the processor address determines the size of virtual memory, but the cache size is independent of the processor address size.

• Main memory is backed by secondary storage.

• Virtual memory systems use either pages, fixed-size blocks (4096 to 65,536 bytes), or segments, variable-size blocks (1 byte up to 2^16 to 2^32 bytes).

• Paged addressing has a single fixed-size address divided into page number and offset within a page, analogous to cache addressing.

• Segmented addressing: the variable size of segments requires 1 word for a segment number and 1 word for an offset within a segment.


Typical ranges of parameters for caches and virtual memory


Example of how paging and segmentation divide a program


Paging and segmentation


Four memory hierarchy questions revisited

• Q1: Where can a block be placed in main memory? Operating systems allow blocks to be placed anywhere in main memory.

• Q2: How is a block found if it is in main memory? (Fig. 5.35) Both paging and segmentation rely on a data structure that is indexed by the page or segment number.

• Q3: Which block should be replaced on a virtual memory miss? Almost all operating systems try to replace the least-recently used (LRU) block, because the past is a good predictor of the future.

• Q4: What happens on a write? The write strategy is always write back.


The mapping of a virtual address to a physical address via a page table


Techniques for fast address translation

• Paging requires two memory accesses:

  1. One memory access to obtain the physical address.

  2. A second access to get the data.

• A general solution is to again rely on the principle of locality: keep address translations in a special cache.

• This special cache is called a translation lookaside buffer (TLB).

• A TLB entry is like a cache entry where the tag holds portions of the virtual address and the data portion holds a physical page frame number, a protection field, a valid bit, and usually a use bit and a dirty bit (sketched below).

• The OS manages the TLB.
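A minimal sketch of a TLB entry and a fully associative lookup built from the fields listed above (the struct layout is ours; the 128-entry size matches the Alpha TLB discussed on the following slides):

#include <stdint.h>

typedef struct {
    uint64_t virtual_page;   /* tag: portion of the virtual address */
    uint64_t page_frame;     /* physical page frame number */
    unsigned valid : 1, use : 1, dirty : 1;
    unsigned protection;     /* protection field */
} tlb_entry;

#define TLB_SIZE 128

/* Fully associative lookup: compare the tag of every entry. */
int tlb_lookup(tlb_entry tlb[], uint64_t vpage, uint64_t *frame) {
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].virtual_page == vpage) {
            tlb[i].use = 1;              /* record the use for replacement */
            *frame = tlb[i].page_frame;
            return 1;                    /* TLB hit */
        }
    return 0;                            /* TLB miss: walk the page table */
}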


Fig. 5.36

• The TLB uses fully associative placement; thus, the translation begins (steps 1 and 2) by sending the virtual address to all tags, and the type of memory access is checked for a violation (also in step 2) against protection information in the TLB.

• The address space number (ASN) plays the same role as the PID.

• Step 3: a 128:1 multiplexor. Step 4: page frame (31 or 28 bits) and page offset (13 bits).


Operation of the Alpha 21264 data TLB during address translation


Selecting a page size

• The following favor a larger page size:

  1. The size of the page table is inversely proportional to the page size; memory can therefore be saved by making the pages bigger (see the sketch after this list).

  2. A larger page size can allow larger caches with fast cache hit times.

  3. Transferring larger pages to or from secondary storage, possibly over a network, is more efficient than transferring smaller pages.

  4. The number of TLB entries is restricted, so a larger page size means that more memory can be mapped efficiently, thereby reducing the number of TLB misses.

• The main motivation for a smaller page size is conserving storage. A small page size results in less wasted storage when a contiguous region of virtual memory is not equal in size to a multiple of the page size.
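A small worked sketch of point 1, assuming a hypothetical 32-bit virtual address space and 4-byte page table entries:

#include <stdio.h>

/* Page table entries = Virtual address space / Page size. */
int main(void) {
    unsigned long long vspace = 1ULL << 32;  /* 32-bit virtual space */
    for (unsigned long long page = 4096; page <= 65536; page *= 4)
        printf("page %6llu B -> %7llu PTEs (%4llu KB table at 4 B/PTE)\n",
               page, vspace / page, vspace / page * 4 / 1024);
    return 0;
}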


The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access


5.11 Protection and examples of virtual memory

• Multiprogramming, in which a computer is shared by several programs running concurrently, led to new demands for protection and sharing among programs.

• Process, a running program plus any state needed to continue running it.

• Process exchange is called a process switch or context switch.

• The operating system designer must guarantee that processes do not interfere with each other's computation.

• The operating system partitions main memory so that several different processes have their state in memory at the same time.

• The operating system allows communication between processes, or saves memory by reducing the number of copies of identical information.


Protecting processes

• The simplest protection mechanism is a pair of registers that checks every address to be sure that it falls between the two limits, base and bound. An address is valid if

  Base <= Address <= Bound  or  (Base + Address) <= Bound

(a sketch of this check follows the list below).

• The computer designer has three more responsibilities in helping the operating system designer protect processes from each other:

  1. Provide at least two modes, distinguishing a user process from an operating system process.

  2. Provide a portion of the CPU state that a user process can use but not write.

  3. Provide mechanisms whereby the CPU can go from user mode to supervisor mode and vice versa.
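A minimal sketch of the base-and-bound check in both forms (function names are ours):

#include <stdint.h>

/* Valid if Base <= Address <= Bound. */
int valid_absolute(uint64_t addr, uint64_t base, uint64_t bound) {
    return base <= addr && addr <= bound;
}

/* Or, treating Address as an offset: valid if (Base + Address) <= Bound. */
int valid_offset(uint64_t addr, uint64_t base, uint64_t bound) {
    return base + addr <= bound;  /* assumes no overflow in base + addr */
}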


Protection mechanism

• The simplest way of doing protection is to add permission flags to each page and segment.

• Since few programs today intentionally modify their own code, an operating system can detect accidental writes to code by offering read-only protection to pages.

• Processes are thus protected from one another by having their own page tables, each pointing to distinct pages of memory.


Protection mechanism

• Rings added to the protection structure expand memory access protection from two levels (user and kernel) to many more.

• A program can’t unlock access to the data unless it has the key.


A paged virtual memory example:The Alpha memory management and the 21264 TLB

• The Alpha architecture uses a combination of segmentation and paging, providing protection while minimizing page table size.

• This combination provides many advantages: segmentation divides the address space and conserves page table space, while paging provides virtual memory, relocation, and protection.

• 64-bit virtual address space; 41-bit physical address space.


The organization of seg0 and seg1 in the Alpha


The mapping of an Alpha virtual address


Five protection fields in each entry of the page table

• Valid - Says that the page frame number is valid for hardware translation

• User read enable- Allows user programs to read data within this page

• Kernel read enable-Allows the kernel to read data within this page

• User write enable-Allows user programs to write data within this page

• Kernel write enable-Allows the kernel to write data within this page


The maximum virtual address and physical address

• The maximum virtual address and physical address are tied to the page size.

• The page size grows from 8 KB to 64 KB. With 64 KB pages, the maximum virtual address is 3 × 13 + 16 = 55 bits, and the maximum physical address is 32 + 16 = 48 bits.

• Memory management in the Alpha 21264 relies on page-level address translation and correct operation of the operating system to provide safety to multiple processes sharing the computer.


Memory hierarchy parameters of the Alpha 21264. TLB: translation lookaside buffer; PTE: page table entry.


A segmented virtual memory example: Protection in the Intel Pentium

• The original 8086 used segments for addressing, with no virtual memory or protection mechanism.

• The "successors" to the 8086 (IA-32) added a larger address space and a protection scheme to avoid security loopholes.

• Doubling the traditional two-level scheme, the Pentium has four levels of protection.

• The IA-32 divides the address space, allowing both the operating system and the user access to the full space.

• The IA-32 allows the operating system to maintain the protection level of the called routine for the parameters that are passed to it.


IA-32 has a page table entry (PTE), or a segment descriptor

• The fields found in PTE

1. Present bit-Equivalent to the PTE valid bit, used to indicate this is a valid translation

2. Base field-Equivalent to a page frame address, containing the physical address of the first byte of the segment.

3. Access bit-Like the reference bit or use bit in some architecture that is helpful for replacement algorithms.

4. Attributes field - Specifies the valid operations and protection levels for operations that use this segment.


Reference: Walter A. Triebel and Avtar Singh, "The 8088 and 8086 Microprocessors: Programming, Interfacing, Software, Hardware, and Applications", Prentice-Hall, Fourth Edition, 2003.

• DPL: descriptor privilege level

• Privilege check by the CPU

• Conforming: when a service is initiated, the current privilege level (CPL) may change. This depends on whether the software that was interrupted was in a code segment configured as conforming or nonconforming. If the interrupted code is in a conforming code segment, the CPL does not change when the service routine is initiated.


The IA-32 segment descriptors are distinguished by bits in the attributes field


Adding sharing and protection

• Half of the address space is shared by all processes and half is unique to each process, called the global address space and local address space, respectively.

• Writing data: a descriptor determines whether the attempted action is allowed.

• Privilege levels are compared between the caller and the callee.

• Adding safe calls from user to OS gates and inheriting protection level for parameters.

• To restrict entry into others’ code, the IA-32 provides a special segment descriptor, or call gate, identified by a bit in the attributes field.

• The call gate is to prevent the user from randomly jumping anywhere into a protected or more privileged code segment.


5.12 Crosscutting issues: the design of memory hierarchies

• Superscalar CPUs and the number of ports to the cache.

  Parallelism: multiple instructions can be issued within a single clock cycle.

• Speculative execution and the memory system.

• To satisfy the demands for instruction-level parallelism and clock rate, the instruction cache and the first part of instruction execution are increasingly merged.

• Embedded computers have bigger instruction caches.

• Embedded computers are often placed in real-time environments where a set of tasks must be completed every time period: a portion of the cache acts like a small scratch-pad memory under program control.


I/O and consistency of cached data

• There is little danger in the CPU seeing the old or stale copy.

• I/O devices give the opportunity for other devices to cause copies to be inconsistent, or for other devices to read stale copies: the cache coherency problem.

• I/O competing with the CPU for cache access will cause the CPU to stall for I/O.

• If a write-through cache were used, then memory would have an up-to-date copy of the information, and there would be no stale-data issue for output.


I/O and consistency of cached data

• Input requires some extra work. The software solution is to guarantee that no blocks of the I/O buffer designated for input are in the cache.

• One approach: a buffer page is marked as noncacheable; the operating system always inputs to such a page.

• Another approach: the operating system flushes the buffer addresses from the cache before the input occurs.

• A hardware solution: check the I/O addresses on input to see if they are in the cache (checking of I/O addresses in parallel with processor cache accesses).

• Cache coherency protocols:Maintain coherency for multiple processors
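A toy simulation of the stale-input hazard and the flush-based fix mentioned above; the dict-based cache and the address are purely illustrative.

```python
# If a device writes memory behind the cache's back, the CPU can read
# a stale copy unless the buffer addresses are flushed (or the page is
# marked noncacheable) before the input occurs.

memory = {0x100: "old"}
cache = {0x100: "old"}        # the CPU read address 0x100 earlier

def dma_input(addr, value):
    memory[addr] = value      # device writes memory, bypassing the cache

def cpu_read(addr):
    if addr not in cache:
        cache[addr] = memory[addr]   # miss: fetch from memory
    return cache[addr]

dma_input(0x100, "new")
assert cpu_read(0x100) == "old"      # stale data!

cache.pop(0x100, None)               # OS flushes the buffer address
dma_input(0x100, "newer")
assert cpu_read(0x100) == "newer"    # coherent again
```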

Page 140:

The cache coherency problem

Page 141:

5.13 Putting it all together: Alpha 21264 memory hierarchy

• The 21264 is an out-of-order execution processor that fetches up to four instructions per clock cycle and executes up to six instructions per clock cycle.

• Virtual addresses are 48 or 43 bits; physical addresses are 44 or 41 bits.

• On power-up, the chip loads the instruction cache (16K instructions) serially from an external PROM.

Page 142:

5.13 Putting it all together: Alpha 21264 memory hierarchy

• The preloaded instructions execute in privileged architecture library (PAL) mode.

• The software executed in PAL mode consists simply of machine language routines with some implementation-specific extensions to allow access to low-level hardware, such as the TLB.

• One of the first steps is to update the instruction TLB with valid page table entries (PTEs) for this process. Kernel code updates the appropriate page table entry (in memory) for each page to be mapped. A miss in the TLB is handled by PAL code, since code that relies on the TLB cannot be allowed to change the TLB (a refill sketch follows).
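The refill flow just described can be sketched as follows; the dict-based page table and TLB stand in for the real structures, and the handler plays the role of PAL code, which does not itself depend on the TLB.

```python
# Simplified TLB refill: on a miss, a handler copies the valid PTE
# from the in-memory page table into the TLB.

page_table = {0x1000: ("frame7", True)}   # vpn -> (frame, valid bit)
tlb = {}

def translate(vpn):
    if vpn in tlb:
        return tlb[vpn]                    # TLB hit
    frame, valid = page_table.get(vpn, (None, False))
    if not valid:
        raise RuntimeError("page fault: OS must map the page first")
    tlb[vpn] = frame                       # PAL-style refill
    return frame

print(translate(0x1000))   # miss, refilled from the page table
print(translate(0x1000))   # hit
```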

Page 143:

5.13 Putting it all together: Alpha 21264 memory hierarchy

• Once the operating system is ready to begin executing a user process, it sets the PC to the appropriate address in segment seg0

• First, a 12-bit address is sent to the 64 KB instruction cache, along with a 35-bit page number. An 8-bit address space number (ASN) is also sent, for the same purpose as ASNs in the TLB (step 1).

• The instruction cache is virtually indexed and virtually tagged, so instruction TLB translations are only required on cache misses.

Page 144:

5.13 Putting it all together: Alpha 21264 memory hierarchy

• To reduce latency, the instruction cache includes two mechanisms to begin early access of the next block: way prediction and line prediction.

• Way prediction: the way-predicting cache relies on a 1-bit field for every 16 bytes to predict which of the two ways will be used next (see the sketch after this list).

• Line prediction: predict the next sequential group on a cache miss.

• Step 2: The tag field of the PC is compared to the address from the tag portion of the cache, and the 8-bit process ASN is compared to the tag ASN field.
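Here is a sketch of the way-prediction idea for a toy two-way cache, where each set keeps a 1-bit guess of which way to probe first; the class and timing labels are illustrative assumptions, not the 21264's actual structure.

```python
# Toy two-way cache with way prediction: probe the predicted way first;
# a wrong guess costs an extra probe (and retrains the predictor).

import collections

class WayPredictedCache:
    def __init__(self):
        self.ways = [dict(), dict()]                  # tag store per way
        self.predict = collections.defaultdict(int)   # 1-bit guess per set

    def lookup(self, tag, index):
        guess = self.predict[index]
        if self.ways[guess].get(index) == tag:
            return "hit-fast"                # predicted way was right
        other = 1 - guess
        if self.ways[other].get(index) == tag:
            self.predict[index] = other      # retrain the predictor
            return "hit-slow"                # right data, extra cycle
        return "miss"

c = WayPredictedCache()
c.ways[1][2] = 0xABC
print(c.lookup(0xABC, 2))   # "hit-slow": initial guess (way 0) was wrong
print(c.lookup(0xABC, 2))   # "hit-fast": predictor has been retrained
```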

Page 145:

5.13 Putting it all together: Alpha 21264 memory hierarchy

• Step 3: The valid bit is also checked. If any field has the wrong value, it is a miss. On a hit in the instruction cache, the proper fetch block is supplied, and the next way and line prediction is loaded to read the next block.

• Step 4: An instruction cache miss causes a simultaneous check of the instruction TLB and the instruction prefetcher.

• Step 5: The fully associative TLB simultaneously searches all 128 entries to find a match between the address and a valid PTE.

• Step 6: The desired instruction address may be found in the instruction prefetcher.

Page 146:

5.13 Putting it all together: Alpha 21264 memory hierarchy

• Step 7: If so, the instruction block is eventually supplied directly by the prefetcher.

• Step 8: Otherwise, if there is no TLB exception, an access to the second-level cache is started.

• Step 9: The 35-bit block address (41-bit physical address minus the 6-bit block offset) is divided into an 18-bit tag and a 17-bit index (see the bit-slicing sketch after this list).

• Step 10: The cache controller reads the tag from that index; the access hits if the tag matches and the entry is valid.

• Step 11: On a hit, the critical 16 bytes are returned first.

• Step 12: At the same time, a request is made for the next sequential 64-byte block.
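Using the field widths from step 9 (6-bit block offset, 17-bit index, 18-bit tag over a 41-bit physical address), the slicing works out as below; the sample address is arbitrary.

```python
# Splitting a 41-bit physical address as in step 9:
# 41 = 18 (tag) + 17 (index) + 6 (block offset).

OFFSET_BITS, INDEX_BITS, TAG_BITS = 6, 17, 18
assert OFFSET_BITS + INDEX_BITS + TAG_BITS == 41

def split(paddr):
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    index = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split(0x1_2345_6789)   # arbitrary 41-bit address
print(hex(tag), hex(index), hex(offset))
```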

Page 147:

5.13 Putting it all together: Alpha 21264 memory hierarchy

• Step 13: The next sequential 64-byte block is loaded into the instruction prefetcher in the next 6 clock cycles

• Step 14: To save time, the prefetched instructions are passed around the CPU and then written to the instruction cache while the instructions execute in the CPU.

• Step 15: If the instruction is not found in the secondary cache, the physical address command is sent to the ES40 system chip set via four consecutive transfer cycles on a narrow, 15-bit outbound address bus.

• Step 16: The address and command use the address bus for 8 CPU cycles. The ES40 connects the microprocessor to memory via a crossbar to one of two 256-bit memory buses to service the request.

Page 148:

The overall picture of the Alpha 21264 memory hierarchy

Page 149:

Alpha 21264/21164 performance speedup versus miss rate for SPECint2000

Page 150:

Performance of the 21264 memory hierarchy

• 21264: Out-of-order execution; a memory stall for one instruction may be completely hidden by successful completion of later instructions.

• The higher the miss rate, the higher the speedup of the 21264 over the 21164 (in-order execution, higher miss rate).

• The speedup reflects the 21264's ability to continue to execute during cache misses that stall the 21164 but hit in the L2 cache of the 21264.

• The peak CPI of the 21264 is 0.25, or 4 instructions per clock cycle (see the worked example after this list).

• On SPEC95 the 21264 completes almost 2 instructions per clock cycle.

• Database applications: higher miss rate + higher branch misprediction.

• Servers may place heavier demands on the memory hierarchy than the SPEC benchmarks do.
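To connect these figures to the chapter's performance formula, here is a small worked example; the miss rate and miss penalty are hypothetical values chosen only to show the arithmetic.

```python
# Effective CPI = base CPI + misses per instruction x miss penalty.
# A peak CPI of 0.25 means 4 instructions per clock at best.

base_cpi = 0.25            # peak: 4 instructions per clock cycle
misses_per_instr = 0.02    # hypothetical: 20 misses per 1000 instructions
miss_penalty = 50          # hypothetical penalty in clock cycles

effective_cpi = base_cpi + misses_per_instr * miss_penalty
print(effective_cpi)       # 1.25: memory stalls dominate the peak rate
```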

Page 151:

CPI and misses per 1000 instructions for running a TPC-C-like database workload and the SPEC95 benchmarks on the Alpha 21264 in the Compaq ES40

Page 152:

5.14 Another view: The emotion engine of the Sony Playstation 2

• The operation of Playstation 2: The data are often a continuous stream.

• The steady stream of graphics and audio demanded by electronic games leads to a different approach to memory design.

• The style is high bandwidth via many dedicated independent memories.

• Fig. 5.46 shows that much smaller caches capture the misses for multimedia applications.

Page 153:

Three C’s for MPEG3 decode

Page 154:

Figure 5.47: The block diagram of the Sony Playstation 2

• The PlayStation 2 is a game machine, so there are interfaces for video, sound, and a DVD player.

• Two standard computer I/O buses, USB and IEEE 1394, a PCMCIA slot, and a modem.

• It includes a 34-MHz MIPS processor that also acts as the emulation computer to run games for earlier Sony PlayStations.

• It also connects to a standard PC audio card to provide the sound for the games.

Page 155:

Figure 5.47: The block diagram of the Sony Playstation 2

• The 21264 microprocessor in 0.25-micron technology is about 160 mm2 and uses 15.2M transistors.

• The Emotion Engine in 0.25-micron technology is about 225 mm2 and uses 13.5M transistors, and the Graphics Synthesizer is 279 mm2.

• Two-instruction issue per clock cycle.

• 128-bit SIMD instructions for multimedia applications.

• Vector Unit 0 (VPU0) is primarily a DSP-like coprocessor for the CPU.

• Vector Unit 1 (VPU1) has similar functions to VPU0, but it normally operates independently of the CPU.

Page 156:

Block diagram of the Sony Playstation 2

Page 157:

Figure 5.47: The block diagram of the Sony Playstation 2

• The PS2 uses two PC800 (400 MHz) DRDRAM chips over two channels, offering 32 MB of storage and a peak memory bandwidth of 3.2 GB/sec (see the bandwidth check after this list).

• The Graphics Synthesizer takes rendering commands from the Emotion Engine in what are commonly called display lists. These are lists of 32-bit commands that tell the renderer what shapes to use and where to place them, plus what colors and textures to fill them with.

• The Graphics Synthesizer contains the full video buffer and has a 2048-bit-wide interface so that pixel filling is not a bottleneck.

• “Separate” memories are dedicated to individual functions to inexpensively achieve greater memory bandwidth for the entire system.
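As a sanity check on the 3.2 GB/sec figure, two 16-bit Rambus channels transferring on both edges of a 400 MHz clock give exactly that; this back-of-the-envelope calculation assumes the standard PC800 DRDRAM parameters.

```python
# Peak bandwidth: 2 channels x 2 bytes wide x 400 MHz x 2 edges/clock.
channels, width_bytes, clock_hz, edges = 2, 2, 400e6, 2
print(channels * width_bytes * clock_hz * edges / 1e9, "GB/sec")  # 3.2
```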

Page 158:

Figure 5.47: The block diagram of the Sony Playstation 2

• A major insight shaped the design of the Emotion Engine: generally, in a racing car game there are foreground objects that are constantly changing and background objects that change less in reaction to events, although the background can be most of the screen. This observation led to a split of responsibilities.

• The CPU with VPU0 handles the foreground action; VPU1 handles the background.

• The Emotion Engine thus combines (1) the traditional 64-bit MIPS architecture, including a floating-point unit; (2) the MIPS architecture extended with multimedia instructions (VPU0); (3) an independent vector processor (VPU1); and (4) an image processing unit to accelerate MPEG decoding.

• Split function + dedicated memory (four memories).

Page 159:

Figure 5.47: The block diagram of the Sony Playstation 2

• The programmer organizes all memories as two double buffers, one pair for incoming DMA data and one pair for outgoing DMA data (see the double-buffering sketch after this list).

• The programmer next sets up the 10 DMA channels, taking care to meet the real-time deadline for realistic animation of 15 frames per second.

• Fig. 5.48: Serial, where CPU/VPU0 acts as a preprocessor on what to give VPU1 for it to create for the Graphics Interface, using the scratchpad memory as the buffer.

• Parallel, where both CPU/VPU0 and VPU1 create display lists. The display lists and the Graphics Synthesizer have multiple context identifiers to distinguish the parallel display lists, producing a coherent final image.

• Two dedicated buses: a 128-bit path between the CPU and VPU0 and a 128-bit path between VPU1 and the Graphics Interface. The programmer also chooses which bus to use when setting up the DMA channels.
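A sketch of the double-buffering pattern set up by the programmer: while the compute unit works on one buffer, DMA fills the other, and the two swap each step. The frame source and the stand-in functions are hypothetical.

```python
# Double buffering: overlap the DMA transfer of the next chunk with
# computation on the current one, then swap buffers.

def dma_fill(buf, chunk):
    buf[:] = chunk          # stands in for an asynchronous DMA transfer

def compute(buf):
    return sum(buf)         # stands in for VPU work on the staged data

def stream(chunks):
    buffers, results = [[], []], []
    dma_fill(buffers[0], chunks[0])            # prime the first buffer
    for i, chunk in enumerate(chunks[1:], 1):
        work, fill = buffers[(i - 1) % 2], buffers[i % 2]
        dma_fill(fill, chunk)                  # incoming buffer fills...
        results.append(compute(work))          # ...while we compute
    results.append(compute(buffers[(len(chunks) - 1) % 2]))
    return results

print(stream([[1, 2], [3, 4], [5, 6]]))        # [3, 7, 11]
```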

Page 160:

Two modes of using the Emotion Engine organization

Page 161:

5.15 Another view: The Sun Fire 6800 Server

• The Sun Fire 6800 is a midrange multiprocessor server with particular attention paid to the memory system.

• The emphasis of this server is cost-performance for commercial computing and for running database applications such as data warehousing and data mining.

• This server also includes special features to improve availability and maintainability.

• The miss rate is 0.5% for a 1 MB data cache.

• Multiprocessor → coherency misses (a fourth C).

Page 162:

Clock cycles per instruction for memory accesses versus off-chip cache size for a four-processor server

Page 163:

Technical summary of the Sun Fire 6800 server and UltraSPARC III microprocessor

Page 164:

Sun Fire 6800 server block diagram

Page 165:

Sun Fire 6800 Server

• The data switch yields a peak bandwidth to off-chip memory of 11 GB/sec.

• Error correction codes enable buses and memories to both detect and correct errors.

• The idea is to calculate and store parity over different subsets of the bits in the protected word (see the Hamming sketch after this list).

• When the parity does not match, it indicates an error.

• The Sun Fire ECC was also designed to detect any pair of bit errors, and also to detect whether a whole DRAM chip failed, turning all the bits of an 8-bit-wide chip to 0 → single error correcting/double error detecting.
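The "parity over different subsets" idea becomes concrete in a minimal Hamming(7,4) sketch, which corrects any single-bit error: each parity bit covers an overlapping subset of positions, and the recomputed parities (the syndrome) name the failing bit. The real Sun Fire code is much wider and additionally detects double-bit and whole-chip failures.

```python
# Hamming(7,4): 4 data bits, 3 parity bits, each parity bit covering a
# different subset of codeword positions (even parity).

def encode(data4):
    d = [(data4 >> i) & 1 for i in range(4)]     # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]                      # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                      # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                      # covers positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]  # positions 1..7

def decode(bits):
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)        # names the bad position
    bits = bits[:]
    if syndrome:
        bits[syndrome - 1] ^= 1                  # correct the single error
    d = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(d)), syndrome

word = encode(0b1011)
word[5] ^= 1                                     # inject a one-bit error
data, syndrome = decode(word)
assert data == 0b1011 and syndrome == 6          # corrected, position found
```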

Page 166:

Sun Fire 6800 Server

• Memory is connected directly to the processor to lower latency (the DRAM controller on the chip saves 80 ns).

• The UltraSPARC III includes the tags for the L2 cache on chip → saves 10 clock cycles off a miss.

• The on-chip caches are both four-way set associative (32 KB instruction, 64 KB data cache).

• To reduce the latency to the data cache, it combines an address adder with the word line decoder; this sum-addressed memory reduces latency from 3 to 2 clock cycles.

• The L1 data cache uses write through (no-write allocate) and the L2 cache uses write back (write allocate); both caches provide parity to detect errors (see the policy sketch after this list).
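A toy model of the policy split in the last bullet: writes go through the L1 without allocating, while the L2 allocates on writes and defers the memory update until eviction. The classes and addresses are illustrative only.

```python
# L1: write through, no-write allocate.  L2: write back, write allocate.

memory = {}

class L2:                                  # write back, write allocate
    def __init__(self):
        self.data, self.dirty = {}, set()
    def write(self, addr, val):
        self.data[addr] = val              # allocate the block on a write
        self.dirty.add(addr)               # defer the memory update
    def evict(self, addr):
        if addr in self.dirty:
            memory[addr] = self.data[addr] # write back now
            self.dirty.discard(addr)
        self.data.pop(addr, None)

class L1:                                  # write through, no-write allocate
    def __init__(self, l2):
        self.data, self.l2 = {}, l2
    def write(self, addr, val):
        if addr in self.data:
            self.data[addr] = val          # update only if already present
        self.l2.write(addr, val)           # always propagate downward

l2 = L2(); l1 = L1(l2)
l1.write(0x40, 7)
assert 0x40 not in l1.data and l2.data[0x40] == 7 and 0x40 not in memory
l2.evict(0x40)
assert memory[0x40] == 7                   # dirty data reached memory
```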

Page 167:

Instruction and data misses per 1000 instructions as cache size varies from 4 KB to 4096 KB

Page 168:

5.16 Fallacies and pitfalls

• Fallacy: Predicting cache performance of one program from another.

• Pitfall: Simulating enough instructions to get accurate performance measures of the memory hierarchy.

• Pitfall: Too small an address space.

• Pitfall: Emphasizing memory bandwidth in DRAMs versus memory latency.

• Pitfall: Delivering high memory bandwidth in a cache-based system.

• Pitfall: Ignoring the impact of the operating system on the performance of the memory hierarchy.

• Pitfall: Relying on the operating systems to change the page size over time.

Page 169:

Instruction misses per 1000 references for five inputs to the perl benchmark from SPEC2000

Page 170:

Comparison of prices of SDRAM versus DRDRAM in memory modules and in systems in 2000

Page 171:

Top 10 in memory bandwidth as measured by the copy portion of the STREAM benchmark [McCalpin 2001]

Page 172:

Misses and time spent in misses for applications and operating system

Page 173:

5.17 Concluding Remarks

• Balancing memory speed with CPU speed.

• Do 8 KB pages make sense with terabyte main memory?

• The design decisions at all these levels interact, and the architect must take a whole-system view to make wise decisions.

• New inventions: prefetching, cache-aware compilers, increasing page size.

• Balancing design time with debug time.

Page 174:

Desktop, embedded, and server microprocessors in 2001.

Page 175:

Page 176:

5.18 Historical Perspective and references

• A two-level memory hierarchy—1962

• Translation lookaside buffer—1978

• IBM 360: 32-bit addresses (1964)

• Measuring program behavior as memory traffic as well as miss rate—1968

• Spatial locality and temporal locality—1982

• Three C's model—1987

• The victim cache—1993

• Nonblocking caches—1981, 1998

• Multilevel inclusion property—1988

• Multilevel exclusion property—1994

Page 177:

5.18 Historical Perspective and references

• Prefetching via stream buffers—1990

• Stream buffers that work well with nonblocking loads and speculative execution for in-order processors—1995

• Out-of-order execution—1997

• The measurements of the SPEC2000 benchmarks collected by Cantin and Hill—2001