CPU Cache Prefetching

Timing Evaluations of Hardware Implementation

Ravikiran Channagire & Ramandeep Buttar

ECE7995 : Presentation

Introduction

Previous research and its issues.

Architectural Model

Processor, Cache, Buffer, Memory bus, Main Memory etc.

Methodology

The workload.

The Baseline System

MCPI, Relative MCPI and other terms.

Effects of System Resources on Cache Prefetching

Comparison.

System Design

OPTIprefetch system.

Conclusion

TOPICS

Introduction

Terms used in this paper –

Cache Miss – Address not found in the cache.

Partial Cache Miss – Reference to a block whose prefetch has already been issued to memory but has not yet completed.

True Miss Ratio – Ratio of true cache misses to cache references.

Partial Miss Ratio – Ratio of partial cache misses to the number of cache references.

Total Miss Ratio – Sum of the true miss ratio and the partial miss ratio.

Issued Prefetch – Prefetch sent to the prefetch address buffer by the prefetching unit.

Lifetime of a Prefetch – Time from when the prefetch is sent to memory until its data is loaded.

Useful Prefetch – Prefetched address later referenced by the processor.

Useless Prefetch – Prefetched address never referenced by the processor.

Aborted Prefetch – Issued prefetch discarded while still in the prefetch address buffer.

Prefetch Ratio – Ratio of issued prefetches to cache references.

Success Ratio – Ratio of the total number of useful prefetches to the total number of issued prefetches.

Global Success Ratio – Fraction of cache misses avoided or partially avoided by prefetching.
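As a rough sketch, the ratio definitions above can be written over simple event counters from one simulation run (all counter values are illustrative, and the global-success-ratio line assumes each useful prefetch avoided exactly one would-be miss):

```python
# Illustrative counters for one simulation run (hypothetical values).
cache_references  = 100_000
true_misses       = 4_000   # misses with no prefetch in flight
partial_misses    = 1_000   # misses whose block was already being prefetched
issued_prefetches = 8_000
useful_prefetches = 5_000   # prefetched blocks later referenced

true_miss_ratio    = true_misses / cache_references
partial_miss_ratio = partial_misses / cache_references
total_miss_ratio   = true_miss_ratio + partial_miss_ratio
prefetch_ratio     = issued_prefetches / cache_references
success_ratio      = useful_prefetches / issued_prefetches

# Assumption: each useful prefetch avoided one miss, so the misses that
# would have occurred without prefetching are (true + partial + useful).
global_success_ratio = useful_prefetches / (
    true_misses + partial_misses + useful_prefetches
)
```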

Prefetching : Effective in reducing the cache miss ratio.

Doesn’t always improve CPU performance - why?

Need to determine when and if hardware Prefetching is useful.

How to improve performance?

Double ported address array.

Double ported or fully buffered data array.

Wide bus – Split and non-split operations.

Prefetching can be improved with a significant investment in extra hardware.

Why cache memories?

Factors affecting cache performance.

Block size, Cache Size, Associativity, Algorithms used.

Hardware Cache Prefetching : dedicated hardware w/o software support.

Which block to prefetch? – Simplest choice: the next sequential block.

When to prefetch?

Simple Prefetch algorithms

Always Prefetch.

Prefetch on misses.

Tagged prefetch.

Threaded prefetching.

Bi-directional prefetching

Number of blocks to prefetch – fixed or variable?
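A toy sketch of the first three algorithms above (always, on-miss, tagged), assuming a set of resident block addresses, one-block sequential lookahead, and instantly completing prefetches; all names are illustrative, not from the paper:

```python
def access(cache, tags, block, policy):
    """Return the block prefetched (or None) after referencing `block`.

    cache  - set of resident block addresses (toy fully-associative cache)
    tags   - set of blocks already demand-referenced (used by 'tagged')
    policy - 'always' | 'on_miss' | 'tagged'
    """
    hit = block in cache
    cache.add(block)
    if policy == "always":            # prefetch on every reference
        trigger = True
    elif policy == "on_miss":         # prefetch only when the reference misses
        trigger = not hit
    elif policy == "tagged":          # miss, or first demand hit to a
        trigger = (not hit) or (block not in tags)   # prefetched block
    tags.add(block)
    nxt = block + 1                   # next sequential block
    if trigger and nxt not in cache:
        cache.add(nxt)                # model the prefetch completing instantly
        return nxt
    return None
```

Tagged prefetch extends prefetch-on-miss by also triggering on the first demand hit to a block that arrived via a prefetch, which is what the `tags` set tracks here.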

Disadvantages of Cache Prefetching –

Increase in memory traffic because of prefetches which are never referenced.

Cache pollution – useless prefetches displacing useful blocks. Most hazardous when cache sizes are small and block sizes are large.

Factors degrading performance even when the miss ratio is decreased –

Address tag array busy due to prefetch lookups.

Cache data array busy due to prefetch loads & replacements.

Memory bus busy due to prefetch address transfers & data fetches.

Memory system busy due to prefetch fetches and replacements.

Architectural Model

Processor :

Five stages in the pipeline

Instruction fetch.

Instruction decode.

Read or write to memory.

ALU computation.

Register file update.

Cache : Single or Double Ported

Instruction Cache

Data Cache

Write Buffer :

Prefetching Units :

One for each cache.

Receives information like cache miss, cache hit, instruction type, branch target address.

What if buffer is full?

Memory Bus

Split and Non-split transactions

Bus Arbitrator

Main Memory

Methodology

Methodology used :

25 commonly used real programs from 5 workload categories

Computer Aided Design (CAD)

Compiler Related Tools (COMP)

Floating Point Intensive Applications (FP)

Text Processing Programs (TEXT)

Unix Utilities (UNIX)

The Baseline System

Default System Configuration for the Baseline System

Cycles Per Instruction contributed by memory accesses (MCPI)

Why is MCPI preferred over cache miss ratio / memory access time?

Covers every aspect of memory performance.

Excludes aspects of performance which cannot be affected by cache prefetching strategies, e.g. the efficiency of instruction pipelining.

Relative MCPI – MCPI of the prefetching system divided by the MCPI of the baseline system; should be smaller than 1.
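A back-of-the-envelope sketch of how MCPI and relative MCPI relate (cycle counts are illustrative, not from the paper):

```python
# Hypothetical cycle counts from two simulation runs of the same program.
instructions = 1_000_000
memory_stall_cycles_baseline = 800_000   # no prefetching
memory_stall_cycles_prefetch = 520_000   # with a prefetching strategy

mcpi_baseline = memory_stall_cycles_baseline / instructions   # 0.80
mcpi_prefetch = memory_stall_cycles_prefetch / instructions   # 0.52

# Relative MCPI < 1 means prefetching helped.
relative_mcpi = mcpi_prefetch / mcpi_baseline                 # 0.65
```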

Relative and Absolute MCPI

A cache prefetching strategy is very sensitive to the type of program

the processor is running.

CPU Stall Breakdown

Instruction and Data Cache Miss Ratios

True Miss Ratio – Ratio of the total number of true cache misses to the total number of processor references to the cache.

Partial Miss Ratio – Ratio of the total number of partial cache misses to the total number of processor references to the cache.

Total Miss Ratio – Sum of the true miss ratio and the partial miss ratio.

Success Ratio and Global Success Ratios

Success Ratio – Ratio of the total number of useful prefetches to the total number of prefetches issued.

GSR – Fraction of cache misses which are avoided or partially avoided.

Average Distribution of Prefetches.

Useful Prefetches

Useless Prefetches

Aborted Prefetches.

Major limitations which reduce the effectiveness of cache prefetching are conflicts and delays in accessing the caches, the data bus and the main memory.

Ideal system characteristics -

Ideal Cache

Special access port to the tag array for prefetch lookups

Special access port to the data array for prefetch loads.

No need to buffer the prefetched blocks.

Ideal Data Bus

Private bus connecting the Prefetching unit and main memory

Ideal Main Memory

Dual ported; the memory banks take zero time to access.

Effects of System Resources on Cache Prefetching

Effect of different design patterns

Single vs. Double Ported Cache Tag Arrays

Single vs. Double Ported Cache Data Arrays and Buffering

Cache Size

Cache Block Size

Cache Associativity

Split vs. Non-split Bus Transaction

Bus Width

Memory Latency

Number of Memory Banks

Bus Traffic

Prefetch Look-ahead Distance

Single vs. Double Ported Cache Tag Arrays

The double ported cache tag array gives the best results with the ‘bi-dir’ strategy, followed by the ‘always’ and ‘thread’ strategies.

If a prefetch strategy looks up the cache tag arrays frequently, extra access ports to the tag array are vital.

Single vs. Double Ported Cache Data Arrays and Buffering

Porting of the cache data arrays is far less important than that of the cache tag arrays, because prefetch strategies access the data arrays less frequently.

For a single ported data array the relative MCPI is > 1, but it drops when the array is double ported.

Conflicts over the cache port vanish for a double ported data array – no stalls.

Drawback – adding an extra port to the cache data array is very costly.

More practical solution – provide some buffering for the data arrays.

Result – for a single ported, buffered data array, the relative MCPI decreases by 0.26 on average, which is almost as good as when the cache data arrays are double ported.

Buffered data array

Cache Size

Performance of a prefetch strategy improves as the cache size increases.

For large caches, a prefetched block resides in the cache for a longer period.

The bigger the cache, the fewer the cache misses; hence, prefetches are less likely to interfere with normal cache operation.

Cache Block Size

For 16 or 32 byte cache blocks, most prefetching strategies perform better than no prefetching.

MCPI increases with block size – a result of more cache port conflicts.

Cache Associativity

Relative MCPI decreases by 0.07 on average when the cache changes from direct mapped to two-way set associative.

MCPI remains almost constant as cache associativity increases further.

Split vs. Non-split Bus Transaction

• Relative MCPI decreases by 14% on average when the bus transaction changes from non-split to split.

• Reason – most aborted prefetches become useful prefetches when the data bus supports split transactions.

Bus Width

• As the bus width increases, the relative MCPI begins to fall below one (prefetching beats the baseline).

• Reason – there are fewer cache port conflicts.

• Assumption – the cache data port must be as wide as the data bus.

Memory Latency (CPU Cycles)

Two reasons account for these U-shaped curves –

Fewer cache port conflicts.

More data bus conflicts.

• Relative MCPI decreases as the memory latency increases from 8 to 64 processor cycles, but it starts to rise for further increase in latency

Number of Memory Banks – (Bus transaction is nonsplit)

• Relative MCPI decreases by 0.18 on average when the number of memory banks increases from 1 to 4.

• Multiple prefetches can occur in parallel when there are more memory banks.

Bus Traffic – Bus Utilization by DMA (%)

As traffic increases, Relative MCPI converges to 1 because there is less and less bus bandwidth available to send prefetch requests to the main memory.

In the baseline system, heavier bus traffic helps to reduce the number of undesirable prefetches and, hence, relative MCPI decreases.

Prefetch Lookahead Distance (in blocks)

Better performance could be achieved by prefetching block p + LA instead of p.

p – original block address requested by the prefetching strategy; LA – lookahead distance in blocks.

For all strategies except thread, relative MCPI rises with increasing lookahead distance.

Reason – with an increase in LA, the effect of spatial locality diminishes.
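The p + LA rule above is just an offset on the block address; a minimal sketch, assuming 32-byte blocks (an illustrative value, not from the paper):

```python
BLOCK_SIZE = 32  # bytes per cache block (illustrative)

def prefetch_block(addr: int, lookahead: int = 1) -> int:
    """Block address to prefetch for a reference to byte address `addr`:
    the referenced block p plus the lookahead distance LA."""
    p = addr // BLOCK_SIZE   # block containing the reference
    return p + lookahead     # prefetch block p + LA
```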

The System Design

Worst and Best Values for Each System Parameter in terms of Effectiveness of Prefetching

When the system parameters are changed from their worst values to their best values, the improvement in the performance of cache prefetching ranges from 7% to as high as 90%.

Relative and Absolute MCPI for the OPTIprefetch System

All prefetching strategies perform better than when there is no prefetching.

The relative MCPI for all strategies averages 0.65 -- prefetching reduces MCPI by 35% on average relative to the corresponding baseline system.

The OPTIprefetch system favors aggressive strategies like always and tag, which issue a lot of prefetches.

Stall Breakdown in % for the OPTIprefetch System

There are no conflicts over the cache data ports as the data arrays are dual ported.

As 16 memory banks are available, conflicts over memory banks occur very rarely.

Average distribution of Prefetches

Baseline System

OPTIprefetch System

Due to the high bus bandwidth, most prefetch requests can be sent to the main memory successfully and almost no prefetches are aborted.

Performance of Cache Prefetching in Some Possible Systems

Possible System Design

Prefetching improves performance in all systems except systems D and G, where the data bus bandwidth is too small.

Conclusion

Prefetching can reduce average memory latency provided the system has appropriately designed hardware.

For a cache to prefetch effectively –

Cache tag arrays should be double ported.

Data arrays should be either double ported or buffered.

Cache should be at least two-way set associative.

Most effective when –

Cache size is large.

Block size is small.

Memory bus is wide.

Bus transactions are split.

Main memory is interleaved.