Informed Prefetching and Caching

R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, Jim Zelenka

Transcript of Informed Prefetching and Caching

Page 1: Informed Prefetching and Caching

Informed Prefetching and Caching

R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, Jim Zelenka

Page 2: Informed Prefetching and Caching

Contribution

Basic functions of a file system:
Management of disk accesses
Management of main-memory file buffers

Approach: use hints from I/O-intensive applications to prefetch aggressively enough to eliminate I/O stall time while maximizing buffer availability for caching

How to allocate cache buffers dynamically among competing hinting and non-hinting applications for the greatest performance benefit:
Balance caching against prefetching
Distribute cache buffers among competing applications

Page 3: Informed Prefetching and Caching

Motivation

Storage parallelism; CPU-I/O performance dependence; cache-hit ratios

I/O-intensive applications:
Amount of data processed >> file cache size
Locality is poor or limited
Accesses are frequently non-sequential
I/O stall time is a large fraction of total execution time
Access patterns are largely predictable

How can I/O workloads be improved to take full advantage of the hardware that already exists?

Page 4: Informed Prefetching and Caching

ASAP: the four virtues of I/O workloads

Avoidance: not a scalable solution to the I/O bottleneck
Sequentiality: scales for writes but not for reads
Asynchrony: scalable through write buffering; scaling for reads depends on prefetching aggressiveness
Parallelism: scalable for explicitly parallel I/O requests; for serial workloads, scalable parallelism is achieved by scaling the number of asynchronous requests

Asynchrony eliminates write latency, and parallelism provides throughput. No existing technique scalably relieves the I/O bottleneck for reads. Hence: aggressive prefetching.

Page 5: Informed Prefetching and Caching

Prefetching

Aggressive prefetching for reads; write buffering for writes

Page 6: Informed Prefetching and Caching

Hints

Hints based on historical information:
LRU cache replacement algorithm
Sequential readahead: prefetching up to 64 blocks ahead when long sequential runs are detected

Disclosure: hints based on advance knowledge
A mechanism for portable I/O optimizations
Provides evidence for a policy decision
Conforms to the software engineering principle of modularity
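As a sketch of what disclosure might look like from the application side (the disclose() call and its segment format below are hypothetical illustrations, not TIP's actual interface), the application announces the accesses it will make and leaves the policy decisions to the file system:

    # Hypothetical disclosure call: report future reads as (path, offset, length)
    # segments, in the order they will be issued; a real system would pass these
    # to the kernel (e.g., via an ioctl on the hinted files).
    def disclose(segments):
        for path, offset, length in segments:
            pass  # no-op in this sketch

    # Announce the accesses before performing them; the reads themselves are unchanged.
    disclose([("input.%d" % i, 0, 8192) for i in range(4)])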

Page 7: Informed Prefetching and Caching

Informed Prefetching

System: TIP-1, implemented in OSF/1 (which already has two I/O optimizations)

Applications: 5 I/O-intensive benchmarks, single-threaded, with data fetched from the file system

Hardware: DEC 3000/500 workstation, one 150 MHz 21064 processor, 128 MB RAM, 5 KZTSA fast SCSI-2 adapters, each hosting 3 HP2247 1 GB disks; 12 MB (1536 x 8 KB) file cache

Stripe unit: 64 KB. Cluster prefetch: 5 prefetches. Disk scheduler: striper SCAN

[Diagram: TIP-1 buffer management — an LRU queue plus unread hinted prefetch buffers, capped at 512 buffers (1/3 of the cache)]

Page 8: Informed Prefetching and Caching

Agrep

agrep searches for "woodworking" in 224 newsgroup messages totaling 358 disk blocks

The files are read from beginning to end

Page 9: Informed Prefetching and Caching

Agrep (cont’d)

Elapsed time for the sum of 4 searches is reduced by up to 84%

Page 10: Informed Prefetching and Caching

Postgres

Join of two relations:
Outer relation: 20,000 unindexed tuples (3.2 MB)
Inner relation: 200,000 tuples (32 MB) with a 5 MB index
Output: about 4,000 tuples, written sequentially

Page 11: Informed Prefetching and Caching

Postgres (cont’d)

Page 12: Informed Prefetching and Caching

Postgres (cont’d)

Elapsed time reduced by up to 55%

Page 13: Informed Prefetching and Caching

MCHF Davidson algorithm

MCHF: a suite of computational-chemistry programs used for atomic-structure calculations

Davidson algorithm: an element of MCHF that computes, by successive refinement, the extreme eigenvalue-eigenvector pairs of a large, sparse, real, symmetric matrix stored on disk

Matrix size: 17 MB

The algorithm repeatedly accesses the same large file sequentially.

Page 14: Informed Prefetching and Caching

MCHF Davidson algorithm (cont’d)

Page 15: Informed Prefetching and Caching

MCHF Davidson algorithm (cont’d)

Hints disclose only sequential access in one large file.

OSF/1’s aggressive readahead performs better than TIP-1.

Neither OSF/1 nor informed prefetching alone uses the 12 MB of cache buffers well:
the LRU replacement algorithm flushes all of the blocks before any of them are reused.

Page 16: Informed Prefetching and Caching

Informed caching

Goal: allocate cache buffers to minimize application elapsed time

Approach: estimate the impact on execution time of alternative buffer allocations and then choose the best allocation

3 broad uses for each buffer:
Caching recently used data in the traditional LRU queue
Prefetching data according to hints
Caching data that a predictor indicates will be reused in the future

Page 17: Informed Prefetching and Caching

Three uses of cache buffers

Difficult to estimate the performance of allocations at a global level

Page 18: Informed Prefetching and Caching

Cost-benefit analysis

System model: from which the various cost and benefit estimates are derived

Derivations: cost and benefit estimates for each component

Comparison: how to compare the estimates at a global level to find the globally least valuable buffer and the globally most beneficial consumer

Page 19: Informed Prefetching and Caching

System assumptions

A modern OS with a file buffer cache, running on a uniprocessor with enough memory to make a substantial number of cache buffers available

The workload emphasizes read-intensive applications

Every application I/O access requests a single file block that can be read in a single disk access, and requests are not too bursty

System parameters are constant

Enough disk parallelism that there is no congestion

Page 20: Informed Prefetching and Caching

System model

T = N_I/O (T_CPU + T_I/O)

where T is the elapsed time, N_I/O is the number of I/O requests, T_CPU is the average application CPU time between requests, and T_I/O is the average time to service an I/O request.

T_I/O = T_hit for a cache hit
T_miss = T_hit + T_driver + T_disk for a cache miss

T_driver is the overhead of allocating a buffer, queuing the request at the drive, and servicing the interrupt when the I/O completes.
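A small worked example of this model in code (the parameter values below are purely illustrative placeholders, not the paper's measured constants):

    # Illustrative parameters, not measurements from the paper.
    T_CPU = 200e-6      # average application CPU time between requests (s)
    T_HIT = 250e-6      # service time for a buffer-cache hit (s)
    T_DRIVER = 500e-6   # driver overhead for a miss (s)
    T_DISK = 15e-3      # average disk access time (s)
    T_MISS = T_HIT + T_DRIVER + T_DISK

    def elapsed_time(n_io, hit_ratio):
        """T = N_I/O (T_CPU + T_I/O), with T_I/O averaged over hits and misses."""
        t_io = hit_ratio * T_HIT + (1.0 - hit_ratio) * T_MISS
        return n_io * (T_CPU + t_io)

    print(elapsed_time(n_io=100_000, hit_ratio=0.7))  # elapsed seconds for this workload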

Page 21: Informed Prefetching and Caching

Cost of deallocating an LRU buffer

Given H(n), the cache-hit ratio with n buffers:

T_LRU(n) = H(n) T_hit + (1 - H(n)) T_miss

The cost of taking one buffer away from the LRU cache is the resulting increase in average service time:

ΔT_LRU(n) = T_LRU(n-1) - T_LRU(n) = ΔH(n) (T_miss - T_hit)

where ΔH(n) = H(n) - H(n-1).
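A minimal sketch of this estimate in code, assuming H is an empirically tracked hit-ratio function (e.g., built from hit counts at each LRU queue position):

    def lru_shrink_cost(H, n, T_hit, T_miss):
        """Expected increase in service time per demand access if the LRU
        cache shrinks from n to n-1 buffers: ΔH(n) * (T_miss - T_hit)."""
        delta_H = H(n) - H(n - 1)
        return delta_H * (T_miss - T_hit)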

Page 22: Informed Prefetching and Caching

The benefit of prefetching

Prefetching a block can mask some of the latency of a disk read, so T_disk is an upper bound on the benefit of prefetching a block.

If the prefetch can be delayed and still complete before the block is needed, we consider there to be no benefit from starting the prefetch now.

[Timeline: a prefetch issued now is consumed x requests later]

The benefit of prefetching now a block that will be needed x accesses in the future:

B_x = T_disk - x (T_CPU + T_hit + T_driver)

Assuming T_CPU ≈ 0 and T_driver ≈ 0:

B_x ≈ T_disk - x T_hit, and the prefetch horizon is P = T_disk / T_hit

There is no benefit from prefetching further ahead than P (for x ≥ P, B_x = 0).
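A direct transcription of these two quantities (keeping the T_CPU and T_driver terms rather than assuming them zero):

    import math

    def prefetch_horizon(T_disk, T_cpu, T_hit, T_driver):
        """Accesses ahead beyond which starting a prefetch earlier adds no benefit."""
        return math.ceil(T_disk / (T_cpu + T_hit + T_driver))

    def prefetch_benefit(x, T_disk, T_cpu, T_hit, T_driver):
        """B_x: benefit of starting now a prefetch for a block needed x accesses from now."""
        return max(0.0, T_disk - x * (T_cpu + T_hit + T_driver))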

Page 23: Informed Prefetching and Caching

The prefetch horizon

Page 24: Informed Prefetching and Caching

Comparison of LRU cost to prefetching benefit

Shared resource: cache buffers
Common currency: change in I/O service time per buffer, weighted by each stream's access rate

With r_h the rate of hinted accesses and r_d the rate of unhinted demand accesses, a buffer should be reallocated from the LRU cache to prefetching when

r_h (B_x / x) > r_d ΔH(n) (T_miss - T_hit)
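A sketch of this test as a predicate (the access rates r_hinted and r_demand are assumed to be measured elsewhere in the system):

    def should_take_lru_buffer_for_prefetch(r_hinted, r_demand, B_x, x,
                                            delta_H, T_hit, T_miss):
        """True if a prefetch x accesses ahead is worth more than the
        least valuable LRU buffer it would consume."""
        prefetch_benefit_rate = r_hinted * (B_x / x)
        lru_cost_rate = r_demand * delta_H * (T_miss - T_hit)
        return prefetch_benefit_rate > lru_cost_rate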

Page 25: Informed Prefetching and Caching

The cost of flushing a hinted block

When should we flush a hinted block?

[Timeline: the block is flushed y accesses before its hinted access; the prefetch back must be issued no later than P accesses before that access, so the buffer is free for y - P accesses]

Cost:

T_flush = T_driver            when y > P
T_flush = T_driver + B_y      when 1 ≤ y ≤ P

Beyond the prefetch horizon, flushing costs only the extra driver overhead of prefetching the block back; within the horizon, it also costs the stall B_y that can no longer be hidden.
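The same estimate as a self-contained sketch:

    def flush_cost(y, P, T_disk, T_cpu, T_hit, T_driver):
        """Cost of ejecting a hinted block whose next access is y accesses away."""
        if y > P:
            # The block can be prefetched back in time; only the driver overhead is lost.
            return T_driver
        # Within the prefetch horizon: driver overhead plus the stall B_y that
        # can no longer be hidden by prefetching.
        B_y = max(0.0, T_disk - y * (T_cpu + T_hit + T_driver))
        return T_driver + B_y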

Page 26: Informed Prefetching and Caching

Putting it all together: global min-max

Three estimates: ΔT_LRU, B_x, T_flush

Which block should be replaced when a buffer is needed for prefetching or to service a demand request? The globally least valuable block in the cache.

Should a cache buffer be used to prefetch data now? Prefetch if the expected benefit is greater than the expected cost of flushing or stealing the least valuable block.

Separate estimators for the LRU cache and for each independent stream of hints

Page 27: Informed Prefetching and Caching

Value estimators

LRU cache: for the block in the i-th position of the LRU queue,

value = ΔH(i) (T_miss - T_hit)

Hint estimators: for a block whose next hinted access is y accesses away,

value = T_driver            when y > P
value = T_driver + B_y      when 1 ≤ y ≤ P

Global value = max(value_LRU, value_hint)
Globally least valuable block = the block with the minimum global value

A global min-max valuation of blocks
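A simplified sketch of the min-max bookkeeping (the data layout here is illustrative; each cached block is assumed to carry the value assigned by its LRU estimator and/or its hint estimator):

    def global_value(lru_value, hint_value):
        """A block cached for several reasons is worth the most valuable of them."""
        values = [v for v in (lru_value, hint_value) if v is not None]
        return max(values) if values else 0.0

    def least_valuable_block(blocks):
        """blocks: iterable of (block_id, lru_value_or_None, hint_value_or_None).
        Returns the id of the block with the minimum global value."""
        return min(blocks, key=lambda b: global_value(b[1], b[2]))[0]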

Page 28: Informed Prefetching and Caching

Informed caching example: MRU

The informed cache manager discovers MRU caching without being specifically coded to implement this policy.
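Why MRU falls out of the valuation: in a cyclic sequential scan of a file larger than the cache, the block just consumed is the one whose next hinted access is farthest away (largest y), so it has the lowest hint value and is chosen for eviction. A toy sketch under that assumption:

    def mru_victim(cached_blocks, current_block, num_blocks):
        """For a cyclic scan over num_blocks blocks, evict the cached block whose
        next access is farthest in the future; this is the most recently used one."""
        def next_access_distance(b):
            return (b - current_block) % num_blocks or num_blocks
        return max(cached_blocks, key=next_access_distance)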

Page 29: Informed Prefetching and Caching

Implementation of informed caching and prefetch

Page 30: Informed Prefetching and Caching

Implementation of informed caching and prefetch(cont’d)

Page 31: Informed Prefetching and Caching

Performance improvement by informed caching

Page 32: Informed Prefetching and Caching

Balance contention

Page 33: Informed Prefetching and Caching

Future work

Richer hint languages for disclosing future accesses

Strategies for dealing with imprecise but still useful hints

Cost-benefit model adapted to non-uniform bandwidths

Extensibility, e.g. a VM estimator to track VM pages