
Transcript
Page 1: Informed Prefetching and Caching

Informed Prefetching and Caching

R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, Jim Zelenka

Page 2: Informed Prefetching and Caching

Contribution: two basic functions of the file system:
- Management of disk accesses
- Management of main-memory file buffers

Approach: use hints from I/O-intensive applications to prefetch aggressively enough to eliminate I/O stall time while maximizing buffer availability for caching.

How to allocate cache buffers dynamically among competing hinting and non-hinting applications for the greatest performance benefit:
- Balance caching against prefetching
- Distribute cache buffers among competing applications

Page 3: Informed Prefetching and Caching

Motivation
- Growing storage parallelism
- CPU performance increasingly depends on I/O performance
- Caching is limited by achievable cache-hit ratios

I/O-intensive applications:
- Amount of data processed >> file cache size
- Locality is poor or limited
- Accesses are frequently non-sequential
- I/O stall time is a large fraction of total execution time
- Access patterns are largely predictable

How can I/O workloads be improved to take full advantage of the hardware that already exists?

Page 4: Informed Prefetching and Caching

ASAP: the four virtues of I/O workloads

- Avoidance: not a scalable solution to the I/O bottleneck
- Sequentiality: scales for writes but not for reads
- Asynchrony: scalable for writes through write buffering; scaling for reads depends on prefetching aggressiveness
- Parallelism: scalable for explicitly parallel I/O requests; for serial workloads, scalable parallelism is achieved by scaling the number of asynchronous requests

Asynchrony eliminates write latency, and parallelism provides throughput. No existing technique scalably relieves the I/O bottleneck for reads, hence aggressive prefetching.

Page 5: Informed Prefetching and Caching

Prefetching

Aggressive prefetching for reads; write buffering for writes.

Page 6: Informed Prefetching and Caching

Hints

Historical information:
- LRU cache replacement algorithm
- Sequential readahead: prefetching up to 64 blocks ahead when the system detects long sequential runs

Disclosure: hints based on advance knowledge
- A mechanism for portable I/O optimizations
- Provides evidence for a policy decision
- Conforms to the software-engineering principle of modularity
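The disclosure idea can be made concrete with a small sketch. The structure and function below (HintSegment, disclose_hints) are illustrative assumptions, not the actual TIP interface; they only show that a hint describes what will be read, while the prefetching and caching decisions stay with the file system.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class HintSegment:                # hypothetical hint record, not the TIP API
        path: str                     # file the application will read
        offset: int                   # starting byte offset of the run
        length: int                   # number of bytes it will read sequentially

    def disclose_hints(segments: List[HintSegment]) -> None:
        """Placeholder for handing the future access pattern to the file system.

        In a real system this would be a system call or ioctl; here it only
        records that hints disclose knowledge (what will be read, in what
        order) and leave the policy to the file system.
        """
        for s in segments:
            print(f"hint: {s.path} bytes [{s.offset}, {s.offset + s.length})")

    # Example: an application such as agrep disclosing that it will read each
    # file from beginning to end, in the order it will open them.
    disclose_hints([HintSegment("msg.001", 0, 8192),
                    HintSegment("msg.002", 0, 16384)])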

Page 7: Informed Prefetching and Caching

Informed Prefetching

System: TIP-1, implemented in OSF/1, which has 2 I/O optimizations.
Applications: 5 I/O-intensive benchmarks, single-threaded, with data fetched from the file system.
Hardware: DEC 3000/500 workstation, one 150 MHz 21064 processor, 128 MB RAM, 5 KZTSA fast SCSI-2 adapters, each hosting 3 HP 2247 1 GB disks; 12 MB (1536 x 8 KB) cache.
Stripe unit: 64 KB. Cluster prefetch: 5 prefetches. Disk scheduler: striper SCAN.
At most 512 buffers (1/3 of the cache) hold unread hinted prefetches; when a hinted block is read, count_unread_buffers is decremented and the buffer joins the LRU list.

Page 8: Informed Prefetching and Caching

Agrep

agrep woodworking 224_newsgroup_msg: 358 disk blocks, read from the beginning to the end.

Page 9: Informed Prefetching and Caching

Agrep (cont’d)

Elapsed time for the sum of 4 searches is reduced by up to 84%.

Page 10: Informed Prefetching and Caching

Postgres

Join of two relations:
- Outer relation: 20,000 unindexed tuples (3.2 MB)
- Inner relation: 200,000 tuples (32 MB) with a 5 MB index
- Output: about 4,000 tuples, written sequentially

Page 11: Informed Prefetching and Caching

Postgres (cont’d)

Page 12: Informed Prefetching and Caching

Postgres (cont’d)

Elapsed time reduced by up to 55%.

Page 13: Informed Prefetching and Caching

MCHF Davidson algorithm

MCHF: a suite of computational-chemistry programs used for atomic-structure calculations.
Davidson algorithm: an element of MCHF that computes, by successive refinement, the extreme eigenvalue-eigenvector pairs of a large, sparse, real, symmetric matrix stored on disk.
Matrix size: 17 MB. The algorithm repeatedly accesses the same large file sequentially.

Page 14: Informed Prefetching and Caching

MCHF Davidson algorithm (cont’d)

Page 15: Informed Prefetching and Caching

MCHF Davidson algorithm (cont’d)

- Hints disclose only sequential access in one large file.
- OSF/1’s aggressive readahead performs better than TIP-1.
- Neither OSF/1 nor informed prefetching alone uses the 12 MB of cache buffers well: the LRU replacement algorithm flushes all of the blocks before any of them are reused.

Page 16: Informed Prefetching and Caching

Informed caching

Goal: allocate cache buffers to minimize application elapsed time.
Approach: estimate the impact on execution time of alternative buffer allocations, then choose the best allocation.

3 broad uses for each buffer:
- Caching recently used data in the traditional LRU queue
- Prefetching data according to hints
- Caching data that a predictor indicates will be reused in the future

Page 17: Informed Prefetching and Caching

Three uses of cache buffers

Difficult to estimate the performance of allocations at a global level

Page 18: Informed Prefetching and Caching

Cost-benefit analysis

- System model: from which the various cost and benefit estimates are derived
- Derivations: for each component
- Comparison: how to compare the estimates at a global level, to find the globally least valuable buffer and the globally most beneficial consumer

Page 19: Informed Prefetching and Caching

System assumptions

- A modern OS with a file buffer cache, running on a uniprocessor with enough memory to make a substantial number of cache buffers available
- A workload emphasizing read-intensive applications
- Every application I/O access requests a single file block that can be read in a single disk access, and requests are not too bursty
- System parameters are constant
- Enough disk parallelism, no congestion

Page 20: Informed Prefetching and Caching

System model

Elapsed time: T = N_{I/O} (T_CPU + T_{I/O})

where N_{I/O} is the number of I/O requests, T_CPU is the average application CPU time between requests, and T_{I/O} is the average time to service an I/O request.

On a cache hit, T_{I/O} = T_hit. On a miss, T_{I/O} = T_miss = T_hit + T_driver + T_disk, where T_driver is the overhead of allocating a buffer, queuing the request at the drive, and servicing the interrupt when the I/O completes.
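As a quick sanity check of the model, a toy evaluation is sketched below; all parameter values (T_hit, T_driver, T_disk, T_CPU, N_{I/O}) are made-up illustrations, not measurements from the paper.

    # Toy evaluation of the elapsed-time model T = N_IO * (T_CPU + T_IO),
    # where T_IO depends on the cache-hit ratio. Times in ms, values assumed.
    T_HIT, T_DRIVER, T_DISK = 0.2, 0.5, 15.0   # per-request costs (illustrative)
    T_CPU = 1.0                                # app CPU time between requests
    N_IO = 10_000                              # number of I/O requests

    def elapsed_time(hit_ratio: float) -> float:
        """Elapsed time when a fraction hit_ratio of requests hit the cache."""
        t_miss = T_HIT + T_DRIVER + T_DISK
        t_io = hit_ratio * T_HIT + (1.0 - hit_ratio) * t_miss
        return N_IO * (T_CPU + t_io)

    # Raising the hit ratio from 0.5 to 0.9 removes most of the I/O stall time.
    print(elapsed_time(0.5), elapsed_time(0.9))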

Page 21: Informed Prefetching and Caching

Cost of deallocating an LRU buffer

Given H(n), the cache-hit ratio with n buffers:

T_LRU(n) = H(n) T_hit + (1 − H(n)) T_miss

ΔT_LRU(n) = T_LRU(n − 1) − T_LRU(n) = ΔH(n) (T_miss − T_hit), where ΔH(n) = H(n) − H(n − 1).
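A sketch of how this LRU cost estimate could be computed; the hit-ratio profile H(n) below is a synthetic assumption used only to illustrate the diminishing marginal value of LRU buffers.

    # Cost of shrinking the LRU cache from n to n-1 buffers:
    #   dT_LRU(n) = (H(n) - H(n-1)) * (T_miss - T_hit)
    T_HIT, T_MISS = 0.2, 15.7                    # illustrative times (ms)

    def hit_ratio(n: int) -> float:
        """Assumed, synthetic hit-ratio profile with diminishing returns."""
        return 0.95 * (1.0 - 0.999 ** n)

    def lru_deallocation_cost(n: int) -> float:
        """Expected increase in average I/O time per access if buffer n is taken away."""
        return (hit_ratio(n) - hit_ratio(n - 1)) * (T_MISS - T_HIT)

    # The marginal cost shrinks as the cache grows, so the deepest LRU buffers
    # are the cheapest ones to hand over to prefetching.
    print([round(lru_deallocation_cost(n), 4) for n in (64, 256, 1024)])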

Page 22: Informed Prefetching and Caching

The benefit of prefetching

Prefetching a block can mask some of the latency of a disk read; T_disk is therefore an upper bound on the benefit of prefetching a block. If the prefetch can be delayed and still complete before the block is needed, we consider there to be no benefit from starting the prefetch now.

[Figure: timeline of a prefetch issued x requests before the block is consumed.]

The benefit of prefetching a block that will be needed x accesses in the future:

B_x = T_disk − x (T_CPU + T_hit + T_driver)

Assuming T_CPU ≈ 0 and T_driver ≈ 0:

B_x = T_disk − x T_hit, and the prefetch horizon is P = T_disk / T_hit.

There is no benefit from prefetching further than P accesses ahead.
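A sketch of the benefit bound and prefetch horizon under the slide's simplifying assumptions (T_CPU ≈ 0, T_driver ≈ 0); the times below are illustrative, not measured.

    # Prefetch horizon P = T_disk / T_hit: a prefetch issued at least P accesses
    # in advance completes before it is needed, so issuing it earlier buys nothing.
    T_HIT, T_DISK = 0.2, 15.0                  # illustrative times (ms)

    def prefetch_horizon() -> float:
        return T_DISK / T_HIT

    def prefetch_benefit(x: int) -> float:
        """Upper bound on the stall masked by prefetching x accesses in advance."""
        return max(0.0, T_DISK - x * T_HIT)    # B_x, clamped at zero beyond P

    P = prefetch_horizon()                     # 75 accesses with these numbers
    print(P, prefetch_benefit(1), prefetch_benefit(int(P) + 10))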

Page 23: Informed Prefetching and Caching

The prefetch horizon

Page 24: Informed Prefetching and Caching

Comparison of LRU cost to prefetching benefit

Shared resource: cache buffers. Common currency: change in I/O service time per buffer per access.

Let r_h be the rate of hinted accesses and r_d the rate of unhinted demand accesses. A buffer should be reallocated from the LRU cache to prefetching when the benefit outweighs the cost:

r_h (B_x / x) > r_d ΔH(n) (T_miss − T_hit)
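The decision rule above can be sketched directly; the access rates and the hit-ratio profile here are assumptions for illustration.

    # Reallocate a buffer from the LRU cache to prefetching when the hinted
    # benefit outweighs the demand cost: r_h * (B_x / x) > r_d * dT_LRU(n).
    T_HIT, T_MISS, T_DISK = 0.2, 15.7, 15.0    # illustrative times (ms)

    def should_prefetch(x: int, n: int, r_h: float, r_d: float, H) -> bool:
        benefit = max(0.0, T_DISK - x * T_HIT) / x             # B_x / x
        cost = (H(n) - H(n - 1)) * (T_MISS - T_HIT)            # dT_LRU(n)
        return r_h * benefit > r_d * cost

    # Example with an assumed hit-ratio curve and equal hinted/unhinted rates.
    H = lambda n: 0.95 * (1.0 - 0.999 ** n)
    print(should_prefetch(x=10, n=1500, r_h=1.0, r_d=1.0, H=H))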

Page 25: Informed Prefetching and Caching

The cost of flushing a hinted block

When should we flush a hinted block?

[Figure: timeline from the flush to the hinted access y accesses later; the block must be prefetched back P accesses before its use, so the buffer is free for y − P accesses.]

Cost:

ΔT_flush(y) = T_driver + B_y,       when y ≤ P
ΔT_flush(y) = T_driver / (y − P),   when y > P
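A sketch of the flush-cost estimator as reconstructed above; the piecewise form (stall term inside the horizon, T_driver amortized over y − P accesses beyond it) follows the slide's equation and should be read as an approximation, with illustrative times.

    # Cost of flushing a hinted block whose use is y accesses in the future.
    # Beyond the prefetch horizon P the only cost is the driver overhead of
    # prefetching it back, amortized over the y - P accesses the buffer is free;
    # inside the horizon, flushing also incurs the stall B_y.
    T_HIT, T_DRIVER, T_DISK = 0.2, 0.5, 15.0   # illustrative times (ms)
    P = T_DISK / T_HIT                          # prefetch horizon

    def flush_cost(y: int) -> float:
        if y <= P:
            return T_DRIVER + max(0.0, T_DISK - y * T_HIT)   # T_driver + B_y
        return T_DRIVER / (y - P)

    # Blocks needed soon are very expensive to flush; blocks needed far in the
    # future are nearly free, which is what makes hinted caching selective.
    print(round(flush_cost(10), 2), round(flush_cost(500), 4))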

Page 26: Informed Prefetching and Caching

Putting it all together: global min-max

3 estimates: ΔT_LRU, B_x, ΔT_flush

Which block should be replaced when a buffer is needed for prefetching or to service a demand request? The globally least valuable block in the cache.

Should a cache buffer be used to prefetch data now? Prefetch if the expected benefit is greater than the expected cost of flushing or stealing the least valuable block.

Separate estimators for the LRU cache and for each independent stream of hints.

Page 27: Informed Prefetching and Caching

Value estimators

LRU cache: the block in the i-th position of the LRU queue has value

Value_LRU(i) = ΔH(i) (T_miss − T_hit)

Hint estimators: the value of a cached hinted block that will be accessed y accesses from now is

Value_hint(y) = T_driver + B_y,       when y ≤ P
Value_hint(y) = T_driver / (y − P),   when y > P

Global value = max(value_LRU, value_hint)
Globally least valuable block = the block with the minimum global value

A global min-max valuation of blocks
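A toy sketch of the global min-max valuation: every cached block is valued by its estimator(s), and the replacement victim is the block whose value is globally smallest. The block descriptions, marginal hit-ratio profile, and numbers are assumptions for illustration.

    # Global min-max: value each cached block via its estimator(s) and evict the
    # block whose value is globally smallest.
    T_HIT, T_MISS, T_DRIVER, T_DISK = 0.2, 15.7, 0.5, 15.0   # illustrative (ms)
    P = T_DISK / T_HIT                                        # prefetch horizon

    def lru_value(i: int) -> float:
        dH = 0.95 * (0.999 ** (i - 1)) * 0.001   # assumed marginal hit ratio at depth i
        return dH * (T_MISS - T_HIT)

    def hint_value(y: int) -> float:
        if y <= P:
            return T_DRIVER + max(0.0, T_DISK - y * T_HIT)   # T_driver + B_y
        return T_DRIVER / (y - P)

    def least_valuable(blocks):
        """blocks: list of (name, lru_depth_or_None, hinted_distance_or_None)."""
        def value(block):
            _, i, y = block
            candidates = []
            if i is not None:
                candidates.append(lru_value(i))
            if y is not None:
                candidates.append(hint_value(y))
            return max(candidates)               # global value = max of estimators
        return min(blocks, key=value)[0]

    # A hinted block not needed for 2000 accesses is cheaper to evict than either
    # a shallow LRU block or a hinted block needed within the horizon.
    print(least_valuable([("lru_50", 50, None),
                          ("hint_2000", None, 2000),
                          ("hint_30", None, 30)]))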

Page 28: Informed Prefetching and Caching

Informed caching example: MRU

The informed cache manager discovers MRU caching without being specifically coded to implement this policy: during a repeated sequential scan of a file larger than the cache, the blocks just read are the ones whose next hinted use is furthest away, so they are the least valuable and are flushed first.

Page 29: Informed Prefetching and Caching

Implementation of informed caching and prefetch

Page 30: Informed Prefetching and Caching

Implementation of informed caching and prefetch(cont’d)

Page 31: Informed Prefetching and Caching

Performance improvement by informed caching

Page 32: Informed Prefetching and Caching

Balance contention

Page 33: Informed Prefetching and Caching

Future work

- Richer hint languages to disclose future accesses
- Strategies for dealing with imprecise but still useful hints
- Cost-benefit model adapted to non-uniform bandwidths
- Extensibility, e.g., a VM estimator to track VM pages