Informed Prefetching and Caching
R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, Jim Zelenka
Contribution
Basic functions of a file system:
  Management of disk accesses
  Management of main-memory file buffers
Approach: use hints from I/O-intensive applications to prefetch aggressively enough to eliminate I/O stall time while maximizing buffer availability for caching
How to allocate cache buffers dynamically among competing hinting and non-hinting applications for the greatest performance benefit:
  Balance caching against prefetching
  Distribute cache buffers among competing applications
Motivation
Storage: parallelism
CPU: I/O performance dependence
Cache: cache-hit ratios
I/O-intensive applications:
  Amount of data processed >> file cache size
  Locality is poor or limited
  Frequently non-sequential accesses
  Large I/O stall time / total execution time
  Access patterns are largely predictable
How can I/O workloads be improved to take full advantage of the hardware that already exists?
ASAP: the four virtues of I/O workloads
  Avoidance: not a scalable solution to the I/O bottleneck
  Sequentiality: scales for writes but not for reads
  Asynchrony: scalable through write buffering; scaling for reads depends on prefetching aggressiveness
  Parallelism: scalable for explicitly parallel I/O requests; for serial workloads, scalable parallelism is achieved by scaling the number of asynchronous requests
Asynchrony eliminates write latency, and parallelism provides throughput. No existing techniques scalably relieve the I/O bottleneck for reads.
Aggressive prefetching
Prefetching
  Aggressive prefetching for reads
  Write buffers for writes
Hints
Historical information:
  LRU cache replacement algorithm
  Sequential readahead: prefetching up to 64 blocks ahead when it detects long sequential runs
Disclosure: hints based on advance knowledge
  A mechanism for portable I/O optimizations
  Provides evidence for a policy decision
  Conforms to software engineering principles of modularity
Informed Prefetching
System: TIP-1 implemented in OSF/1, which has 2 I/O optimizations
Applications: 5 I/O-intensive benchmarks, single-threaded, data fetched from the file system
Hardware: DEC 3000/500 workstation, one 150 MHz 21064 processor, 128 MB RAM, 5 KZTSA fast SCSI-2 adapters, each hosting 3 HP2247 1 GB disks, 12 MB (1536 x 8 KB) cache
Stripe unit: 64 KB
Cluster prefetch: 5 prefetches
Disk scheduler: striper SCAN
[Figure: TIP-1 buffer cache. Labels: 512 buffers (1/3 cache); unread hinted prefetch; LRU; count_unread_buffers--]
Agrep
Agrep woodworking 224_newsgroup_msg: 358 disk blocks
Read from the beginning to the end
Agrep (cont’d)
Elapsed time for the sum of 4 searches is reduced by up to 84%
Postgres
Join of two relations
  Outer relation: 20,000 unindexed tuples (3.2 MB)
  Inner relation: 200,000 tuples (32 MB), indexed (5 MB)
  Output: about 4,000 tuples written sequentially
Postgres (cont’d)
Elapsed time reduced by up to 55%
MCHF Davidson algorithm
MCHF: a suite of computational-chemistry programs used for atomic-structure calculations
Davidson algorithm: an element of MCHF that computes, by successive refinement, the extreme eigenvalue-eigenvector pairs of a large, sparse, real, symmetric matrix stored on disk
Matrix size: 17 MB
The algorithm repeatedly accesses the same large file sequentially.
MCHF Davidson algorithm (cont’d)
Hints disclose only sequential access in one large file.
OSF/1’s aggressive readahead performs better than TIP-1.
Neither OSF/1 nor informed prefetching alone uses the 12 MB of cache buffers well.
The LRU replacement algorithm flushes all of the blocks before any of them are reused.
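The LRU behavior described here can be checked with a small simulation. This is a sketch, not code from the slides: it assumes 8 KB blocks (so the 17 MB matrix is 2176 blocks) and a 1536-buffer cache, and `lru_hits` is a hypothetical helper that replays repeated sequential passes through a plain LRU cache.

```python
from collections import OrderedDict


def lru_hits(num_blocks, cache_buffers, passes):
    """Count cache hits for repeated sequential passes over one file under LRU."""
    cache = OrderedDict()  # block number -> True, ordered oldest to newest
    hits = 0
    for _ in range(passes):
        for block in range(num_blocks):
            if block in cache:
                hits += 1
                cache.move_to_end(block)  # mark as most recently used
            else:
                if len(cache) >= cache_buffers:
                    cache.popitem(last=False)  # evict the least recently used block
                cache[block] = True
    return hits


# Assumed sizes: 17 MB file in 8 KB blocks, 12 MB cache of 1536 buffers.
matrix_blocks = 17 * 1024 // 8
print(lru_hits(matrix_blocks, cache_buffers=1536, passes=3))
```

Because each block is re-read only after 2176 other accesses and the cache holds 1536 blocks, every block is evicted before its next use, so the hit count stays at zero no matter how many passes are made.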
Informed caching
Goal: allocate cache buffers to minimize application elapsed time
Approach: estimate the impact on execution time of alternative buffer allocations and then choose the best allocation
3 broad uses for each buffer:
  Caching recently used data in the traditional LRU queue
  Prefetching data according to hints
  Caching data that a predictor indicates will be reused in the future
Three uses of cache buffers
Difficult to estimate the performance of allocations at a global level
Cost-benefit analysis
System model: from which the various cost and benefit estimates are derived
Derivations: for each component
Comparison: how to compare the estimates at a global level to find the globally least valuable buffer and the globally most beneficial consumer
System assumptions
Modern OS with a file buffer cache, running on a uniprocessor with sufficient memory to make a number of cache buffers available
Workload emphasis on read-intensive applications
All application I/O accesses request a single file block that can be read in a single disk access, and the requests are not too bursty.
System parameters are constant.
Enough parallelism, no congestion
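System model
$$T = N_{I/O}\,(T_{CPU} + T_{I/O})$$
  $T$: elapsed time
  $N_{I/O}$: number of I/O requests
  $T_{I/O}$: average time to service an I/O request
  $T_{CPU}$: average application CPU time between requests
$$T_{I/O} = \begin{cases} T_{hit} & \text{on a cache hit} \\ T_{hit} + T_{driver} + T_{disk} = T_{miss} & \text{on a cache miss} \end{cases}$$
  $T_{driver}$: overhead of allocating a buffer, queuing the request at the drive, and servicing the interrupt when the I/O completes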
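As a worked illustration of the system model, the sketch below computes the elapsed time from the model's parameters. The function names and all timing values are hypothetical (milliseconds chosen for easy arithmetic, not measurements from the slides), and averaging the hit and miss times by a single hit ratio is an assumption made for this example.

```python
def miss_time(t_hit, t_driver, t_disk):
    """T_miss = T_hit + T_driver + T_disk."""
    return t_hit + t_driver + t_disk


def elapsed_time(n_io, t_cpu, hit_ratio, t_hit, t_driver, t_disk):
    """T = N_io * (T_cpu + T_io), with T_io averaged over hits and misses."""
    t_miss = miss_time(t_hit, t_driver, t_disk)
    t_io = hit_ratio * t_hit + (1.0 - hit_ratio) * t_miss
    return n_io * (t_cpu + t_io)


# Illustrative values only (milliseconds).
total = elapsed_time(n_io=1000, t_cpu=1.0, hit_ratio=0.5,
                     t_hit=0.25, t_driver=0.5, t_disk=15.0)
print(total)
```

With these numbers the average service time is 8 ms per request, so 1000 requests with 1 ms of CPU time between them take 9000 ms in total.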
Cost of deallocating an LRU buffer
Given $H(n)$, the cache-hit ratio with $n$ buffers:
$$T_{LRU}(n) = H(n)\,T_{hit} + (1 - H(n))\,T_{miss}$$
$$\Delta T_{LRU}(n) = T_{LRU}(n-1) - T_{LRU}(n) = H'(n)\,(T_{miss} - T_{hit})$$
$$H'(n) = H(n) - H(n-1)$$
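The cost of shrinking the LRU cache by one buffer can be sketched as follows, taking the marginal hit ratio to be H(n) - H(n-1). The hit-ratio curve and the timing values are made up for illustration, and `lru_cost` is a hypothetical name.

```python
def lru_cost(hit_ratio, n, t_hit, t_miss):
    """Increase in average I/O service time when the LRU cache shrinks from n to n-1 buffers."""
    marginal_hit_ratio = hit_ratio[n] - hit_ratio[n - 1]
    return marginal_hit_ratio * (t_miss - t_hit)


# Hypothetical hit-ratio curve: hit_ratio[n] is H(n) for n = 0..4 buffers.
hit_ratio = [0.0, 0.5, 0.75, 0.875, 0.9375]
costs = [lru_cost(hit_ratio, n, t_hit=0.25, t_miss=15.75)
         for n in range(1, len(hit_ratio))]
print(costs)
```

For a curve with diminishing returns like this one, each additional buffer is worth less than the previous one, which is why the buffer at the tail of the LRU queue is the cheapest to give up.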
The benefit of prefetching
Prefetching a block can mask some of the latency of a disk read; $T_{disk}$ is the upper bound on the benefit of prefetching a block.
If the prefetch can be delayed and still complete before it is needed, we consider there to be no benefit from starting the prefetch now.
[Figure: timeline of x requests between the prefetch and the access that consumes the block, compared with $T_{disk}$]
The benefit of prefetching the block that will be needed $x$ accesses ahead:
$$B_x = T_{disk} - x\,(T_{CPU} + T_{hit} + T_{driver})$$
Assume $T_{CPU} = 0$ and $T_{driver} = 0$:
$$B_x = T_{disk} - x\,T_{hit}, \qquad P = T_{disk} / T_{hit} \quad \text{(prefetch horizon)}$$
There is no benefit from prefetching further ahead than $P$.
The prefetch horizon
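A minimal sketch of the prefetch benefit and the prefetch horizon follows. It uses the general per-access time T_CPU + T_hit + T_driver and obtains the horizon as the distance at which the benefit reaches zero; clamping the benefit at zero beyond the horizon, the function names, and the timing values are assumptions for illustration.

```python
def prefetch_horizon(t_disk, t_cpu, t_hit, t_driver):
    """Distance P (in accesses) beyond which starting a prefetch now gains nothing."""
    return t_disk / (t_cpu + t_hit + t_driver)


def prefetch_benefit(x, t_disk, t_cpu, t_hit, t_driver):
    """Stall time saved by prefetching now a block needed x accesses ahead."""
    return max(0.0, t_disk - x * (t_cpu + t_hit + t_driver))


# Illustrative values (milliseconds) with T_cpu = T_driver = 0, as on the slide.
times = dict(t_disk=15.0, t_cpu=0.0, t_hit=0.25, t_driver=0.0)
print(prefetch_horizon(**times))
for x in (1, 30, 60, 90):
    print(x, prefetch_benefit(x, **times))
```

With a 15 ms disk access and 0.25 ms per cache hit, the horizon is 60 accesses: a block needed 90 accesses from now can be fetched later without causing any stall.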
Comparison of LRU cost to prefetching benefit
Shared resource: cache buffers
Common currency: T/access = T/buffer
$r_h$: rate of hinted accesses
$r_d$: rate of unhinted demand accesses
A buffer should be reallocated from the LRU cache for prefetching when
$$r_h\,\frac{B_x}{x} > r_d\,H'(n)\,(T_{miss} - T_{hit})$$
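The reallocation test can be sketched as a single comparison. This assumes the rule weighs the prefetch benefit per buffer-access, scaled by the hinted access rate, against the LRU cost, scaled by the unhinted access rate; the function name and the numbers are hypothetical.

```python
def should_reallocate_for_prefetch(r_h, benefit_x, x, r_d, marginal_hit_ratio,
                                   t_miss, t_hit):
    """True if moving one buffer from the LRU cache to prefetching is expected to pay off."""
    prefetch_gain = r_h * benefit_x / x  # benefit spread over the x accesses the buffer is held
    lru_loss = r_d * marginal_hit_ratio * (t_miss - t_hit)  # cost of shrinking the LRU cache
    return prefetch_gain > lru_loss


# Illustrative values: equal access rates, marginal hit ratio of 0.25.
near = should_reallocate_for_prefetch(1.0, 14.75, 1, 1.0, 0.25, 15.75, 0.25)
far = should_reallocate_for_prefetch(1.0, 2.5, 50, 1.0, 0.25, 15.75, 0.25)
print(near, far)
```

In this example a prefetch for the very next access easily beats the LRU cost, while a prefetch 50 accesses ahead ties up a buffer for too long to be worth taking one from the LRU cache.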
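The cost of flushing a hinted block
When should we flush a hinted block?
[Figure: timeline from the flush to the hinted access, y accesses later; the block is prefetched back P accesses before the hinted access, leaving the buffer free for y - P accesses]
Cost:
$$T_{flush} = \begin{cases} \dfrac{T_{driver}}{y - P} & \text{when } y > P \\ T_{driver} + B_y & \text{when } 1 \le y \le P \end{cases}$$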
Putting it all together: global min-max
3 estimates: $T_{LRU}$, $B_x$, $T_{flush}$
Which block should be replaced when a buffer is needed for prefetching or to service a demand request? The globally least valuable block in the cache.
Should a cache buffer be used to prefetch data now? Prefetch if the expected benefit is greater than the expected cost of flushing or stealing the least valuable block.
Separate estimators for the LRU cache and for each independent stream of hints
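Value estimators
LRU cache: block in the i-th position of the LRU queue
$$\text{Value} = H'(i)\,(T_{miss} - T_{hit})$$
Hint estimators: block whose hinted access is y accesses ahead
$$\text{Value} = \begin{cases} \dfrac{T_{driver}}{y - P} & \text{when } y > P \\ T_{driver} + B_y & \text{when } 1 \le y \le P \end{cases}$$
Global value = max(value_LRU, value_hint)
Globally least valuable block = min(global value)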
A global min-max valuation of blocks
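The min-max valuation can be sketched in a few lines: each block's global value is the maximum of the values assigned by the estimators that know about it, and the replacement victim is the block with the minimum global value. The block names, the values, and the dictionary layout are hypothetical.

```python
def global_values(estimates):
    """Combine per-estimator values: a block is worth its highest estimate."""
    combined = {}
    for values in estimates.values():
        for block, value in values.items():
            combined[block] = max(value, combined.get(block, value))
    return combined


def least_valuable_block(estimates):
    """Pick the replacement victim: the block with the lowest global value."""
    combined = global_values(estimates)
    return min(combined, key=combined.get)


# Made-up values from an LRU estimator and one hint estimator.
estimates = {
    "lru": {"a": 0.4, "b": 0.1, "c": 0.05},
    "hints": {"b": 2.0, "c": 0.3, "d": 0.02},
}
print(global_values(estimates))
print(least_valuable_block(estimates))
```

Block "b" looks nearly worthless to the LRU estimator but is protected by its hint, so the victim is "d", which no estimator values highly.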
Informed caching example: MRU
The informed cache manager discovers MRU caching without being specifically coded to implement this policy.
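The MRU behavior can be illustrated for a hinted, repeated sequential scan. The sketch assumes a hinted block's value is T_driver / (y - P) when its next access is y > P accesses away and T_driver plus the prefetch benefit otherwise; `t_step` stands for T_CPU + T_hit + T_driver, and all names and numbers are hypothetical.

```python
def hinted_block_value(y, t_disk, t_step, t_driver):
    """Value of keeping a hinted block whose next access is y accesses away."""
    horizon = t_disk / t_step
    if y > horizon:
        return t_driver / (y - horizon)  # driver overhead spread over the freed accesses
    return t_driver + (t_disk - y * t_step)  # flushing would also cost a stall


def least_valuable_hinted_block(cached_blocks, next_block, file_blocks,
                                t_disk, t_step, t_driver):
    """Choose the cached block with the lowest value during a cyclic sequential scan."""
    def accesses_until_read(block):
        return (block - next_block) % file_blocks + 1

    return min(cached_blocks,
               key=lambda b: hinted_block_value(accesses_until_read(b),
                                                t_disk, t_step, t_driver))


# Blocks 40..59 were just read in order; block 60 is next in a 100-block file.
victim = least_valuable_hinted_block(range(40, 60), next_block=60, file_blocks=100,
                                     t_disk=15.0, t_step=0.75, t_driver=0.5)
print(victim)
```

The victim is block 59, the most recently used one, because in a cyclic scan it is the block whose next access lies furthest in the future. No MRU rule appears anywhere in the code; the choice falls out of the valuation.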
Implementation of informed caching and prefetch
Implementation of informed caching and prefetch(cont’d)
Performance improvement by informed caching
Balance contention
Future work
Richer hint languages to disclose future accesses
Strategies for dealing with imprecise but still useful hints
Cost-benefit model adapted to non-uniform bandwidths
Extensibility, e.g., a VM estimator to track VM pages