Foundation Usenix08 Talk

  • 8/4/2019 Foundation Usenix08 Talk

    1/39

    Fast, Inexpensive Content-Addressed Storage in Foundation

    Sean Rhea* (Meraki, Inc.); Russ Cox and Alex Pesterev* (MIT CSAIL)

    *Work done while at Intel Research, Berkeley.

    As a community, we're not bad at storing important data over the long term.

    We've only just begun to think about how we'll interpret that data 30 years from now.

    For Example

    Viewing an old PowerPoint presentation

    Do we still have PowerPoint at all? And Windows?

    Does the presentation use non-standard fonts/codecs?

    Has some newer application overwritten a shared library with an incompatible version (DLL Hell)?

    Not just a Microsoft problem: consider a web page

    Even current IE/Safari/Firefox don't agree on formatting

    All kinds of plugins necessary: sound, video, Flash

    The Foundation Idea

    Make daily backups of entire software stack

    Archives user's applications, OS, and configuration state

    Don't worry about identifying dependencies

    Just save it all: every byte, every night

    To recover an obscure file, boot the relevant stack in an emulator

    View file with the application that created it

    Foundation FAQ

    Why preserve the entire disk?

    Preserve software stack dependencies: preserve the data with the right application, libraries, and operating system as a single unit

    Works for all applications, not just ones designed for preservation

    Why daily images?

    Want to preserve machine state as close as possible to last write of user's data (i.e., preserve image before something changes)

    Also allows recovery from user errors

    Why emulate hardware?

    Much better track record than emulating software

    Software example: OpenOffice emulating Microsoft Word (yikes)

    Hardware emulators available today for Amiga, PDP-11, Nintendo

    I would love to give a talk about why Foundation is a great solution to the digital preservation problem.

    Really, though, I think it's just a pretty good start.

    Instead, I'm going to talk about a fun problem we had to solve to make it work.

    Every Byte, Every Night? Indefinitely? Really?

    Plan 9 did exactly that

    Archive changed blocks every night to optical jukebox

    Found that storage capacity grew faster than usage

    Later with Content-Addressable Storage (Venti)

    Automatically coalesces duplicate data to save space

    Required multiple, high-speed disks for performance

    Challenge for Foundation: provide similar storage efficiency on consumer hardware

    Time Machine model: one external USB drive

    Talk Outline

    Introduction

    What is Foundation?

    Review of Content-Addressed Storage (Venti)

    Contributions

    Making Cheap Content-Addressed Storage Fast

    Avoiding Concerns over Hash Collisions

    Related Work

    Conclusions

    Venti Review

    Plan 9 file system was two-level

    Spinning storage, mostly a normal file system

    Archival storage, optical write-once jukebox

    Venti replaced optical jukebox

    Still write-once

    Chunks of data named by their SHA-1 hashes: Content-Addressable Storage (CAS)

    Automatically coalesces duplicate writes
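
To make the Venti model concrete, here is a minimal in-memory sketch of content-addressed storage: a write-once data log plus an index from SHA-1 hash to log offset. The class and method names (`ContentAddressedStore`, `put`, `get`) are hypothetical and chosen for illustration; real Venti keeps both structures on disk.

```python
import hashlib


class ContentAddressedStore:
    """Toy CAS: an append-only data log plus a hash -> offset index."""

    def __init__(self):
        self.log = []    # append-only data log; the list position is the offset
        self.index = {}  # SHA-1 digest -> offset in the log

    def put(self, block: bytes) -> bytes:
        """Store a block and return its SHA-1 name; duplicates are coalesced."""
        name = hashlib.sha1(block).digest()
        if name not in self.index:      # "seen it before?"
            self.index[name] = len(self.log)
            self.log.append(block)      # write-once: entries are never overwritten
        return name

    def get(self, name: bytes) -> bytes:
        """Map a hash to its log offset and read the block back."""
        return self.log[self.index[name]]


store = ContentAddressedStore()
a = store.put(b"hello")
b = store.put(b"world")
c = store.put(b"hello")   # duplicate write: same name, no new log entry
```

Writing the same contents twice returns the same name and leaves the log at two entries, which is the coalescing behavior the slide describes.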

    Venti Review

    [Diagram: the archival process. It reads blocks from the user's hard drive and computes each block's hash h( ) in RAM, then checks the hash-to-offset index on the external USB drive ("seen it before?"). For a new block it appends the block to the data log, updates the index, and appends the hash to the summary. The 4th block read is a duplicate, so only its hash is appended to the summary: no log write!]

    Venti Review

    [Diagram: the restore process after a crash. For each hash in the summary: look up the hash of the block, map the hash to a log offset using the index, read the block from the data log, and restore the block to the user's hard drive.]

    Final step (not shown): archive summary in data log as well

    Notes on Venti

    The Good News: CAS stores each block with particular contents only once

    Changing any one block and re-archiving uses only one more block in archive

    Adding a duplicate file from a different source uses no additional storage

    The Bad News: Synchronous, random reads to on-disk index

    Venti Review

    [Diagram: the archival process reads the 4th block and asks whether it has been seen before. To answer, it has to seek to the right bucket of the hash-to-offset index on the external USB drive.]

    Venti Review

    [Diagram: the restore process looks up the hash of the 1st block and maps the hash to a log offset. Again, it has to seek to the right bucket of the index.]

    Notes on Venti

    The Good News: CAS stores each block with particular contents only once

    Changing any one block and re-archiving uses only one more block in archive

    Adding a duplicate file from a different source uses no additional storage

    The Bad News: Synchronous, random reads to on-disk index

    Best case, one-disk performance for 512-byte blocks: one 5 ms seek per 512 bytes archived = 100 kB/s

    That's 12 days to archive a 100 GB disk!

    Larger blocks give better throughput, less sharing
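
The arithmetic behind these figures can be checked with a short calculation. This is a back-of-the-envelope sketch using the numbers from the slide (5 ms per seek, 512-byte blocks, a 100 GB disk); the function names are hypothetical, and the 8 KB block size is only an example of "larger blocks".

```python
SEEK_TIME_S = 0.005      # one 5 ms seek per index lookup (from the slide)
DISK_BYTES = 100e9       # 100 GB disk (from the slide)


def archive_rate(block_bytes: int) -> float:
    """Bytes archived per second when every block costs one index seek."""
    return block_bytes / SEEK_TIME_S


def days_to_archive(block_bytes: int) -> float:
    """Days needed to archive the whole disk at that rate."""
    return DISK_BYTES / archive_rate(block_bytes) / 86400


rate_512 = archive_rate(512)        # roughly 100 kB/s
days_512 = days_to_archive(512)     # between 11 and 12 days
days_8k = days_to_archive(8192)     # 16x larger blocks: under one day
```

Larger blocks raise throughput in direct proportion to block size, but a single changed byte then dirties a larger block, which is the "less sharing" trade-off noted above.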

    Notes on Venti (cont.)

    Venti's solution: use 8 high-speed disks for index

    Untenable in consumer space

    Wears disks out pretty quickly, too

    The compare-by-hash controversy:

    Fear of hash collisions: two different blocks with same hash breaks Venti

    May be very unlikely, but cost (data corruption) is huge

    Does CAS really require a cryptographically strong hash?

    Making Inexpensive CAS Fast

    The problem: disk seeks

    Secure hash randomizes an otherwise sequential disk-to-disk transfer

    To reduce seeks, must reduce hash table lookups

    When do hash table lookups occur?

    1. When writing data, to determine if we've seen it before

    2. When writing data, to update the index

    3. When reading data, to map hashes to disk locations

    2. Updating the Index

    After appending a block to the data log, must update the index

    Pseudorandom hash causes a seek

    Updating the Index

    [Diagram: the archival process reads the 2nd block, appends it to the data log, and updates the index. To update the index, it has to seek to the right bucket.]

    2. Updating the Index

    After appending a block to the data log, must update the index

    Pseudorandom hash causes a seek

    Easy to fix: use a write-back index cache

    Store index writes in memory

    Flush to disk sequentially in large batches

    On crash, reconstruct index from the data log
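
A minimal sketch of the write-back idea, assuming a dictionary stands in for the on-disk index. The names (`WriteBackIndex`, `batch_size`, `rebuild`) are hypothetical, and flushing in sorted hash order is an assumption about what "sequentially" means here, not a detail stated in the talk.

```python
import hashlib


class WriteBackIndex:
    """Toy index: updates are buffered in RAM and flushed in sorted batches."""

    def __init__(self, batch_size: int = 4):
        self.on_disk = {}    # stands in for the on-disk hash -> offset index
        self.pending = {}    # write-back cache held in RAM
        self.batch_size = batch_size
        self.flushes = 0

    def update(self, name: bytes, offset: int) -> None:
        self.pending[name] = offset
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Writing in hash (bucket) order turns many random writes into one sweep.
        for name in sorted(self.pending):
            self.on_disk[name] = self.pending[name]
        self.pending.clear()
        self.flushes += 1

    def lookup(self, name: bytes):
        if name in self.pending:
            return self.pending[name]
        return self.on_disk.get(name)

    @classmethod
    def rebuild(cls, log_hashes):
        """Crash recovery: rescan the data log to reconstruct the index."""
        index = cls()
        for offset, name in enumerate(log_hashes):
            index.on_disk.setdefault(name, offset)
        return index


names = [hashlib.sha1(b).digest() for b in (b"a", b"b", b"c", b"d", b"e")]
idx = WriteBackIndex(batch_size=4)
for off, n in enumerate(names):
    idx.update(n, off)

# Simulate a crash that loses the RAM cache, then rebuild from the log.
recovered = WriteBackIndex.rebuild(names)
```

The unflushed fifth entry would be lost in a crash, but because the data log is the source of truth, the rebuilt index still maps every hash to its offset.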

    3. Mapping Hashes to Disk Locations During Reads

    To restore disk

    Start with the list of original blocks' hashes

    Look up each block in index

    Read block from data log and restore to disk

    [Diagram: the restore process looks up the hash of the 1st block and maps the hash to a log offset, which requires a seek to the right bucket of the index.]

    3. Mapping Hashes to Disk Locations During Reads

    To restore disk

    Start with the list of original blocks' hashes

    Look up each block in index

    Read block from data log and restore to disk

    Observation: data log is mostly ordered

    Duplicate blocks often occur as part of duplicate files

    Idea: add another index, ordered by log offset

    Read-ahead in this index to eliminate future lookups in original index
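
The read-ahead idea can be sketched as follows. This is an illustrative model, not Foundation's implementation: `PrefetchingIndex`, `readahead`, and the `primary_lookups` counter are hypothetical names, short strings stand in for hashes, and the prefetch cache is left unbounded for simplicity.

```python
class PrefetchingIndex:
    """Primary index (hash -> offset) plus a secondary index ordered by offset."""

    def __init__(self, log_hashes, readahead: int = 4):
        self.by_offset = list(log_hashes)    # secondary index: offset -> hash
        self.by_hash = {}                    # primary index: hash -> offset
        for off, h in enumerate(self.by_offset):
            self.by_hash.setdefault(h, off)
        self.readahead = readahead
        self.cache = {}                      # prefetched hash -> offset, in RAM
        self.primary_lookups = 0             # each one models a seek to a bucket

    def lookup(self, h):
        if h in self.cache:
            return self.cache[h]             # answered from RAM: no index seek
        self.primary_lookups += 1
        off = self.by_hash[h]
        # Read ahead in the offset-ordered index to cover the next few blocks.
        end = min(off + 1 + self.readahead, len(self.by_offset))
        for nxt in range(off + 1, end):
            self.cache.setdefault(self.by_offset[nxt], nxt)
        return off


log_hashes = ["h0", "h1", "h2", "h3", "h4", "h5", "h6", "h7"]
idx = PrefetchingIndex(log_hashes, readahead=3)
offsets = [idx.lookup(h) for h in log_hashes]   # restore in log order
```

Restoring eight blocks in log order with a read-ahead of three touches the primary index only twice; the other six lookups are served from the prefetched entries.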

    Index by Offset

    [Diagram: restoring with a new, secondary index sorted by log offset. For the 1st block, the restore process looks up the hash and maps it to a log offset in the primary index (seek!), prefetches the hashes for the next few offsets from the secondary index (seek!), then reads the block from the log (seek!) and restores it. For the 2nd block, it finds the log offset among the prefetched secondary-index entries (no seek!) and reads the block from the log (no seek!).]

    1. Is a Block New, or Duplicate?

    Optimization for reads also helps duplicate writes

    Index misses on first duplicate block

    Hits on subsequent blocks rewritten in same order

    Doesn't help for new data

    Every lookup in primary index fails

    Still suffer a seek for every new block

    1. Is a Block New, or Duplicate?

    Idea: use a Bloom filter to identify new blocks

    Lossy representation of the primary index

    Uses much less memory than index itself

    For any given block, Bloom filter tells us:

    It's definitely new: append to log, update index

    It might be a duplicate: look up in index

    If it really is a duplicate, we get the prefetch benefit

    Otherwise, called a false positive

    Using enough memory keeps false positives at ~1%
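
A small sketch of how a Bloom filter can gate index lookups on the write path. The filter parameters, the salted-SHA-1 bit positions, and the helper `archive` are assumptions made for illustration; the talk does not specify Foundation's actual filter design.

```python
import hashlib


class BloomFilter:
    """Lossy set membership: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-1 with a counter (arbitrary choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def archive(block: bytes, bloom: BloomFilter, index: dict, log: list) -> bool:
    """Return True if the block was new and was appended to the log."""
    name = hashlib.sha1(block).digest()
    # Only consult the (expensive) index when the filter says "might be duplicate".
    if bloom.might_contain(name) and name in index:
        return False                     # real duplicate: nothing to write
    index[name] = len(log)               # definitely new, or a false positive
    log.append(block)
    bloom.add(name)
    return True


bloom, index, log = BloomFilter(), {}, []
results = [archive(b, bloom, index, log) for b in (b"x", b"y", b"x")]
```

When the filter answers "definitely new", the index lookup (and its seek) is skipped entirely; a false positive costs one wasted lookup but never loses data.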

    Results

    Do these optimizations pay off?

    Buffering index writes is an obvious win

    Bloom filter is, too: removes 99% of seeks when writing new data

    Both trade RAM for seeks

    Benefit of secondary index less clear

    If duplicate data comes in long sequences, it reduces index seeks to two per sequence

    If duplicate data comes in little fragments, it doubles the number of index seeks

    Need traces of real data to answer this question
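
The trade-off for the secondary index can be expressed as a simple seek count. This is a simplified model, assuming the read-ahead covers each run of duplicate blocks completely; the function name and the example run lengths are hypothetical.

```python
def index_seeks(run_lengths, use_secondary_index: bool) -> int:
    """Count index seeks for duplicate data arriving in runs of the given lengths."""
    if use_secondary_index:
        # One seek in the primary index plus one in the secondary index per run.
        return 2 * len(run_lengths)
    # Without read-ahead, every duplicate block costs one primary-index seek.
    return sum(run_lengths)


long_runs = [1000, 1000]     # 2000 duplicate blocks in two long sequences
fragments = [1] * 2000       # the same 2000 blocks as isolated fragments

long_without = index_seeks(long_runs, False)
long_with = index_seeks(long_runs, True)
frag_without = index_seeks(fragments, False)
frag_with = index_seeks(fragments, True)
```

Under this model, long sequences drop from one seek per block to two per sequence, while single-block fragments pay twice as many seeks as before, which is why real traces are needed to decide.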

    Results (cont.)

    Research group at MIT has been running Venti as its backup server for two years

    We looked at 400 nightly snapshots

    Simulated archiving and restoring these in both Venti and Foundation

                               Venti       Foundation
    Average archival speed     < 1 MB/s    20.1 MB/s
    % time spent seeking       96%         10%
    Average restore speed      1.2 MB/s    13.6 MB/s
    % time spent seeking       95%         58%

    Eliminating Compare by Hash

    Some worried that same SHA-1 doesn't imply same contents (i.e., hash collisions are possible)

    Even if very rare, consequences (corruption) too great

    Stepping back a bit, CAS as a black box:

    Give it a data block, get back an opaque ID

    Give it an opaque ID, get back the data block

    Do we care that the ID is a SHA-1 hash?

    What if the opaque ID was just the block's location in the data log?
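
One way to picture the black-box interface with location IDs is the sketch below. It assumes the hash is kept only as a hint for finding candidate duplicates, which are then confirmed by comparing bytes; the class `ByValueStore`, the `hints` table, and the `hash_fn` parameter are hypothetical and not taken from the talk.

```python
import hashlib


class ByValueStore:
    """Toy CAS whose opaque ID is the block's offset in the data log."""

    def __init__(self):
        self.log = []      # data log; the opaque ID is an offset into this list
        self.hints = {}    # hash -> candidate offsets (a hint, never trusted alone)

    def put(self, block: bytes,
            hash_fn=lambda b: hashlib.sha1(b).digest()) -> int:
        key = hash_fn(block)
        for offset in self.hints.get(key, []):
            if self.log[offset] == block:    # byte-by-byte comparison
                return offset                # confirmed duplicate
        self.log.append(block)               # new contents, even if the hash collided
        offset = len(self.log) - 1
        self.hints.setdefault(key, []).append(offset)
        return offset

    def get(self, offset: int) -> bytes:
        return self.log[offset]


def weak_hash(b: bytes) -> bytes:
    """Deliberately weak hash (first byte only) to force a collision."""
    return b[:1]


s = ByValueStore()
id_apple = s.put(b"apple", weak_hash)
id_avocado = s.put(b"avocado", weak_hash)    # collides with "apple", stored separately
id_again = s.put(b"apple", weak_hash)        # true duplicate, coalesced
```

Even with a hash that collides on purpose, the two different blocks get different IDs and read back intact, so correctness no longer depends on the hash being collision-free.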

    2nd Disk Arm to the Rescue

    Once we eliminate most index reads (via our previous optimizations), the backup disk is otherwise idle while backing up duplicate data

    Can instead put it to work doing byte-by-byte comparisons of suspected duplicates

                            Foundation     Foundation
                Venti       (By Hash)      (By Value)
    Archival    < 1 MB/s    20.1 MB/s      15.4 MB/s
    Restore     1.2 MB/s    13.6 MB/s      15.0 MB/s

    Conclusions

    Consumer-grade CAS works now

    A single, external USB drive is enough

    Just have to be crafty about avoiding seeks

    Lots of uses other than preservation

    E.g., inexpensive household backup server that automatically coalesces duplicate media collections

    Doesn't require a collision-free hash function