Pastiche: Making Backup Cheap and Easy
Presented by: Boon Thau Loo, CS294-4
Outline
Motivation and Goals
Enabling Technologies
System Design
Implementation and Evaluation
Conclusion
Motivation
The majority of users do not back up their data. Those who do don't back up very often, and don't back up everything. Backup is a significant cost in large organizations.
Why not use excess disk space for backups? File systems are only half-full on average. Disks are cheap.
Pastiche Goals
A P2P backup system. Target environment: cooperation among untrusted, end-user machines.
Leverage common data when possible for space efficiency (backup "buddies").
Preserve privacy.
Efficient, cost-free, administration-free.
Enabling Technologies
Pastry, for self-organizing routing and object location.
Content-based indexing (Manber '94, LBFS): Rabin fingerprinting identifies boundary regions (anchors) that divide a file into chunks, so that a change is isolated to the chunks it touches. Each chunk is identified by its SHA-1 hash.
Convergent encryption (used by FARSITE): encrypt a file using a key derived from the file's contents, so identical data produces identical ciphertext across hosts. The key is further encrypted using the client's key; in FARSITE, the encrypted key is stored with the file. A sketch of both techniques appears below.
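As a concrete illustration, here is a minimal sketch of content-defined chunking and convergent key derivation. It is a sketch under stated assumptions: a Rabin-Karp rolling hash stands in for true Rabin fingerprints, the window and mask parameters are made up rather than Pastiche's actual values, and encryption itself is left abstract.

```python
import hashlib

MOD = 1 << 32       # fingerprints kept in 32 bits
BASE = 257          # rolling-hash base (illustrative)
WINDOW = 48         # bytes in the sliding window (illustrative)
MASK = 0x1FFF       # 13 low bits -> an anchor every ~8 KB on average

def chunk_file(data: bytes):
    """Split data at content-defined anchors: positions where the rolling
    fingerprint of the last WINDOW bytes has all MASK bits set."""
    out_factor = pow(BASE, WINDOW, MOD)   # weight of the byte leaving the window
    fp, start, chunks = 0, 0, []
    for i, b in enumerate(data):
        fp = (fp * BASE + b) % MOD
        if i >= WINDOW:
            fp = (fp - data[i - WINDOW] * out_factor) % MOD
        if i + 1 - start >= WINDOW and (fp & MASK) == MASK:
            chunks.append(data[start:i + 1])   # anchor found: close the chunk
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks

def convergent_key(chunk: bytes) -> bytes:
    """Convergent encryption: the key is derived from the chunk's contents,
    so identical chunks encrypt identically on every host."""
    return hashlib.sha1(chunk).digest()
```

Because two hosts holding identical data derive identical chunks and keys, their ciphertexts match as well, which is what lets backup buddies share storage without seeing each other's plaintext.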
System Design
Data chunks, file meta-data, abstracts, joining Pastry, finding backup buddies, the backup protocol, restoration, failures and malicious nodes, and greed prevention.
Data Chunks
Data is stored on disk as immutable chunks, using content-based indexing plus convergent encryption. Chunks are stored for the local host and/or on behalf of backup clients. Each chunk carries an owner list and maintains a reference count.
When a newly written file is closed, it is scheduled for chunking. Chunking yields, for each chunk c:
Hc – handle
Ic – chunk ID
Kc – encryption key
The list of chunk IDs forms the file's signature. (One plausible derivation of these names is sketched below.)
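The following is one plausible derivation of the three per-chunk names, consistent with convergent encryption; the exact construction in the Pastiche paper may differ in detail.

```python
import hashlib

def chunk_names(chunk: bytes):
    """One plausible (Hc, Ic, Kc) derivation; hashes are SHA-1 as in Pastiche."""
    Hc = hashlib.sha1(chunk).digest()   # handle: hash of the plaintext contents
    Kc = Hc                             # convergent key, derived from the contents
    Ic = hashlib.sha1(Hc).digest()      # chunk ID: safe to advertise, since it
                                        # reveals neither the contents nor the key
    return Hc, Ic, Kc
```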
Data Chunks (Cont.)
Backup request: remote hosts must supply a public key with the backup request. If the chunk already exists, the requesting host is added to its owner list and the local reference count is incremented.
Delete request: requests from remote hosts must be signed by the owner's secret key and are checked against the public key cached from the earlier backup request. When a chunk's reference count reaches 0, the chunk is removed. (A sketch of this bookkeeping follows.)
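A minimal sketch of this bookkeeping, with hypothetical names (ChunkStore, verify); a real implementation would check actual public-key signatures.

```python
class ChunkStore:
    """Sketch of per-chunk owner lists and reference counts (names hypothetical)."""

    def __init__(self):
        self.chunks = {}   # Ic -> encrypted chunk
        self.owners = {}   # Ic -> set of owners' public keys
        self.refs = {}     # Ic -> reference count

    def backup_request(self, ic, ciphertext, owner_pubkey):
        self.chunks.setdefault(ic, ciphertext)        # store only if new
        self.owners.setdefault(ic, set()).add(owner_pubkey)
        self.refs[ic] = self.refs.get(ic, 0) + 1

    def delete_request(self, ic, owner_pubkey, signature):
        if not verify(owner_pubkey, ic, signature):   # stub defined below
            return                                    # reject unsigned deletes
        if owner_pubkey in self.owners.get(ic, set()):
            self.owners[ic].discard(owner_pubkey)
            self.refs[ic] -= 1
        if self.refs.get(ic, 0) == 0:                 # last reference gone:
            self.chunks.pop(ic, None)                 # remove the chunk
            self.owners.pop(ic, None)
            self.refs.pop(ic, None)

def verify(pubkey, message, signature) -> bool:
    """Stub: a real implementation would verify a public-key signature."""
    return True
```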
File Meta-data
File meta-data consists of the list of handles Hc for the chunks comprising the file, plus ownership, permissions, and creation and modification times. Meta-data is mutable, with fixed Hc, Kc, and Ic. The file system's root meta-data has its Hc generated from a host-specific passphrase.
Abstracts
The initial backup of a freshly installed machine is the most expensive. Goal: find a good buddy that owns all or most of your data chunks.
Naive solution: ship the new node's full signature around. Expensive: 20 bytes (one SHA-1 hash) for every 16 KB chunk.
Solution: send a random subset of the signatures, called an abstract. (Sketched below.)
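A minimal sketch of abstract construction and coverage computation; the function names are illustrative, not Pastiche's API.

```python
import random

def make_abstract(signature, k=100):
    """Pick a random k-subset of the host's chunk IDs (its full signature)."""
    ids = list(signature)
    return random.sample(ids, min(k, len(ids)))

def coverage(abstract, local_chunk_ids):
    """Fraction of the abstract's chunk IDs this node already stores."""
    return sum(cid in local_chunk_ids for cid in abstract) / len(abstract)
```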
Joining Pastry
Pastry is a self-organizing P2P overlay. Each node maintains:
Leaf set: the L/2 closest smaller and L/2 closest larger nodeIDs.
Neighborhood set: the closest nodes according to a proximity metric.
Routing table: prefix routing.
A Pastiche node joins the Pastry overlay with its nodeID set to Hash(hostname), then finds backup buddies.
Finding Backup Buddies
After joining the network, a node routes a Pastry message carrying its abstract to a random nodeID. Each node along the route returns its coverage (the fraction of chunks in the abstract it stores locally) along with the abstract.
Lighthouse sweep: if the candidate set is too small, the probe is repeated in rotation, varying the first digit of the original nodeID. (Sketched below.)
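A sketch of the sweep, assuming hexadecimal nodeIDs and a caller-supplied route(target_id, abstract) that returns the (node, coverage) pairs observed along the Pastry route; both names are illustrative.

```python
HEX_DIGITS = "0123456789abcdef"

def lighthouse_sweep(node_id, abstract, route, want=5, min_coverage=0.5):
    """Rotate the probe through the nodeID space by varying the first digit,
    stopping once enough well-covered candidates have been found."""
    candidates = []
    for d in HEX_DIGITS:
        target = d + node_id[1:]              # vary first digit of original ID
        for node, cov in route(target, abstract):
            if cov >= min_coverage:
                candidates.append((node, cov))
        if len(candidates) >= want:           # enough candidates: stop sweeping
            break
    return sorted(candidates, key=lambda nc: nc[1], reverse=True)[:want]
```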
Not Enough Buddies?
Each node tries to find 5 buddies. What if a node can't find enough? This is a real possibility for rare installations.
Solution: create a coverage-rate Pastry overlay, replacing the network-proximity distance metric with coverage rate. The Pastry neighborhood set becomes the set of nodes with the best coverage encountered during the join, and buddies are found in that neighborhood set.
Caveats: the relation is not symmetric (A may be a buddy for B, but not vice versa), and malicious nodes may misreport their coverage.
Backup Protocol
Each Pastiche node controls its own archival plan. A snapshot is a discrete backup event. The meta-data skeleton for each snapshot is stored in per-file logs. The state necessary for a new snapshot: an add set, a delete set, and a meta-data list.
Backup Protocol (Cont.)
Snapshot process (A stores a snapshot on B):
1. A sends its public key to B (for future validation).
2. A forwards the chunk IDs of the add set to B.
3. B fetches any chunks it does not already store locally.
4. A sends the delete list, signed with A's private key.
5. A sends the updated meta-data.
6. A sends a commit request; B responds once all changes are persistent.
(A sketch of this exchange follows.)
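A sketch of the exchange from A's side; the buddy object and its methods (send_public_key, offer_chunks, store_chunk, delete_chunks, update_metadata, commit) are hypothetical RPC stubs, not Pastiche's actual interface.

```python
def take_snapshot(buddy, public_key, add_set, delete_set,
                  metadata, read_chunk, sign):
    """Run the six-step snapshot protocol against one backup buddy."""
    buddy.send_public_key(public_key)               # 1: for validating deletes
    missing = buddy.offer_chunks(sorted(add_set))   # 2: B reports what it lacks
    for ic in missing:                              # 3: ship only missing chunks
        buddy.store_chunk(ic, read_chunk(ic))
    deletes = sorted(delete_set)
    buddy.delete_chunks(deletes, sign(deletes))     # 4: signed delete list
    buddy.update_metadata(metadata)                 # 5: updated meta-data
    buddy.commit()                                  # 6: returns once persistent
```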
Restoration
Partial restores are straightforward: obtain the needed chunks from a buddy.
Recovering an entire machine: a copy of the root meta-data object is kept on each member of the leaf set. The machine rejoins with the same nodeID (derived from its hostname) and retrieves the root meta-data object from any node in its leaf set; the root block contains the list of buddies, from which everything else can be restored.
Detecting Failure and Malice
Failures: a buddy can drop chunks if it runs out of disk space, may crash or leave the network, or may maliciously pretend to store your chunks.
Solutions: before taking a new snapshot, query buddies for a random subset of chunks, which provides instantaneous assurance; and probe each buddy periodically. Analysis shows that checking 0.1% of all chunks is enough. (A sketch of such a probe follows.)
Sybil attack? A malicious party could occupy a substantial fraction of the nodeID space.
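A sketch of such a probe, assuming a hypothetical buddy.fetch_chunk() RPC and a local map from chunk ID to the expected SHA-1 digest.

```python
import hashlib
import random

def probe_buddy(buddy, my_chunks, fraction=0.001):
    """Challenge the buddy on a random 0.1% sample of our chunks.
    my_chunks maps chunk ID -> expected SHA-1 hex digest."""
    ids = list(my_chunks)
    sample = random.sample(ids, max(1, int(len(ids) * fraction)))
    for cid in sample:
        data = buddy.fetch_chunk(cid)              # hypothetical RPC
        if data is None or hashlib.sha1(data).hexdigest() != my_chunks[cid]:
            return False                           # buddy failed the challenge
    return True
```

A stronger variant would fold a fresh nonce into each challenge so a buddy cannot answer from stored digests alone.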
Greed Prevention
A greedy host can consume storage without contributing. Three solutions:
Group backup clients based on the resources they consume.
Require cryptographic puzzles proportional to the storage consumed.
Electronic currency; currency accounting requires atomicity between the exchange of currency and the backup.
Implementation
Chunkstore file system: container files serve as an LRU cache of decrypted, recently used files for performance; chunks increase internal fragmentation.
Backup daemon: the server side manages remote requests for storage and restoration; the client side supervises the selection of buddies and snapshots.
Evaluation
Comparing ext2fs with chunkstore on a modified Andrew benchmark, the total overhead of 7.4% is reasonable. The overhead comes from meta-data management and Rabin fingerprint computation (for finding anchors). Backup and restore compare favorably to an NFS cross-machine copy. Conclusion: the service does not unduly penalize file system performance.
Evaluation (Cont.)
Question: how large must the abstract be? Comparing existing machines against a freshly installed machine, abstract size does not seem to matter much.
Evaluation (Cont.)
Question: how effective is the lighthouse sweep in discovering buddies? In a simulation of 50,000 Pastiche nodes with 11 node types, the lighthouse sweep is good enough for common node types (>= 10% of the population); rare node types would require the coverage-rate overlay.
Evaluation (Cont.)
Question: how effective is the coverage-rate overlay in discovering buddies? Simulation: 10,000 nodes of 3 kinds:
One of a thousand species (nodes of the same species share 70% of their content).
One of a hundred genera (30%).
One of ten orders (20%).
Only nodes of the same species can back each other up. With a neighborhood size of 256, 85% of nodes were able to find at least one buddy, and 72% found at least 5. Neighborhood size matters!