
EECS 262a Advanced Topics in Computer Systems

Lecture 17

P2P Storage: Dynamo/Pond

October 29th, 2019

John Kubiatowicz
Electrical Engineering and Computer Sciences

University of California, Berkeley

http://www.eecs.berkeley.edu/~kubitron/cs262

10/29/2019 2cs262a-F19 Lecture-17

Reprise: Stability under churn (Tapestry)

(May 2003: 1.5 TB over 4 hours)
DOLR model generalizes to many simultaneous apps

10/29/2019 3cs262a-F19 Lecture-17

Churn (Optional Bamboo paper last time)

Authors   Systems             Observed Session Time
SGG02     Gnutella, Napster   50% < 60 minutes
CLL02     Gnutella, Napster   31% < 10 minutes
SW02      FastTrack           50% < 1 minute
BSV03     Overnet             50% < 60 minutes
GDS03     Kazaa               50% < 2.4 minutes

Chord is a “scalable protocol for lookup in a dynamic peer-to-peer system with frequent node arrivals and departures” -- Stoica et al., 2001

10/29/2019 4cs262a-F19 Lecture-17

A Simple Lookup Test
• Start up 1,000 DHT nodes on a ModelNet network
– Emulates a 10,000-node, AS-level topology
– Unlike simulations, models cross traffic and packet loss
– Unlike PlanetLab, gives reproducible results
• Churn nodes at some rate
– Poisson arrival of new nodes
– Random node departs on every new arrival
– Exponentially distributed session times
• Each node does 1 lookup every 10 seconds
– Log results, process them after the test
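To make the churn model above concrete, here is a minimal Java sketch (not the actual ModelNet/Bamboo test harness) that draws Poisson node arrivals and exponentially distributed session times and counts how many 10-second lookups each node would issue. The arrival rate and median session length are illustrative assumptions, not the values used in the experiment.

import java.util.Random;

public class ChurnModel {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double medianSessionSecs = 600.0;                 // illustrative: 10-minute median session
        double lambda = Math.log(2) / medianSessionSecs;  // rate of the exponential session law
        double arrivalRate = 1.0 / 5.0;                   // illustrative: one new node every 5 s on average

        double t = 0.0;
        for (int i = 0; i < 10; i++) {
            // Poisson process: exponential inter-arrival gaps
            t += -Math.log(1.0 - rng.nextDouble()) / arrivalRate;
            // Exponentially distributed session length for the arriving node
            double session = -Math.log(1.0 - rng.nextDouble()) / lambda;
            // Each node issues one lookup every 10 seconds while it is alive
            long lookups = (long) (session / 10.0);
            System.out.printf("node %d: joins at %.1fs, stays %.1fs, issues ~%d lookups%n",
                    i, t, session, lookups);
        }
    }
}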


10/29/2019 5cs262a-F19 Lecture-17

Early Test Results

• Tapestry had trouble under this level of stress
– Worked great in simulations, but not as well on the more realistic network
– Despite sharing almost all code between the two!
• The problem was not limited to Tapestry; consider Chord:

10/29/2019 6cs262a-F19 Lecture-17

Handling Churn in a DHT

• Forget about comparing different impls.
– Too many differing factors
– Hard to isolate effects of any one feature
• Implement all relevant features in one DHT
– Using Bamboo (similar to Pastry)
• Isolate important issues in handling churn
1. Recovering from failures
2. Routing around suspected failures
3. Proximity neighbor selection

10/29/2019 7cs262a-F19 Lecture-17

Reactive Recovery: The obvious technique
• For correctness, maintain leaf set during churn
– Also routing table, but not needed for correctness
• The basics
– Ping new nodes before adding them
– Periodically ping neighbors
– Remove nodes that don’t respond
• Simple algorithm
– After every change in leaf set, send to all neighbors
– Called reactive recovery

10/29/2019 8cs262a-F19 Lecture-17

The Problem With Reactive Recovery

• Under churn, many pings and change messages
– If bandwidth limited, they interfere with each other
– Lots of dropped pings look like failures
• Respond to failure by sending more messages
– Probability of drop goes up
– We have a positive feedback cycle (squelch)
• Can break the cycle two ways:
1. Limit the probability of “false suspicions of failure”
2. Recover periodically


10/29/2019 9cs262a-F19 Lecture-17

Periodic Recovery

• Periodically send whole leaf set to a random member

– Breaks feedback loop
– Converges in O(log N)

• Back off period on message loss

– Makes a negative feedback cycle (damping)

10/29/2019 10cs262a-F19 Lecture-17

Conclusions/Recommendations

• Avoid positive feedback cycles in recovery
– Beware of “false suspicions of failure”
– Recover periodically rather than reactively
• Route around potential failures early
– Don’t wait to conclude definite failure
– TCP-style timeouts quickest for recursive routing
– Virtual-coordinate-based timeouts not prohibitive
• PNS (proximity neighbor selection) can be cheap and effective
– Only need simple random sampling

10/29/2019 11cs262a-F19 Lecture-17

Today’s Papers
• Dynamo: Amazon’s Highly Available Key-value Store, Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Appears in Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), 2007

• Pond: the OceanStore Prototype, Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz. Appears in Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST), 2003

• Thoughts?

10/29/2019 12cs262a-F19 Lecture-17

The “Traditional” approaches to storage
• Relational Database systems

– Clustered - Traditional Enterprise RDBMS provide the ability to cluster and replicate data over multiple servers – providing reliability

» Oracle, Microsoft SQL Server and even MySQL have traditionally powered enterprise and online data clouds

– Highly Available – Provide Synchronization (“Always Consistent”), Load-Balancing and High-Availability features to provide nearly 100% Service Uptime

– Structured Querying – Allow for complex data models and structured querying – It is possible to off-load much of data processing and manipulation to the back-end database

• However, traditional RDBMS clouds are EXPENSIVE! Expensive to maintain, license, and store large amounts of data

– The service guarantees of traditional enterprise relational databases like Oracle put high overheads on the cloud

– Complex data models make the cloud more expensive to maintain, update and keep synchronized

– Load distribution often requires expensive networking equipment
– Maintaining the “elasticity” of the cloud often requires expensive upgrades to the network


10/29/2019 13cs262a-F19 Lecture-17

The Solution: Simplify
• Downgrade some of the service guarantees of traditional RDBMS
– Replace the highly complex data models with a simpler one
» Classify services based on the complexity of the data model they require
– Replace the “Always Consistent” synchronization model with an “Eventually Consistent” model
» Classify services based on how “updated” their data sets must be

• Redesign or distinguish between services that require a simpler data model and lower expectations on consistency

10/29/2019 14cs262a-F19 Lecture-17

Why Peer-to-Peer ideas for storage?
• Incremental scalability
– Add or remove nodes as necessary
» System stays online during changes
– With many other systems:
» Must add large groups of nodes at once
» System downtime during change in active set of nodes
• Low management overhead (related to first property)
– System automatically adapts as nodes die or are added
– Data automatically migrated to avoid failure or take advantage of new nodes
• Self load-balance
– Automatic partitioning of data among available nodes
– Automatic rearrangement of information or query loads to avoid hot-spots
• Not bound by commercial notions of semantics
– Can use weaker consistency when desired
– Can provide flexibility to vary semantics on a per-application basis
– Leads to higher efficiency or performance

10/29/2019 15cs262a-F19 Lecture-17

Recall: Consistent hashing [Karger 97]

[Figure: circular 160-bit ID space (“ring”) with nodes N32, N90, N105 and keys K5, K20, K80; Key 5 is stored at Node 105]

• A key is stored at its successor: the node with the next-higher ID
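As a concrete illustration of the successor rule, here is a minimal Java sketch of a consistent-hash ring built on a sorted map. SHA-1 hashing of node and key names is an illustrative choice (the figure’s N32/K5 labels denote ring positions, which hashing these strings would not literally reproduce); this is not Chord’s or Dynamo’s actual implementation.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    // Hash names onto the circular 160-bit ID space (SHA-1 is an illustrative choice).
    static BigInteger hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    // A key is stored at its successor: the node with the next-higher ID,
    // wrapping around the circle when the key hashes past the largest node ID.
    String successor(String key) {
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("N32"); ring.addNode("N90"); ring.addNode("N105");
        System.out.println("K5 is stored at " + ring.successor("K5"));
        // Removing one node only remaps the keys that node was responsible for.
        ring.removeNode("N90");
        System.out.println("K5 is stored at " + ring.successor("K5"));
    }
}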

Recall: Lookup with Leaf Set

[Figure: prefix routing from a source node toward the lookup ID through nodes matching successively longer prefixes 0…, 10…, 110…, 111…]

• Assign IDs to nodes
– Map hash values to the node with the closest ID
• Leaf set is successors and predecessors
– All that’s needed for correctness
• Routing table matches successively longer prefixes
– Allows efficient lookups
• Data replication:
– On leaf set
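A minimal sketch of the prefix-matching step described above: pick the neighbor whose ID shares the longest prefix with the lookup ID. This is generic longest-prefix next-hop selection over toy 4-bit IDs, not Bamboo’s or Pastry’s actual routing code, and it omits the leaf-set fallback used for the final hop.

import java.util.List;

public class PrefixRouting {
    // Length of the longest common prefix of two binary ID strings.
    static int sharedPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Pick the neighbor whose ID matches the lookup ID in the longest prefix;
    // a real implementation falls back to the leaf set for the last hop.
    static String nextHop(String selfId, String lookupId, List<String> neighbors) {
        String best = selfId;
        int bestLen = sharedPrefix(selfId, lookupId);
        for (String n : neighbors) {
            int len = sharedPrefix(n, lookupId);
            if (len > bestLen) { bestLen = len; best = n; }
        }
        return best; // unchanged => we are locally closest; consult the leaf set
    }

    public static void main(String[] args) {
        // IDs drawn as short binary strings, echoing the 0…, 10…, 110…, 111… figure.
        String self = "0010";
        List<String> neighbors = List.of("1000", "1100", "1110");
        System.out.println(nextHop(self, "1111", neighbors)); // routes toward 111…
    }
}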


10/29/2019 17cs262a-F19 Lecture-17

Advantages/Disadvantages of Consistent Hashing

• Advantages:
– Automatically adapts data partitioning as node membership changes
– A node given a random key value automatically “knows” how to participate in routing and data management
– Random key assignment gives an approximation to load balance
• Disadvantages
– Uneven distribution of key storage is a natural consequence of random node names; leads to uneven query load
– Key management can be expensive when nodes transiently fail
» Assuming that we immediately respond to node failure, must transfer state to the new node set
» Then when the node returns, must transfer state back
» Can be a significant cost if transient failures are common
• Disadvantages of “scalable” routing algorithms
– More than one hop to find data: O(log N) or worse
– Number of hops unpredictable and almost always > 1
» Node failure, randomness, etc.

10/29/2019 18cs262a-F19 Lecture-17

Dynamo Goals
• Scale – adding systems to the network causes minimal impact
• Symmetry – no special roles, all features in all nodes
• Decentralization – no master node(s)
• Highly available – focus on end-user experience
• Speed – a system can only be as fast as its lowest level
• Service Level Agreements – the system can be adapted to an application’s specific needs, allows flexibility

10/29/2019 19cs262a-F19 Lecture-17

Dynamo Assumptions
• Query model – simple interface exposed to the application level
– Get(), Put()
– No Delete()
– No transactions, no complex queries
• Atomicity, Consistency, Isolation, Durability
– Operations either succeed or fail, no middle ground
– System will be eventually consistent; no sacrifice of availability to assure consistency
– Conflicts can occur while updates propagate through the system
– System can still function while entire sections of the network are down
• Efficiency – measure the system by the 99.9th percentile
– Important with millions of users; 0.1% can be in the 10,000s
• Non-hostile environment – no need to authenticate queries, no malicious queries
– Behind web services, not in front of them

10/29/2019 20cs262a-F19 Lecture-17

Service Level Agreements (SLA)
• Application can deliver its functionality in a bounded time:
– Every dependency in the platform needs to deliver its functionality with even tighter bounds
• Example: a service guaranteeing that it will provide a response within 300 ms for 99.9% of its requests, for a peak client load of 500 requests per second
• Contrast to services which focus on mean response time

[Figure: service-oriented architecture of Amazon’s platform]
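To make the 99.9th-percentile point concrete, here is a small sketch comparing the mean with a nearest-rank p99.9 over synthetic latencies; the numbers are made up purely to show how a healthy-looking mean can hide an SLA-violating tail.

import java.util.Arrays;

public class TailLatency {
    // The SLA above is stated at the 99.9th percentile, not the mean: take the
    // latency that 99.9% of sampled requests beat (nearest-rank percentile).
    static double percentile(double[] latenciesMs, double pct) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Synthetic sample: 99.8% fast requests plus a slow 0.2% tail (illustrative numbers).
        double[] ms = new double[10_000];
        for (int i = 0; i < ms.length; i++) ms[i] = (i % 500 == 0) ? 450.0 : 20.0;
        System.out.println("mean  = " + Arrays.stream(ms).average().orElse(0) + " ms");
        System.out.println("p99.9 = " + percentile(ms, 99.9) + " ms");
        // A "300 ms at 99.9%" SLA fails here even though the mean is about 21 ms.
    }
}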


10/29/2019 21cs262a-F19 Lecture-17

Partitioning and Routing Algorithm
• Consistent hashing:
– The output range of a hash function is treated as a fixed circular space or “ring”
• Virtual nodes:
– Each physical node can be responsible for more than one virtual node
– Used for load balancing
• Routing: “zero-hop”
– Every node knows about every other node
– Queries can be routed directly to the root node for a given key
– Also, every node has sufficient information to route a query to all nodes that store information about that key
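To show how virtual nodes change the picture, here is a small sketch in which each physical host claims several tokens on the ring; the host names, token counts, and MD5 hashing are illustrative assumptions rather than Dynamo’s actual partitioning code.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

public class VirtualNodeRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    static BigInteger hash(String s) {
        try {
            return new BigInteger(1, MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // A physical node claims `tokens` positions on the ring; more capacity => more tokens.
    void addPhysicalNode(String node, int tokens) {
        for (int i = 0; i < tokens; i++) ring.put(hash(node + "#" + i), node);
    }

    String coordinatorFor(String key) {
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        VirtualNodeRing r = new VirtualNodeRing();
        r.addPhysicalNode("hostA", 8);   // illustrative token counts
        r.addPhysicalNode("hostB", 8);
        r.addPhysicalNode("hostC", 16);  // twice the capacity => twice the tokens
        System.out.println(r.coordinatorFor("cart:12345"));
    }
}

When hostA fails, its 8 tokens are scattered around the ring, so its load is spread over the remaining hosts rather than dumped on one successor, which is the advantage described on the next slide.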

10/29/2019 22cs262a-F19 Lecture-17

Advantages of using virtual nodes

• If a node becomes unavailable the load handled by this node is evenly dispersed across the remaining available nodes.

• When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.

• The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

10/29/2019 23cs262a-F19 Lecture-17

Replication
• Each data item is replicated at N hosts
• “Preference list”: the list of nodes responsible for storing a particular key
– Successive nodes not guaranteed to be on different physical nodes
– Thus the preference list includes physically distinct nodes
• Replicas synchronized via an anti-entropy protocol
– Use of a Merkle tree for each unique key range
– Nodes exchange the roots of trees for a shared key range
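A minimal sketch of the anti-entropy idea: each replica computes a Merkle root over one key range, the replicas compare roots, and they recurse or transfer data only on a mismatch. The SHA-256 hashing and the key/value encoding are illustrative assumptions, not Dynamo’s wire format.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class MerkleAntiEntropy {
    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Merkle root over the (sorted) keys and values of one key range.
    static byte[] rootOf(SortedMap<String, String> range) {
        List<byte[]> leaves = new ArrayList<>();
        range.forEach((k, v) ->
                leaves.add(sha256((k + "=" + v).getBytes(StandardCharsets.UTF_8))));
        if (leaves.isEmpty()) return sha256(new byte[0]);
        List<byte[]> level = leaves;
        while (level.size() > 1) {                      // pairwise-hash up to the root
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                next.add(sha256(left, right));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        SortedMap<String, String> replicaA = new TreeMap<>(), replicaB = new TreeMap<>();
        replicaA.put("k1", "v1"); replicaA.put("k2", "v2");
        replicaB.put("k1", "v1"); replicaB.put("k2", "v2-stale");
        // Replicas exchange only the roots; identical roots mean no transfer is needed.
        boolean inSync = Arrays.equals(rootOf(replicaA), rootOf(replicaB));
        System.out.println("range in sync: " + inSync);  // false here, so recurse/exchange
    }
}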

10/29/2019 24cs262a-F19 Lecture-17

Data Versioning
• A put() call may return to its caller before the update has been applied at all the replicas
• A get() call may return many versions of the same object
• Challenge: an object having distinct version sub-histories, which the system will need to reconcile in the future
• Solution: use vector clocks to capture causality between different versions of the same object


10/29/2019 25cs262a-F19 Lecture-17

Vector Clock
• A vector clock is a list of (node, counter) pairs
• Every version of every object is associated with one vector clock
• If the counters on the first object’s clock are less-than-or-equal to all of the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten
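A small sketch of the ancestor test described above; the node names Sx/Sy mirror the style of the paper’s example, and the class is an illustration rather than Dynamo’s implementation.

import java.util.HashMap;
import java.util.Map;

public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // Coordinator `node` handled this write: bump its counter.
    void increment(String node) { counters.merge(node, 1L, Long::sum); }

    // True if every counter in `this` is <= the matching counter in `other`,
    // i.e. `this` version is an ancestor of `other` and can be forgotten.
    boolean isAncestorOf(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (other.counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        VectorClock d1 = new VectorClock(); d1.increment("Sx");                      // [(Sx,1)]
        VectorClock d2 = new VectorClock(); d2.increment("Sx"); d2.increment("Sx");  // [(Sx,2)]
        VectorClock d3 = new VectorClock(); d3.increment("Sx"); d3.increment("Sy");  // [(Sx,1),(Sy,1)]
        System.out.println(d1.isAncestorOf(d2)); // true: d1 can be garbage-collected
        System.out.println(d2.isAncestorOf(d3)); // false
        System.out.println(d3.isAncestorOf(d2)); // false: d2 and d3 conflict, client reconciles
    }
}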

10/29/2019 26cs262a-F19 Lecture-17

Vector clock example

10/29/2019 27cs262a-F19 Lecture-17

Conflicts (multiversion data)
• Client must resolve conflicts
– Only resolve conflicts on reads
– Different resolution options:
» Use vector clocks to decide based on history
» Use timestamps to pick the latest version
– Examples given in the paper:
» For the shopping cart, simply merge different versions
» For a customer’s session information, use the latest version
– Stale versions returned on reads are updated (“read repair”)
• Vary N, R, W to match requirements of applications
– High-performance reads: R=1, W=N
– Fast writes with possible inconsistency: W=1
– Common configuration: N=3, R=2, W=2
• When do branches occur?
– Branches uncommon: 0.0006% of requests saw > 1 version over 24 hours
– Divergence occurs because of high write rate (more coordinators), not necessarily because of failure

10/29/2019 28 cs262a-F19 Lecture-17

Execution of get() and put() operations
• Route the request through a generic load balancer that will select a node based on load information
– Simple idea, keeps functionality within Dynamo
• Or use a partition-aware client library that routes requests directly to the appropriate coordinator nodes
– Requires the client to participate in the protocol
– Much higher performance


10/29/2019 29cs262a-F19 Lecture-17

Sloppy Quorum
• R/W is the minimum number of nodes that must participate in a successful read/write operation
• Setting R + W > N yields a quorum-like system
• In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas; for this reason, R and W are usually configured to be less than N, to provide better latency
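A toy sketch of the R/W quorum rule above: it only counts acknowledgements and checks R + W against N, ignoring real concerns such as timeouts, versioning, and read repair.

import java.util.List;

public class QuorumConfig {
    final int n, r, w;

    QuorumConfig(int n, int r, int w) {
        if (r + w <= n) System.out.println("warning: R + W <= N, reads may miss the latest write");
        this.n = n; this.r = r; this.w = w;
    }

    // A write succeeds once at least W of the N preference-list replicas ack it.
    boolean writeSucceeds(List<Boolean> acks) {
        return acks.stream().filter(a -> a).count() >= w;
    }

    // A read returns once at least R replicas respond; latency tracks the R-th fastest.
    boolean readSucceeds(List<Boolean> responses) {
        return responses.stream().filter(a -> a).count() >= r;
    }

    public static void main(String[] args) {
        QuorumConfig common = new QuorumConfig(3, 2, 2);    // the common (3,2,2) configuration
        System.out.println(common.writeSucceeds(List.of(true, true, false)));  // true: 2 of 3 acked
        System.out.println(common.readSucceeds(List.of(true, false, false)));  // false: only 1 of 3
    }
}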

10/29/2019 30cs262a-F19 Lecture-17

Hinted handoff

• Assume N = 3. When B is temporarily down or unreachable during a write, send replica to E

• E is hinted that the replica belongs to B and it will deliver to B when B is recovered.

• Again: “always writeable”
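A minimal sketch of the handoff decision described above. The preference list A–E and the choice of fallback node are illustrative (the slide’s figure happens to hand B’s replica to E), and real Dynamo additionally stores and replays the hinted replicas once the intended owner recovers.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HintedHandoff {
    // Decide where N replicas of a write actually land, given which nodes are down.
    // Returns a map from the node that stores the copy to the node it was intended
    // for (null when it landed on its intended owner).
    static Map<String, String> placeReplicas(List<String> preferenceList,
                                             Set<String> downNodes, int n) {
        Map<String, String> placement = new LinkedHashMap<>();
        List<String> intended = preferenceList.subList(0, n);
        int fallback = n;                        // next candidate beyond the first N
        for (String owner : intended) {
            if (!downNodes.contains(owner)) {
                placement.put(owner, null);      // normal case: stored on its owner
            } else {
                // Skip down nodes; hand the replica to the next healthy node with a
                // hint recording the intended owner, to be delivered on recovery.
                while (fallback < preferenceList.size()
                        && downNodes.contains(preferenceList.get(fallback))) fallback++;
                if (fallback < preferenceList.size())
                    placement.put(preferenceList.get(fallback++), owner);
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> prefList = List.of("A", "B", "C", "D", "E");
        // N = 3 and B is down, as in the slide: its replica is handed to the next
        // healthy node beyond the first three, tagged with the hint "B".
        System.out.println(placeReplicas(prefList, Set.of("B"), 3));
        // prints {A=null, D=B, C=null}
    }
}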

10/29/2019 31cs262a-F19 Lecture-17

Implementation
• Java
– Event-triggered framework similar to SEDA
• Local persistence component allows different storage engines to be plugged in:
– Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes
– MySQL: objects of more than tens of kilobytes
– BDB Java Edition, etc.

10/29/2019 32cs262a-F19 Lecture-17

Summary of techniques used in Dynamo and their advantages

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information


10/29/2019 33cs262a-F19 Lecture-17

Evaluation

10/29/2019 34cs262a-F19 Lecture-17

Evaluation: Relaxed durability performance

10/29/2019 35cs262a-F19 Lecture-17

Is this a good paper?
• What were the authors’ goals?
• What about the evaluation/metrics?
• Did they convince you that this was a good system/approach?
• Were there any red flags?
• What mistakes did they make?
• Does the system/approach meet the “Test of Time” challenge?
• How would you review this paper today?

10/29/2019 36cs262a-F19 Lecture-17

BREAK


10/29/2019 37cs262a-F19 Lecture-17

OceanStore Vision: Utility Infrastructure

[Figure: OceanStore federation spanning providers such as PacBell, Sprint, IBM, AT&T, and a Canadian OceanStore]

• Data service provided by storage federation
• Cross-administrative domain
• Contractual Quality of Service (“someone to sue”)

10/29/2019 38cs262a-F19 Lecture-17

What are the advantages of a utility?
• For clients:
– Outsourcing of responsibility
» Someone else worries about quality of service
– Better reliability
» Utility can muster greater resources toward durability
» System not disabled by local outages
» Utility can focus resources (manpower) at security-vulnerable aspects of the system
– Better data mobility
» Starting with a secure network model enables sharing
• For the utility provider:
– Economies of scale
» Dynamically redistribute resources between clients
» Focused manpower can serve many clients simultaneously

10/29/2019 39cs262a-F19 Lecture-17

Key Observation: Want Automatic Maintenance
• Can’t possibly manage billions of servers by hand!
• System should automatically:
– Adapt to failure
– Exclude malicious elements
– Repair itself
– Incorporate new elements
• System should be secure and private
– Encryption, authentication
• System should preserve data over the long term (accessible for 100s of years):
– Geographic distribution of information
– New servers added / old servers removed
– Continuous repair: data survives for the long term

10/29/2019 40cs262a-F19 Lecture-17

OceanStore Assumptions
• Untrusted infrastructure:
– The OceanStore is comprised of untrusted components
– Individual hardware has finite lifetimes
– All data encrypted within the infrastructure
• Mostly well-connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit multicast for quicker consistency when possible
• Promiscuous caching:
– Data may be cached anywhere, anytime
• Responsible party:
– Some organization (i.e., a service provider) guarantees that your data is consistent and durable
– Not trusted with the content of data, merely its integrity

[Figure labels: Peer-to-peer; Quality-of-Service]


10/29/2019 41cs262a-F19 Lecture-17

Recall: Routing to Objects (Tapestry)

[Figure: DOLR routing in Tapestry, locating replicas of objects GUID1 and GUID2 through the overlay]

10/29/2019 42cs262a-F19 Lecture-17

OceanStore Data Model
• Versioned objects
– Every update generates a new version
– Can always go back in time (time travel)
• Each version is read-only
– Can have a permanent name
– Much easier to repair
• An object is a signed mapping between a permanent name and the latest version
– Write access control/integrity involves managing these mappings

[Figure: comet analogy, with updates arriving at the head and older versions trailing behind]

10/29/2019 43cs262a-F19 Lecture-17

Self-Verifying Objects

[Figure: a version’s data blocks (d1–d9) are organized under a B-tree of indirect blocks rooted at VGUIDi; an update copies on write to produce VGUIDi+1 with new blocks d'8, d'9 and a backpointer to the previous version]

• AGUID = hash{name + keys}
• Heartbeats map the AGUID to the latest read-only version: Heartbeat = {AGUID, VGUID, Timestamp}signed
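A minimal sketch of what “self-verifying” means for the blocks in the figure: a block’s GUID is the hash of its contents, and an indirect/root block that names its children by GUID yields a VGUID that pins down the whole read-only version. SHA-1 and the string encoding here are illustrative assumptions, not Pond’s exact block format.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SelfVerifyingBlocks {
    static String guidOf(byte[] block) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(block);
            return new BigInteger(1, digest).toString(16);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Any server can return the block; the client re-hashes it to check the GUID.
    static boolean verify(byte[] fetchedBlock, String expectedGuid) {
        return guidOf(fetchedBlock).equals(expectedGuid);
    }

    public static void main(String[] args) {
        byte[] d1 = "data block d1".getBytes(StandardCharsets.UTF_8);
        String bguid = guidOf(d1);                    // block GUID = hash of contents
        // An indirect/root block names its children by GUID; hashing it gives a
        // VGUID that pins down the entire read-only version (cf. the B-tree figure).
        byte[] root = ("children:" + bguid).getBytes(StandardCharsets.UTF_8);
        String vguid = guidOf(root);
        System.out.println("BGUID=" + bguid + "  VGUID=" + vguid);
        System.out.println(verify(d1, bguid));        // true: block matches its name
        d1[0] ^= 1;                                   // tamper with one byte
        System.out.println(verify(d1, bguid));        // false: tampering is detected
    }
}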

10/29/2019 44cs262a-F19 Lecture-17

Two Types of OceanStore Data
• Active data: “floating replicas”
– Per-object virtual server
– Interaction with other replicas for consistency
– May appear and disappear like bubbles
• Archival data: OceanStore’s stable store
– m-of-n coding: like a hologram
» Data coded into n fragments, any m of which are sufficient to reconstruct (e.g., m=16, n=64)
» Coding overhead is proportional to n/m (e.g., 4)
– Fragments are cryptographically self-verifying

• Most data in the OceanStore is archival!


10/29/2019 45cs262a-F19 Lecture-17

The Path of an OceanStore Update

[Figure: an update flows from clients to the inner-ring servers, which agree on it and then push it out to second-tier caches]

10/29/2019 46cs262a-F19 Lecture-17

Byzantine Agreement
• Guarantees all non-faulty replicas agree
– Given N = 3f + 1 replicas, up to f may be faulty/corrupt
• Expensive
– Requires O(N^2) communication
• Combine with primary-copy replication
– Small number participate in Byzantine agreement
– Multicast results of decisions to the remainder
• Threshold signatures
– Need at least f + 1 signature shares to generate a complete signature

10/29/2019 47cs262a-F19 Lecture-17

OceanStore API: Universal Conflict Resolution
• Consistency is a form of optimistic concurrency
– Updates contain predicate-action pairs
– Each predicate tried in turn:
» If none match, the update is aborted
» Otherwise, the action of the first true predicate is applied
• Role of the Responsible Party (RP):
– Updates submitted to the RP, which chooses a total order

[Figure: clients (NFS/AFS, IMAP/SMTP, HTTP, NTFS (soon?), native clients) sit on top of the OceanStore API, which provides (1) conflict resolution, (2) versioning/branching, (3) access control, (4) archival storage]

10/29/2019 48cs262a-F19 Lecture-17

Peer-to-Peer Caching: Automatic Locality Management
• Self-organizing mechanisms to place replicas
• Automatic construction of update multicast

[Figure: secondary replicas self-organize into a multicast tree rooted at the primary copy]


10/29/2019 49cs262a-F19 Lecture-17

Archival Dissemination of Fragments

[Figure: erasure-coded fragments disseminated across archival servers]

10/29/2019 50cs262a-F19 Lecture-17

Aside: Why erasure coding? High durability/overhead ratio!
• Exploit the law of large numbers for durability!
• Six-month repair, FBLPY:
– Replication: 0.03
– Fragmentation: 10^-35

[Figure: Fraction of Blocks Lost Per Year (FBLPY) vs. repair interval]
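A back-of-the-envelope sketch of why m-of-n coding wins: under the simplifying assumption that fragments and replicas fail independently with the same survival probability per repair interval, the block-loss probability is a binomial tail. The survival probability used below is illustrative and this is not the paper's exact failure model, but it shows the orders-of-magnitude gap at the same 4x storage overhead.

public class DurabilityEstimate {
    static double choose(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    // Probability a block is lost before repair: fewer than m of its n fragments
    // survive, assuming each fragment independently survives with probability p.
    static double fragmentedLossProb(int n, int m, double p) {
        double loss = 0.0;
        for (int k = 0; k < m; k++)
            loss += choose(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
        return loss;
    }

    // With plain replication, the block is lost only if every replica is lost.
    static double replicatedLossProb(int replicas, double p) {
        return Math.pow(1 - p, replicas);
    }

    public static void main(String[] args) {
        double p = 0.90;  // illustrative per-fragment/replica survival probability per repair epoch
        System.out.printf("2-way replication: %.2e%n", replicatedLossProb(2, p));
        System.out.printf("16-of-64 coding  : %.2e%n", fragmentedLossProb(64, 16, p));
        // Same n/m = 4 storage overhead, but a vastly smaller loss probability.
    }
}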

10/29/2019 51cs262a-F19 Lecture-17

Extreme Durability

• Exploiting infrastructure for repair
– DOLR permits an efficient heartbeat mechanism to notice:
» Servers going away for a while
» Or, going away forever!
– Continuous sweep through data also possible
– Erasure code provides flexibility in timing
• Data transferred from physical medium to physical medium
– No “tapes decaying in basement”
– Information becomes fully virtualized

• Thermodynamic Analogy: Use of Energy (supplied by servers) to Suppress Entropy

10/29/2019 52cs262a-F19 Lecture-17

Differing Degrees of Responsibility

• Inner ring provides quality of service
– Handles live data and write access control
– Focus utility resources on this vital service
– Compromised servers must be detected quickly
• Caching service can be provided by anyone
– Data encrypted and self-verifying
– Pay-for-service “caching kiosks”?
• Archival storage and repair
– Read-only data: easier to authenticate and repair
– Trade off redundancy for responsiveness

• Could be provided by different companies!


10/29/2019 53cs262a-F19 Lecture-17

OceanStore Prototype (Pond)
• All major subsystems operational
– Self-organizing Tapestry base
– Primary replicas use Byzantine agreement
– Secondary replicas self-organize into a multicast tree
– Erasure-coding archive
– Application interfaces: NFS, IMAP/SMTP, HTTP
• 280K lines of Java (J2SE v1.3)
– JNI libraries for cryptography, erasure coding
• PlanetLab deployment (FAST 2003, “Pond” paper)
– 220 machines at 100 sites in North America, Europe, Australia, Asia, etc.
– 1.26 GHz PIII (1 GB RAM), 1.8 GHz PIV (2 GB RAM)
– OceanStore code running with 1000 virtual-node emulations

10/29/2019 54cs262a-F19 Lecture-17

Event-Driven Architecture

• Data-flow style
– Arrows indicate flow of messages
• Potential to exploit small multiprocessors at each physical node

[Figure: event-driven stages connected by message queues; “World” represents messages to and from the network]

10/29/2019 55cs262a-F19 Lecture-17

Why aren’t we using Pond every day?

10/29/2019 56cs262a-F19 Lecture-17

Problem #1: DOLR is a Great Enabler, but Only If It Is Stable
• Had reasonable stability:
– In simulation
– Or with a small error rate
• But trouble in the wide area:
– Nodes might be lost and never reintegrate
– Routing state might become stale or be lost
• Why?
– Complexity of algorithms
– Wrong design paradigm: strict rather than loose state
– Immediate repair of faults
• Ultimately, the Tapestry routing framework succumbed to:
– Creeping featurism (designed by several people)
– Fragility under churn
– Code bloat


10/29/2019 57cs262a-F19 Lecture-17

Answer: Bamboo!
• Simple, stable, targeting failure
• Rethinking of the design of Tapestry:
– Separation of correctness from performance
– Periodic recovery instead of reactive recovery
– Network understanding (e.g., timeout calculation)
– Simpler node integration (smaller amount of state)
• Extensive testing under churn and partition
• Bamboo is so stable that it is part of the OpenHash public DHT infrastructure
• In wide use by many researchers

10/29/2019 58cs262a-F19 Lecture-17

Problem #2: Pond Write Latency
• Byzantine algorithm adapted from Castro & Liskov
– Gives fault tolerance, security against compromise
– Fast version uses symmetric cryptography
• Pond uses threshold signatures instead
– Signature proves that f + 1 primary replicas agreed
– Can be shared among secondary replicas
– Can also change primaries w/o changing the public key
• Big plus for maintenance costs
– Results good for all time once signed
– Replace faulty/compromised servers transparently

10/29/2019 59cs262a-F19 Lecture-17

Closer Look: Write Cost
• Small writes
– Signature dominates
– Threshold sigs. slow! Takes 70+ ms to sign
– Compare to 5 ms for regular sigs.
• Large writes
– Encoding dominates
– Archive cost per byte
– Signature cost per write
• Answer: reduction in overheads
– More powerful hardware at core
– Cryptographic hardware
» Would greatly reduce write cost
» Possible use of ECC or other signature method
– Offloading of archival encoding

Phase        4 kB write   2 MB write
Validate     0.3          0.4
Serialize    6.1          26.6
Apply        1.5          113.0
Archive      4.5          566.9
Sign Result  77.8         75.8
(times in milliseconds)

10/29/2019 60cs262a-F19 Lecture-17

Problem #3: Efficiency
• No resource aggregation
– Small blocks spread widely
– Every block of every file on a different set of servers
– Not uniquely an OceanStore issue!
• Answer: two-level naming
– Place data in larger chunks (‘extents’)
– Individual access of blocks by name within extents
– Bonus: secure log is a good interface for a secure archive
– Antiquity: new prototype for archival storage
– Similarity to SSTable use in BigTable?

[Figure: blocks (V, R, I, B) aggregated into extents E0 and E1; a client issues get(E1, R1) to fetch block R1 from extent E1]


10/29/2019 61cs262a-F19 Lecture-17

Problem #4: Complexity
• Several of the mechanisms were complex
– Ideas were simple, but implementation was complex
– Data format was a combination of live and archival features
– Byzantine agreement hard to get right
• Ideal layering not obvious at the beginning of the project:
– Many application features placed into Tapestry
– Components not autonomous, i.e., able to be tied in at any moment and restored at any moment
• Top-down design lost during thinking and experimentation
• Everywhere: reactive recovery of state
– Original philosophy: get it right once, then repair
– Much better: keep working toward the ideal (but assume you never make it)

10/29/2019 62cs262a-F19 Lecture-17

Other Issues/Ongoing Work at the Time:
• Archival repair expensive if done incorrectly:
– Small blocks consume excessive storage and network bandwidth
– Transient failures consume unnecessary repair bandwidth
– Solutions: collect blocks into extents and use threshold repair
• Resource management issues
– Denial of service/over-utilization of storage a serious threat
– Solution: exciting new work on fair allocation
• Inner ring provides an incomplete solution:
– Complexity of the Byzantine agreement algorithm is a problem
– Working on better distributed key generation
– Better access control + secure hardware + simpler Byzantine algorithm?
• Handling of low-bandwidth links and partial disconnection
– Improved efficiency of data storage
– Scheduling of links
– Resources are never unbounded

• Better Replica placement through game theory?

10/29/2019 63cs262a-F19 Lecture-17

Follow-on Work

10/29/2019 64cs262a-F19 Lecture-17

Bamboo OpenDHT
• PlanetLab deployment running for several months
• Put/get via RPC over TCP


10/29/2019 65cs262a-F19 Lecture-17

OceanStore Archive: Antiquity
• Secure log:
– Can only modify at one point: the log head
» Makes consistency easier
– Self-verifying
» Every entry securely points to the previous one, forming a Merkle chain
» Prevents substitution attacks
– Random read access: can still read efficiently
• Simple and secure primitive for storage
– Log identified by a cryptographic key pair
– Only the owner of the private key can modify the log
– Thin interface, only append()
• Amenable to secure, durable implementation
– Byzantine quorum of storage servers
» Can survive failures at O(n) cost instead of O(n^2) cost
– Efficiency through aggregation
» Use of extents and two-level naming
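A minimal sketch of the hash-chained log described above: it shows only the Merkle chain and its verification, with signing by the log’s key pair and the Byzantine quorum of storage servers left out (noted in comments), so it is an illustration rather than Antiquity’s protocol.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SecureLog {
    // Each entry securely points to its predecessor by hash, forming a Merkle chain;
    // the head hash names (and authenticates) the entire history.
    record Entry(byte[] prevHash, byte[] payload, byte[] hash) {}

    private final List<Entry> entries = new ArrayList<>();
    private byte[] head = new byte[32];          // all-zero hash stands for the empty log

    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // The only mutation point is the log head; a real system would also require the
    // append to be signed by the log's private key and accepted by a Byzantine quorum.
    void append(String data) {
        byte[] payload = data.getBytes(StandardCharsets.UTF_8);
        byte[] h = sha256(head, payload);
        entries.add(new Entry(head, payload, h));
        head = h;
    }

    // Anyone can re-walk the chain and detect substitution of any earlier entry.
    boolean verify() {
        byte[] prev = new byte[32];
        for (Entry e : entries) {
            if (!Arrays.equals(e.prevHash(), prev)) return false;
            if (!Arrays.equals(e.hash(), sha256(prev, e.payload()))) return false;
            prev = e.hash();
        }
        return Arrays.equals(prev, head);
    }

    public static void main(String[] args) {
        SecureLog log = new SecureLog();
        log.append("block 1");
        log.append("block 2");
        System.out.println("chain verifies: " + log.verify());   // true
    }
}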

10/29/2019 66cs262a-F19 Lecture-17

Antiquity Architecture: Universal Secure Middleware

[Figure: applications, servers, and replicated services append to secure logs (blocks V1, R1, I2, B4, B3, I1, B2, B1), whose replicas are spread across the storage system]

• Data source
– Creator of data
• Client
– Direct user of the system
» “Middleware”
» End-user, server, replicated service
– append()'s to the log
– Signs requests
• Storage servers
– Store log replicas on disk
– Dynamic Byzantine quorums
» Consistency and durability
• Administrator
– Selects storage servers
• Prototype operational on PlanetLab

10/29/2019 67cs262a-F19 Lecture-17

Is this a good paper?
• What were the authors’ goals?
• What about the evaluation/metrics?
• Did they convince you that this was a good system/approach?
• Were there any red flags?
• What mistakes did they make?
• Does the system/approach meet the “Test of Time” challenge?
• How would you review this paper today?