Data-Centric Reconfiguration with Network-Attached Disks
Alex Shraer (Technion)
Joint work with:
J.P. Martin, D. Malkhi, M. K. Aguilera (MSR) I. Keidar (Technion)
Preview
• The setting: data-centric replicated storage
– Simple network-attached storage nodes
• Our contributions:
1. First distributed reconfigurable R/W storage
2. Asynchronous vs. consensus-based reconfiguration
(Reconfiguration allows storage nodes to be added and removed dynamically)
Enterprise Storage Systems
• Highly reliable customized hardware
• Controllers, I/O ports may become a bottleneck
• Expensive
• Usually not extensible
– Different solutions for different scale
– Example (HP): high end – XP (1152 disks), mid range – EVA (324 disks)
Alternative – Distributed Storage
• Made up of many storage nodes
• Unreliable, cheap hardware
• Failures are the norm, not an exception
• Challenges:
– Achieving reliability and consistency
– Supporting reconfigurations
Distributed Storage Architecture
• Unpredictable network delays (asynchrony)
[Figure: storage clients issue read/write requests over a LAN/WAN to cloud storage; the clients are dynamic and fault-prone, and the storage nodes are fault-prone]
A Case for Data-Centric Replication
• Client-side code runs replication logic
– Communicates with multiple storage nodes
• Simple storage nodes (servers)
– Can be network-attached disks
Not necessarily PCs with disks; do not run application-specific code; less fault-prone components
– Simply respond to client requests → high throughput
– Do not communicate with each other
If storage nodes communicate, their failures are likely to be correlated!
Oblivious to where the other replicas of each object are stored
Scalable: the same storage node can be used for many replication sets
[Figure: a not-so-thin client talks to thin storage nodes]
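The client-side replication logic sketched on this slide can be illustrated with a toy ABD-style majority read/write over passive storage nodes. This is a hypothetical sketch, not DynaDisk code: the class names are invented, the network is replaced by direct method calls, and for simplicity the client contacts a fixed majority instead of racing all nodes and waiting for a quorum of replies.

```python
class StorageNode:
    """A passive storage node: keeps (timestamp, value) pairs and never
    communicates with other nodes -- it only answers client requests."""
    def __init__(self):
        self.store = {}                       # key -> (timestamp, value)

    def write(self, key, ts, value):
        # Keep only the highest-timestamped value per key.
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)
        return "OK"

    def read(self, key):
        return self.store.get(key, (0, None))


class Client:
    """Client-side replication logic over a static set of nodes,
    in the style of ABD majority read/write."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.majority = len(nodes) // 2 + 1

    def write(self, key, value):
        quorum = self.nodes[:self.majority]
        # Phase 1: pick a timestamp higher than any seen by a majority.
        ts = max(n.read(key)[0] for n in quorum) + 1
        # Phase 2: store (ts, value) at a majority.
        acks = [n.write(key, ts, value) for n in quorum]
        assert acks.count("OK") >= self.majority

    def read(self, key):
        quorum = self.nodes[:self.majority]
        # Return the highest-timestamped value seen at a majority
        # (a full ABD read would also write this value back).
        ts, value = max((n.read(key) for n in quorum), key=lambda t: t[0])
        return value


nodes = [StorageNode() for _ in range(3)]
client = Client(nodes)
client.write("x", "Italy")
client.write("x", "Spain")
print(client.read("x"))   # -> Spain
```

Note that the nodes run no replication logic at all: ordering comes entirely from the timestamps chosen by clients, which is what makes the nodes "thin".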
Real Systems Are Dynamic
The challenge: maintain consistency, reliability, availability
[Figure: storage nodes A–E connected over a LAN/WAN; reconfig {–A, –B} removes nodes A and B, while reconfig {–C, +F, …, +I} replaces C with new nodes F–I]
Pitfall of Naïve Reconfiguration
[Figure: nodes A–D each store the configuration {A, B, C, D}. Two clients concurrently run reconfig {+E} and reconfig {–D}; because some messages are delayed, some nodes end up with configuration {A, B, C, D, E} and others with {A, B, C}. A later read returns the stale value "Italy"!]
Pitfall of Naïve Reconfiguration (cont.)

[Figure: with the nodes split between configurations {A, B, C, D, E} and {A, B, C}, one client writes x = "Spain" (timestamp 2) to a majority of one configuration, while another client reads x from a majority of the other configuration and still obtains x = "Italy" (timestamp 1). Split brain!]
Reconfiguration Option 1: Centralized
• Can be automatic
– E.g., Ursa Minor [Abd-El-Malek et al., FAST 05]
• Downtime
– Most solutions stop R/W while reconfiguring
• Single point of failure
– What if the manager crashes while changing the system?

"Tomorrow Technion servers will be down for maintenance from 5:30am to 6:45am.
Virtually yours, Moshe Barak"
Reconfiguration Option 2: Distributed Agreement
• Servers agree on the next configuration
– Previous solutions are not data-centric
• No downtime
• In theory, might never terminate [FLP85]
• In practice, we have partial synchrony, so it usually works
Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, S., PODC09]
• Distributed & completely asynchronous
• No downtime
• Always terminates
• Not data-centric
In this work: DynaDisk – dynamic data-centric R/W storage
1. First distributed data-centric solution
– No downtime
2. Tunable reconfiguration method
– Modular design: coordination is separate from data
– Allows easily setting/comparing the coordination method
– Consensus-based vs. asynchronous reconfiguration
3. Many shared objects
– Running a protocol instance per object is too costly
– Transferring all state at once might be infeasible
– Our solution: incremental state transfer
4. Built with an external (weak) location service
– We formally state the requirements from such a service
Location Service
• Used in practice, ignored in theory
• We formalize the weak external service as an oracle:
– oracle.query() returns some "legal" configuration
– If reconfigurations stop and oracle.query() is invoked infinitely many times, it eventually returns the last system configuration
• Not enough to solve reconfiguration
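The oracle's two properties can be captured by a minimal interface sketch. Everything here is illustrative (the class and method names are not from DynaDisk); note that a real location service is only eventually accurate, which this toy version trivially satisfies by always returning the latest configuration it was told about.

```python
class LocationOracle:
    """A sketch of the weak location-service oracle from the slide.
    Names are illustrative, not from the DynaDisk code."""
    def __init__(self, initial_config):
        self._last = frozenset(initial_config)

    def notify(self, config):
        # Reconfigurations eventually report new configurations here.
        self._last = frozenset(config)

    def query(self):
        # Property 1: always returns some "legal" configuration
        # (possibly a stale one).
        # Property 2: once reconfigurations stop, repeated queries
        # eventually return the last system configuration.
        return self._last


oracle = LocationOracle({"A", "B", "C"})
oracle.notify({"A", "B", "D"})
print(sorted(oracle.query()))   # -> ['A', 'B', 'D']
```

The slide's point stands in code form too: query() may return a stale configuration at any time, so the oracle alone cannot solve reconfiguration; it only bootstraps clients into some configuration from which the protocol can catch up.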
The Coordination Module in DynaDisk

[Figure: storage devices A, B, C in a configuration conf = {+A, +B, +C}; each device stores R/W objects x, y, z and a "next config" field]

• Distributed R/W objects, updated similarly to ABD
• Distributed "weak snapshot" object
API: update(set of changes) → OK
     scan() → set of updates
Coordination with Consensus

[Figure: clients concurrently invoke reconfig({–C}) and reconfig({+D}); both proposals go to a consensus object, which decides +D, and +D is then written to the devices' "next config" fields]

• update: propose the change to consensus; the decided update (+D) is written to the devices
• scan: read & write back the next config from a majority
– every scan returns {+D} or the empty set
Weak Snapshot – Weaker than Consensus
• No need to agree on the next configuration, as long as each process has a set of possible next configurations, and all such sets intersect
– Intersection allows clients to converge and again use a single config
• Non-empty intersection property of weak snapshot:
– Every two non-empty sets returned by scan() intersect
– Example:

  Client 1's scan | Client 2's scan
  {+D}            | {+D}            ← what consensus gives
  {–C}            | {+D, –C}        ← allowed by weak snapshot
  {+D}            | {–C}            ← disallowed: the scans do not intersect
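The non-empty intersection property from the table can be checked mechanically. A minimal sketch (the function name is invented) applied to the three example rows above:

```python
def satisfies_intersection(scan1, scan2):
    """Non-empty intersection property: any two non-empty scan
    results must share at least one proposed update."""
    if not scan1 or not scan2:
        return True     # empty scans are unconstrained
    return bool(scan1 & scan2)


# The three example rows from the table:
print(satisfies_intersection({"+D"}, {"+D"}))          # True
print(satisfies_intersection({"-C"}, {"+D", "-C"}))    # True
print(satisfies_intersection({"+D"}, {"-C"}))          # False -- violates the property
```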
Coordination without Consensus

[Figure: clients concurrently invoke reconfig({–C}) and reconfig({+D}); each device A, B, C holds a vector of proposal slots next to its "next config" field, written with compare-and-swap operations such as CAS({–C}, ⊥, 0) and WRITE({–C}, 0)]

• update: install the proposal on a majority of devices using compare-and-swap
• scan: read & write back proposals from a majority (twice)
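The asynchronous variant can be sketched as follows, under heavy simplifications: a single-threaded toy with one CAS slot per client on each device (standing in for DynaDisk's update vector), and a fixed majority instead of racing all devices and waiting for a quorum of acknowledgements. All names are illustrative.

```python
class Device:
    """A storage device holding one CAS slot per client; devices
    never talk to each other."""
    def __init__(self, nclients):
        self.slots = [None] * nclients

    def cas(self, i, expected, value):
        # Compare-and-swap on slot i (atomic here only because this
        # sketch is single-threaded).
        if self.slots[i] == expected:
            self.slots[i] = value
            return True
        return False

    def read(self):
        return {i: p for i, p in enumerate(self.slots) if p is not None}


def update(devices, majority, client, proposal):
    """Install this client's proposal on a majority of devices."""
    for d in devices[:majority]:
        d.cas(client, None, proposal)


def scan(devices, majority):
    """Read proposals from a majority, write them back, then read
    again -- the write-back is what makes non-empty scans intersect."""
    def collect():
        seen = {}
        for d in devices[:majority]:
            seen.update(d.read())
        return seen

    first = collect()
    for d in devices[:majority]:
        for i, p in first.items():
            d.cas(i, None, p)              # write-back
    return set(collect().values())


devices = [Device(nclients=2) for _ in range(3)]
update(devices, majority=2, client=0, proposal="-C")
update(devices, majority=2, client=1, proposal="+D")
print(sorted(scan(devices, majority=2)))   # -> ['+D', '-C']
```

No step ever waits for agreement, which is why this coordination terminates even in a fully asynchronous run; the price is that concurrent scans may return different (but intersecting) sets of proposals.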
Tracking Evolving Config's
• With consensus: agree on the next configuration
• Without consensus – usually a chain, sometimes a DAG:

[Figure: from configuration {A, B, C}, one client's weak-snapshot scan() returns {+D} (moving to {A, B, C, D}) while another's returns {+D, –C}; the branches {A, B, C, D} and {A, B} are merged into {A, B, D} once the inconsistent updates are found. All non-empty scans intersect.]
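The merge step in the figure can be sketched with a small helper (name invented) that applies scanned update sets to a configuration: when two scans disagree, applying the union of all scanned updates brings the branches back to a single configuration.

```python
def apply_updates(config, updates):
    """Apply a set of changes like {'+D', '-C'} to a configuration."""
    cfg = set(config)
    for u in updates:
        if u.startswith("+"):
            cfg.add(u[1:])
        else:
            cfg.discard(u[1:])
    return cfg


base = {"A", "B", "C"}
print(sorted(apply_updates(base, {"+D"})))          # -> ['A', 'B', 'C', 'D']
print(sorted(apply_updates(base, {"-C"})))          # -> ['A', 'B']
print(sorted(apply_updates(base, {"+D", "-C"})))    # -> ['A', 'B', 'D']
```

This mirrors the DAG in the figure: the two branches {A, B, C, D} and {A, B} both reconverge on {A, B, D} once every client has seen the union {+D, –C}.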
Consensus-based vs. Asynchronous Coordination
• Two implementations of weak snapshots:
– Asynchronous
– Partially synchronous (consensus-based)
  • Active Disk Paxos [Chockler, Malkhi, 2005]
  • Exponential backoff for leader election
• Unlike asynchronous coordination, consensus-based might not terminate [FLP85]
• Storage overhead (per storage device and configuration):
– Asynchronous: a vector of updates
  • vector size ≤ min(#reconfigs, #members in config)
– Consensus-based: 4 integers and the chosen update
Strong progress guarantees are not for free

[Figure: average write latency and average reconfig latency (ms.) vs. number of simultaneous reconfig operations (0, 1, 2, 5), comparing consensus-based and asynchronous (no consensus) coordination. Asynchronous coordination has a significant negative effect on R/W latency, but slightly better and much more predictable reconfig latency when many reconfigs execute simultaneously; the two methods perform the same when there are no reconfigurations.]
Future & Ongoing Work
• Combine asynchronous and partially synchronous coordination
• Consider other weak snapshot implementations– E.g., using randomized consensus
• Use weak snapshots to reconfigure other services– Not just for R/W
Summary
• DynaDisk – dynamic data-centric R/W storage
– First decentralized solution
– No downtime
– Supports many objects, provides incremental reconfiguration
– Uses one coordination object per config (not per object)
– Tunable reconfiguration method
  • We implemented asynchronous and consensus-based coordination
  • Many other implementations of weak snapshots are possible
• Asynchronous coordination in practice:
– Works in more circumstances → more robust
– But at a cost: significantly affects ongoing R/W ops