OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Page 1: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

OceanStore Status and Directions

ROC/OceanStore Retreat 1/16/01

John Kubiatowicz
University of California at Berkeley

Page 2: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Questions about ubiquitous information:

• Where is persistent information stored?
  – Want: Geographic independence for availability, durability, and freedom to adapt to circumstances
• How is it protected?
  – Want: Encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity
• Can we make it indestructible?
  – Want: Redundancy with continuous repair and redistribution for long-term durability
• Is it hard to manage?
  – Want: Automatic optimization, diagnosis and repair

Page 3: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Everyone’s Data, One Utility

• Millions of servers, billions of clients…
• 1000-YEAR durability (excepting fall of society)
• Maintains Privacy, Access Control, Authenticity
• Incrementally Scalable (“Evolvable”)
• Self Maintaining!
• Not quite peer-to-peer:
  – Utilizing servers in infrastructure
  – Some computational nodes more equal than others

Page 4: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Want Automatic Maintenance

• Can’t possibly manage billions of servers by hand!
• System should:
  – Be Fault-Tolerant (High MTTF)
  – Repair itself (Low MTTR through adaptation)
  – Incorporate new elements
• Can we guarantee data is available for 1000 years?
  – New servers added from time to time
  – Old servers removed from time to time
  – Everything just works
• Many components with geographic separation
  – System not disabled by natural disasters
  – Can adapt to changes in demand and regional outages
  – Gain in stability through statistics
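The two goals above (high MTTF, low MTTR) combine in the standard steady-state availability ratio MTTF / (MTTF + MTTR). A minimal sketch of that arithmetic follows; the function name and the hour values in the usage note are illustrative assumptions, not figures from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time a component is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)
```

For a fixed MTTF, `availability(1000, 5)` is higher than `availability(1000, 10)`, which is why faster automatic repair matters as much as avoiding failures.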

Page 5: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

OceanStore Assumptions

• Untrusted Infrastructure:
  – The OceanStore is composed of untrusted components
  – Only ciphertext within the infrastructure
• Responsible Party:
  – Some organization (e.g., a service provider) guarantees that your data is consistent and durable
  – Not trusted with content of data, merely its integrity
• Mostly Well-Connected:
  – Data producers and consumers are connected to a high-bandwidth network most of the time
  – Exploit multicast for quicker consistency when possible
• Promiscuous Caching:
  – Data may be cached anywhere, anytime

Page 6: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

This Talk: making it real!

(Or: you will hear reality from my students)

Page 7: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

The Path of an OceanStore Update

[Figure: update path through Clients, Inner-Ring Servers, Multicast trees, and Second-Tier Caches]

Page 8: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Important Components:

• Data Object: (Distribution-enabled data format)
  – Must support copy-on-write and versioning efficiently
  – Must allow sparse population of data in caches
  – Must smoothly interface with archive
• Inner Ring: (Byzantine Agreement)
  – Check write access control
  – Serialize updates/resolve micro-conflicts
  – Sign result with Threshold Signature
  – Erasure code result and send fragments
• Second Tier Server: (Promiscuous Caches)
  – Serve local clients
  – Tie itself into Dissemination tree
  – Apply updates that it receives through tree
  – Decision point for caching policies: tentative vs committed

Page 9: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Implementation Framework

[Figure: stage modules (Consistency, Location & Routing, Archival, Introspection Modules, X, Y) attached to a Dispatch layer, with a Thread Scheduler, Asynchronous Disk, and Asynchronous Network, layered over the Java Virtual Machine, Operating System, and Network]

• Event-driven Implementation Model in Java
  – Divided into a sequence of communicating “stages”
  – Communication between stages in the form of “snoopable” messages
  – > 100,000 lines of Java, Comments, Test scripts
  – Substantially functioning!
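The staged, event-driven model with “snoopable” messages can be illustrated with a small sketch. This is not the OceanStore Java code: the `Dispatcher` class, the message type strings, and the two demo stages are hypothetical names chosen for illustration, and the sketch is written in Python for brevity.

```python
from collections import deque


class Dispatcher:
    """Routes messages between stages and lets observers snoop on every message."""

    def __init__(self):
        self.handlers = {}    # message type -> list of stage callbacks
        self.snoopers = []    # callbacks that see every message
        self.queue = deque()  # pending (type, payload) events

    def subscribe(self, msg_type, handler):
        self.handlers.setdefault(msg_type, []).append(handler)

    def snoop(self, observer):
        self.snoopers.append(observer)

    def post(self, msg_type, payload):
        self.queue.append((msg_type, payload))

    def run(self):
        # Single-threaded event loop: drain the queue until no stage posts more work.
        while self.queue:
            msg_type, payload = self.queue.popleft()
            for observer in self.snoopers:
                observer(msg_type, payload)
            for handler in self.handlers.get(msg_type, []):
                handler(payload)


def run_demo():
    d = Dispatcher()
    seen, archived = [], []
    # Hypothetical stages: an "update" stage forwards its result to an "archive" stage.
    d.subscribe("update", lambda p: d.post("archive", p.upper()))
    d.subscribe("archive", archived.append)
    # A snooper (for example an introspection module) observes traffic without being a recipient.
    d.snoop(lambda msg_type, payload: seen.append(msg_type))
    d.post("update", "block-1")
    d.run()
    return seen, archived
```

Because stages only exchange messages through the dispatcher, an observer can be attached without changing any stage, which is the property the word “snoopable” suggests.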

Page 10: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

GUIDs for Naming

• Unique, location independent identifiers:
  – Every version of every unique entity has a permanent Version-GUID (or VGUID): hash over content. Versioning supports time-travel
  – Each object has a permanent (version-independent) Archival-GUID (or AGUID)
  – Signed associations between AGUIDs and latest VGUIDs are produced by inner ring (called Heartbeats)
• Naming hierarchy:
  – Users map from names to AGUIDs via hierarchy of OceanStore objects
  – Each link is an AGUID

[Figure: naming hierarchy with nodes Foo, Bar, Baz, and Myfile, entered through an out-of-band “root link”]
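A minimal sketch of content-hashed version naming, assuming SHA-256 as the hash (the slides do not name a hash function here). The `ObjectHistory` class, its methods, and the `"aguid-myfile"` string are hypothetical; the heartbeat signature produced by the inner ring is omitted.

```python
import hashlib


def vguid(content: bytes) -> str:
    """Version GUID: a hash over the content (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(content).hexdigest()


class ObjectHistory:
    """Hypothetical record tying one permanent AGUID to its chain of VGUIDs."""

    def __init__(self, aguid: str):
        self.aguid = aguid
        self.versions = []  # VGUIDs, oldest first
        self.store = {}     # VGUID -> content

    def write(self, content: bytes) -> str:
        guid = vguid(content)
        self.store[guid] = content
        self.versions.append(guid)
        return guid

    def latest(self) -> str:
        # In OceanStore this AGUID -> latest VGUID association is signed by the
        # inner ring (a heartbeat); signing is left out of this sketch.
        return self.versions[-1]

    def read(self, guid: str) -> bytes:
        content = self.store[guid]
        # Self-verifying: the name is the hash, so any replica can be checked.
        assert vguid(content) == guid
        return content
```

Usage: after `h = ObjectHistory("aguid-myfile")`, `v1 = h.write(b"draft")`, and `v2 = h.write(b"final")`, the call `h.latest()` returns `v2` while `h.read(v1)` still returns the old bytes, which is the "time-travel" property of versioning.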

Page 11: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Data Object Structure

All about flexibility and validation

[Figure: a log object with checkpoints (V6, V11) and log entries (V7 to V10); each version is a data B-tree of indirect blocks over data blocks d1 to d9, with copy-on-write blocks d'8 and d'9; blocks are named by GUID (e.g., GUID of d1, GUID of d'8); a set of blocks is the unit of coding, and the encoded fragments, each with a verification tree, are the unit of archival storage]

Page 12: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Status: Data Object Development

• Second-Tier Replica support: functional
  – Second-tier caches can hold multiple versions
  – Tie themselves into multicast trees
    • Several dissemination tree algorithms explored
    • Updates forwarded from inner ring through trees
• Complete B-Tree object structure developed
  – Data blocks named with unforgeable hashes
    • Hashes can point to archival fragments/live blocks
  – Supports copy on write
  – Top block defines complete version
    • Missing blocks filled in from archive or other replicas
• Update commits with distributed threshold signatures
  – Byzantine commitment not quite integrated into prototype
• Traffic generator for testing
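The combination of hash-named blocks, copy on write, and a top block that defines a version can be sketched as follows. This is a simplified two-level tree, not the B-tree from the prototype; SHA-256, the in-memory `store` dictionary, and all function names are assumptions made for illustration.

```python
import hashlib

store = {}  # hash -> block bytes; stands in for caches and the archive


def put(block: bytes) -> str:
    """Store a block under the hash of its contents and return that name."""
    name = hashlib.sha256(block).hexdigest()
    store[name] = block
    return name


def make_version(data_blocks):
    """Store data blocks, then a top block listing their hashes; return the top hash."""
    names = [put(b) for b in data_blocks]
    return put("\n".join(names).encode())


def update_block(top: str, index: int, new_block: bytes) -> str:
    """Copy on write: only the changed block and a new top block are written."""
    names = store[top].decode().split("\n")
    names[index] = put(new_block)
    return put("\n".join(names).encode())


def read_version(top: str):
    """Follow the hashes in the top block to recover every data block of a version."""
    return [store[n] for n in store[top].decode().split("\n")]
```

An update returns a new top hash and leaves the old one readable, so both versions share every unchanged block. A block missing from `store` would raise `KeyError` here; in the system described above it would instead be fetched from the archive or another replica.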

Page 13: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Exploiting Law of Large Numbers for Durability
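The statistical argument can be made concrete with a binomial calculation, assuming fragments fail independently with the same probability. The numbers below (32 fragments of which any 16 suffice, a 10% failure probability) are illustrative assumptions, not results from the talk.

```python
from math import comb


def survival_probability(n: int, m: int, p_fail: float) -> float:
    """Probability that at least m of n independently stored fragments survive."""
    p_ok = 1.0 - p_fail
    return sum(comb(n, k) * p_ok**k * p_fail**(n - k) for k in range(m, n + 1))


# Same 2x storage overhead in both cases (illustrative numbers only):
# two whole replicas versus 32 fragments of which any 16 reconstruct the data.
replication_loss = 1.0 - survival_probability(n=2, m=1, p_fail=0.1)
erasure_loss = 1.0 - survival_probability(n=32, m=16, p_fail=0.1)
```

Under these assumptions two replicas are both lost with probability 0.01, while the fragmented encoding loses data with a probability several orders of magnitude smaller. The independence assumption is what the placement of fragments across separate failure domains has to justify.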

Page 14: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

The Dissemination Process

[Figure: Model Builder, Set Creator, and two Disseminators; inputs labeled Introspection, Human Input, and Network Monitoring; arrows labeled model, set, probe, type, and fragments]

Page 15: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Achieving Low MTTR: Global Heartbeats

• Trigger repair when level of redundancy too low
• Continuous sweep (slowly over time)
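One pass of such a sweep might look like the sketch below. The function name, the `threshold` and `target` parameters, and the idea of a list of spare servers are assumptions for illustration; the slides do not specify the repair policy.

```python
def sweep(objects, alive, threshold, target, spare_servers):
    """One pass of a slow, continuous sweep over all objects.

    objects: dict of object id -> set of servers holding its fragments
    alive: set of servers currently answering heartbeats
    Repairs any object whose count of live fragments fell below `threshold`
    by placing fragments on spare servers until `target` is reached.
    """
    repaired = []
    for obj_id, holders in objects.items():
        live = holders & alive
        if len(live) < threshold:
            candidates = [s for s in spare_servers if s in alive and s not in live]
            needed = target - len(live)
            objects[obj_id] = live | set(candidates[:needed])
            repaired.append(obj_id)
    return repaired
```

Setting `threshold` below `target` means repair is triggered only after several fragments are lost, so transient outages do not cause constant re-dissemination.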

Page 16: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Status: Archival Infrastructure

• Archival Fragments generated by Inner Ring
  – Multi-stage-based implementation at inner ring
  – Storage servers hold fragments
  – Caching servers (2nd-tier replicas) hold data objects
• Independence Analysis (mostly there)
  – Node discovery technique exists
  – Analysis of long-running reliability data
  – Dissemination-set creator: initial versions
• Storage servers (Naïve but functional):
  – Initial implementation: cache + object store
  – Ongoing tuning efforts
  – Redesign in the works

Page 17: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Location Independent Routing

• Paradigm: Routing
  – Route messages to objects by GUID regardless of location
• Fast, probabilistic search for “routing cache”:
  – Built from attenuated bloom filters
  – Approximation to gradient search
• Redundant Plaxton Mesh used for underlying routing infrastructure:
  – Randomized data structure with locality properties
  – Redundant, insensitive to faults, and repairable
  – Amenable to continuous adaptation to adjust for:
    • Changing network behavior
    • Faulty servers
    • Denial of service attacks
• Tomorrow: 3 talks on Routing
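The core idea of Plaxton-style routing is that each hop resolves at least one more digit of the destination identifier. The sketch below illustrates that with global knowledge of all node IDs; a real mesh keeps per-node routing tables chosen for network locality, and resolving digits from the front (prefix) rather than the back is an assumption of this sketch. All names are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two identifiers have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, target: str, nodes):
    """Pick a node that matches the target in at least one more leading digit."""
    have = shared_prefix_len(current, target)
    better = [n for n in nodes if shared_prefix_len(n, target) > have]
    if not better:
        return None  # the current node is the closest match in this network
    # Take the smallest improvement, mimicking one routing-table level per hop.
    return min(better, key=lambda n: (shared_prefix_len(n, target), n))


def route(start: str, target: str, nodes):
    """Return the list of nodes visited while routing from start toward target."""
    path = [start]
    while path[-1] != target:
        hop = next_hop(path[-1], target, nodes)
        if hop is None:
            break
        path.append(hop)
    return path
```

With nodes `["0325", "4598", "4227", "4211", "4210"]`, routing from `"0325"` to `"4211"` visits `4598`, `4227`, `4210`, then `4211`, matching one more digit at each step. When the exact identifier does not exist, the route stops at the node sharing the longest prefix.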

Page 18: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Status: Location Independent Routing

• Basic Tapestry infrastructure is operational
  – Single-path static routing: works
  – Multi-path adaptive routing: mostly there
  – Dynamic Integration of new nodes: implemented
• Network adaptation almost there (Patchwork)
  – Framework for Measurement of network properties
  – Periodic beacons measure loss and network latency
• Exploitation of Differences in nodes:
  – Brocade backbone supplement to Tapestry: Improves routing
  – Differentiation in service experiments ongoing
• Theoretical Results on Tapestry
  – Construction/Analysis of Dynamic Integration Algorithms
  – Voluntary/involuntary node deletion algorithms
  – View of Tapestry as data structure for solving nearest neighbor
• Attenuated Bloom Filters are operational
  – Implemented and functional
  – Optimizes short-distance routing infrastructure!
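An attenuated Bloom filter keeps one Bloom filter per hop distance for each outgoing link, so a node can guess which neighbor leads to an object in the fewest hops. The sketch below assumes SHA-256-derived bit positions, a 256-bit filter, and hypothetical class and function names; it is not the OceanStore implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, bits: int = 256, hashes: int = 3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key: str):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.field |= 1 << pos

    def __contains__(self, key: str) -> bool:
        return all((self.field >> pos) & 1 for pos in self._positions(key))


class AttenuatedBloomFilter:
    """One Bloom filter per hop distance along a single outgoing link."""

    def __init__(self, depth: int):
        self.levels = [BloomFilter() for _ in range(depth)]

    def add(self, key: str, distance: int):
        self.levels[distance].add(key)

    def closest_match(self, key: str):
        """Smallest distance whose filter reports the key, or None."""
        for distance, level in enumerate(self.levels):
            if key in level:
                return distance
        return None


def pick_link(links, key):
    """Choose the outgoing link whose filter reports the key at the smallest distance."""
    hits = {name: f.closest_match(key) for name, f in links.items()}
    hits = {name: d for name, d in hits.items() if d is not None}
    return min(hits, key=hits.get) if hits else None
```

Bloom filters can report false positives, so a hit is only a hint: a query that follows a link and finds nothing would fall back to the underlying mesh routing.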

Page 19: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Introspection: The New Architectural Creed

• Using Moore’s law gains for something other than performance
• Examples:
  – Online algorithmic validation
  – Model building for data rearrangement
    • Availability
    • Better prefetching
  – Extreme Durability (1000-year time scale?)
    • Use of erasure coding and continuous repair
  – Stability through Statistics
    • Use of redundancy to gain more predictable behavior
    • Systems version of Thermodynamics!
  – Continuous Dynamic Optimization of other sorts

[Figure: diagram labeled Monitor, Compute, Adapt]

Page 20: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Status: Introspection

• Development of OIL framework for introspection: this framework is operational
  – Collection facilities can observe all events in the system
  – Multiple aggregation models available
• Example 1: Clustering for prefetching
  – Currently builds Hidden Markov model of access patterns utilizing OIL framework
  – Almost there:
    • Use models to better prefetch objects
    • Placement of replicas assisted by bloom filters (almost)
• Example 2: Observation of network behavior
  – Framework for observation of network latencies
  – Adaptation of network topology: almost there
• Example 3: Grammar building for prefetching
  – Experiment of introspection at processor level
  – Talk later today about this (Mark Whitney)
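To show how an access model can drive prefetching, here is a deliberately simpler stand-in: a first-order Markov chain that counts which object tends to follow which. It is not the hidden Markov model mentioned above and does not use the OIL framework; the `AccessModel` class and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class AccessModel:
    """First-order Markov chain over object accesses (simpler than a hidden Markov model)."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # guid -> counts of the next guid
        self.previous = None

    def observe(self, guid: str):
        """Record one access event, as a monitoring stage might report it."""
        if self.previous is not None:
            self.transitions[self.previous][guid] += 1
        self.previous = guid

    def predict(self, guid: str):
        """Return the object most often accessed right after `guid`, or None."""
        followers = self.transitions.get(guid)
        if not followers:
            return None
        return followers.most_common(1)[0][0]
```

After observing the sequence a, b, a, b, a, c, the model predicts b as the best prefetch candidate following an access to a, since that transition was seen twice and the transition to c only once.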

Page 21: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Status: Medium Scale Test and Emulation

• Two medium clusters from IBM SUR Grant
  – Each cluster 21 servers:
    • Each with two 1 GHz processors
    • One GByte of RAM, 73 GB of Disk
  – 1 GB Switch per cluster
  – MIRNET switch
• Plan to have continuous OceanStore components running in approximately 1 month
• Emulation technology: currently works
  – Able to simulate large-scale network by simulating network latencies
  – Multiple OceanStore nodes emulated per node
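Emulating a wide-area network by injecting latencies can be sketched with a small virtual-time event queue, where many virtual nodes share one process. The `EmulatedNetwork` class and the latency table are hypothetical and only illustrate the idea; the actual emulation layer is not described in the slides.

```python
import heapq


class EmulatedNetwork:
    """Delivers messages between virtual nodes after a configured latency."""

    def __init__(self, latency_ms):
        self.latency_ms = latency_ms  # (src, dst) -> one-way delay in milliseconds
        self.now = 0.0                # current virtual time
        self.pending = []             # heap of (deliver_time, seq, dst, message)
        self.seq = 0                  # tie-breaker so equal times stay in send order

    def send(self, src, dst, message):
        deliver_at = self.now + self.latency_ms[(src, dst)]
        heapq.heappush(self.pending, (deliver_at, self.seq, dst, message))
        self.seq += 1

    def run(self):
        """Advance virtual time, returning (time, dst, message) in delivery order."""
        delivered = []
        while self.pending:
            self.now, _, dst, message = heapq.heappop(self.pending)
            delivered.append((self.now, dst, message))
        return delivered
```

Because time is virtual, a message over a link configured with 80 ms of delay is delivered after one configured with 5 ms regardless of send order, without the test actually waiting.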

Page 22: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Reality: Web Caching through OceanStore

Page 23: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

Day Dreams? (Becoming real)

• NFS File system built in OceanStore (Exists)
  – Still have to integrate ACLs
  – Update to latest prototype
• Windows Installable File system (Planning)
  – “USB Keys” hold cryptographic keys and personal identity
  – Automatic downloading and verification of filesystem
• IMAP OceanStore gateway (Planning)
• Lotus Notes Domino Server
  – Exploring use of work flow on top of OceanStore

Page 24: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

OceanStore Conclusions

• OceanStore: everyone’s data, one big utility
  – Global Utility model for persistent data storage
• Very Soon: Working OceanStore cluster!!!!
  – Event-driven programming in Java
  – You will hear about components today and tomorrow
• OceanStore assumptions:
  – Untrusted infrastructure with a responsible party
  – Mostly connected with conflict resolution
  – Continuous on-line optimization

Page 25: OceanStore Status and Directions ROC/OceanStore Retreat 1/16/01

For more info:

• OceanStore vision paper for ASPLOS 2000: “OceanStore: An Architecture for Global-Scale Persistent Storage”
• OceanStore paper on Maintenance (IEEE IC): “Maintenance-Free Global Data Storage”
• Both available on OceanStore web site: http://oceanstore.cs.berkeley.edu/