OceanStore: Status and Directions
ROC/OceanStore Retreat 1/16/01
John Kubiatowicz
University of California at Berkeley
OceanStore:2ROC/OceanStore Jan’01
Questions about ubiquitous information:
• Where is persistent information stored?
  – Want: Geographic independence for availability, durability, and freedom to adapt to circumstances
• How is it protected?
  – Want: Encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity
• Can we make it indestructible?
  – Want: Redundancy with continuous repair and redistribution for long-term durability
• Is it hard to manage?
  – Want: Automatic optimization, diagnosis and repair
Everyone’s Data, One Utility
• Millions of servers, billions of clients…
• 1000-YEAR durability (excepting fall of society)
• Maintains Privacy, Access Control, Authenticity
• Incrementally Scalable (“Evolvable”)
• Self Maintaining!
• Not quite peer-to-peer:
  – Utilizing servers in infrastructure
  – Some computational nodes more equal than others
Want Automatic Maintenance
• Can’t possibly manage billions of servers by hand!
• System should:
  – Be fault-tolerant (high MTTF)
  – Repair itself (low MTTR through adaptation)
  – Incorporate new elements
• Can we guarantee data is available for 1000 years?
  – New servers added from time to time
  – Old servers removed from time to time
  – Everything just works
• Many components with geographic separation
  – System not disabled by natural disasters
  – Can adapt to changes in demand and regional outages
  – Gain in stability through statistics
OceanStore Assumptions
• Untrusted Infrastructure:
  – The OceanStore is comprised of untrusted components
  – Only ciphertext within the infrastructure
• Responsible Party:
  – Some organization (i.e. service provider) guarantees that your data is consistent and durable
  – Not trusted with content of data, merely its integrity
• Mostly Well-Connected:
  – Data producers and consumers are connected to a high-bandwidth network most of the time
  – Exploit multicast for quicker consistency when possible
• Promiscuous Caching:
  – Data may be cached anywhere, anytime
This Talk: making it real!
(Or: you will hear reality from my students)
The Path of an OceanStore Update
[Figure labels: Clients, Inner-Ring Servers, Multicast trees, Second-Tier Caches]
Important Components:
• Data Object: (Distribution-enabled data format)
  – Must support copy-on-write and versioning efficiently
  – Must allow sparse population of data in caches
  – Must smoothly interface with archive
• Inner Ring: (Byzantine Agreement)
  – Check write access control
  – Choose serialization of updates/resolve micro-conflicts
  – Sign result with Threshold Signature
  – Erasure code result and send fragments
• Second Tier Server: (Promiscuous Caches)
  – Serve local clients
  – Tie itself into dissemination tree
  – Apply updates that it receives through tree
  – Decision point for caching policies: tentative vs committed
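The inner ring's last step, erasure coding a result and sending out fragments, can be illustrated with a toy code. This is a minimal sketch only: it uses a single XOR parity fragment, which is a stand-in chosen for brevity and not the erasure code used in OceanStore. The function names `encode` and `decode` are hypothetical.

```python
def encode(data: bytes, k: int):
    """Split data into k equal fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(size * k, b"\0")      # pad so fragments are equal length
    frags = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for f in frags:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return frags + [parity]


def decode(frags, length):
    """Rebuild the data when at most one fragment (marked None) is missing."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    if missing:
        size = len(next(f for f in frags if f is not None))
        rebuilt = bytes(size)
        # XOR of all surviving fragments equals the missing one.
        for f in frags:
            if f is not None:
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, f))
        frags = list(frags)
        frags[missing[0]] = rebuilt
    return b"".join(frags[:-1])[:length]      # drop parity and padding


data = b"OceanStore update"
fragments = encode(data, 4)
damaged = list(fragments)
damaged[2] = None                             # one storage server is lost
recovered = decode(damaged, len(data))
```

A production code would tolerate many lost fragments rather than one; the point here is only that any sufficiently large subset of fragments reconstructs the original.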
Implementation Framework
[Figure: implementation stack. Labels: Consistency, Location & Routing, Archival, Introspection Modules, Dispatch, Thread Scheduler, Asynchronous Disk, Asynchronous Network, Java Virtual Machine, Operating System, Network]
• Event-driven Implementation Model in Java
  – Divided into a sequence of communicating “stages”
  – Communication between stages in the form of “snoopable” messages
  – > 100,000 lines of Java, Comments, Test scripts
  – Substantially functioning!
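The stage model above can be sketched in a few lines. This is an illustration in Python, not the Java prototype: the `Dispatcher` class, its method names, and the message types "update" and "committed" are all hypothetical. Stages subscribe to message types, and a snooper sees every message without being addressed.

```python
from collections import defaultdict, deque


class Dispatcher:
    """Hypothetical single-threaded dispatcher connecting stages."""

    def __init__(self):
        self.handlers = defaultdict(list)   # message type -> stage handlers
        self.snoopers = []                  # observers that see every message
        self.queue = deque()

    def subscribe(self, msg_type, handler):
        self.handlers[msg_type].append(handler)

    def snoop(self, observer):
        self.snoopers.append(observer)

    def post(self, msg_type, payload):
        self.queue.append((msg_type, payload))

    def run(self):
        # Drain the queue; handlers may post follow-up messages.
        while self.queue:
            msg_type, payload = self.queue.popleft()
            for observer in self.snoopers:
                observer(msg_type, payload)
            for handler in self.handlers[msg_type]:
                handler(self, payload)


seen, archived = [], []
d = Dispatcher()
d.snoop(lambda msg_type, payload: seen.append(msg_type))            # e.g. introspection
d.subscribe("update", lambda disp, p: disp.post("committed", p))    # consistency stage
d.subscribe("committed", lambda disp, p: archived.append(p))        # archival stage
d.post("update", b"new data")
d.run()
```

Because every message passes through the dispatcher, an observing module can monitor all traffic without the stages knowing about it.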
GUIDs for Naming
• Unique, location independent identifiers:
  – Every version of every unique entity has a permanent Version-GUID (or VGUID):
    • Hash over content
    • Versioning supports time-travel
  – Each object has a permanent (version-independent) Archival-GUID (or AGUID)
  – Signed associations between AGUIDs and latest VGUIDs are produced by inner ring (called Heartbeats)
• Naming hierarchy:
  – Users map from names to AGUIDs via hierarchy of OceanStore objects
  – Each link is an AGUID
[Figure: naming hierarchy with nodes Foo, Bar, Baz, and Myfile, entered through an out-of-band “Root link”]
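A minimal sketch of the naming scheme, under several assumptions: SHA-256 stands in for the unspecified hash, AGUIDs are arbitrary strings, heartbeats are an unsigned dictionary, and directories are JSON objects mapping names to AGUIDs. The `Store` class and `resolve` function are hypothetical names.

```python
import hashlib
import json


def vguid(content: bytes) -> str:
    """Version GUID: a hash over the content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


class Store:
    def __init__(self):
        self.versions = {}    # VGUID -> immutable content
        self.heartbeat = {}   # AGUID -> latest VGUID (signature omitted here)

    def write(self, aguid, content):
        v = vguid(content)
        self.versions[v] = content
        self.heartbeat[aguid] = v     # the inner ring would sign this association
        return v

    def read_latest(self, aguid):
        return self.versions[self.heartbeat[aguid]]


def resolve(store, root_aguid, path):
    """Walk directory objects from the root; each link is an AGUID."""
    aguid = root_aguid
    for name in path:
        directory = json.loads(store.read_latest(aguid))
        aguid = directory[name]
    return aguid


store = Store()
store.write("aguid-myfile", b"hello")
store.write("aguid-bar", json.dumps({"Myfile": "aguid-myfile"}).encode())
store.write("aguid-root", json.dumps({"Bar": "aguid-bar"}).encode())

target = resolve(store, "aguid-root", ["Bar", "Myfile"])
old_version = store.heartbeat[target]
store.write(target, b"hello again")   # new version; the old VGUID stays readable
```

Since links hold AGUIDs and not VGUIDs, writing a new version of a file does not require rewriting its parent directories.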
Data Object Structure
All about flexibility and validation
[Figure: a log object with check points (V6, V11) and log entries V7 to V10; each version is a data B-tree of indirect blocks and blocks d1 to d9, with d'8 and d'9 appearing in the later version; blocks are the unit of coding; encoded fragments, each with a verification tree, are the unit of archival storage; pointers are labeled by GUID (GUID of d1, GUID of d'8)]
Status: Data Object Development
• Second-Tier Replica support: functional
  – Second-tier caches can hold multiple versions
  – Tie themselves into multicast trees
    • Several dissemination tree algorithms explored
    • Updates forwarded from inner ring through trees
• Complete B-Tree object structure developed
  – Data blocks named with unforgeable hashes
    • Hashes can point to archival fragments/live blocks
  – Supports copy on write
  – Top block defines complete version
    • Missing blocks filled in from archive or other replicas
• Update commits with distributed threshold signatures
  – Byzantine commitment not quite integrated into prototype
• Traffic generator for testing
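The idea of hash-named blocks with copy on write can be sketched as follows. This is a simplification under stated assumptions: a single flat index block replaces the B-tree, SHA-256 stands in for the unspecified hash, and the names `put`, `make_version`, `read_version`, and `update` are hypothetical.

```python
import hashlib

blocks = {}   # GUID -> block bytes (stands in for caches and the archive)


def put(data: bytes) -> str:
    """Store a block under the hash of its content and return that GUID."""
    guid = hashlib.sha256(data).hexdigest()
    blocks[guid] = data
    return guid


def make_version(chunks):
    """Store the data blocks, then a top block listing their GUIDs."""
    guids = [put(c) for c in chunks]
    return put("\n".join(guids).encode())


def read_version(top_guid):
    """Fetch every block named by the top block and verify each against its name."""
    out = []
    for guid in blocks[top_guid].decode().split("\n"):
        data = blocks[guid]
        if hashlib.sha256(data).hexdigest() != guid:
            raise ValueError("block failed verification")
        out.append(data)
    return out


def update(top_guid, index, new_chunk):
    """Copy on write: only the changed block and the top block are new."""
    guids = blocks[top_guid].decode().split("\n")
    guids[index] = put(new_chunk)
    return put("\n".join(guids).encode())


v1 = make_version([b"d1", b"d2", b"d3"])
v2 = update(v1, 1, b"d2-prime")
```

Both top blocks remain valid, so the old version is still readable, and the two versions share the unchanged blocks.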
Exploiting Law of Large Numbers for Durability
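A rough way to see the statistical argument: if an object is split into n fragments, any m of which suffice to rebuild it, the object is lost only when fewer than m fragments survive. The sketch below assumes independent fragment failures with a common probability; the failure probability 0.1 and the fragment counts are illustrative values, not figures from the talk, and `p_object_lost` is a hypothetical name.

```python
from math import comb


def p_object_lost(n, m, p_fail):
    """Probability that fewer than m of n fragments survive (independent failures)."""
    p_ok = 1 - p_fail
    return sum(comb(n, k) * p_ok**k * p_fail**(n - k) for k in range(m))


p_fail = 0.1                                    # illustrative per-fragment loss probability
replication = p_object_lost(2, 1, p_fail)       # two full copies, 2x storage
erasure = p_object_lost(32, 16, p_fail)         # 32 fragments, any 16 rebuild, 2x storage
```

At the same storage overhead, spreading the data over many fragments makes loss far less likely than keeping two whole copies, provided failures really are independent.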
The Dissemination Process
[Figure: dissemination pipeline. Components: Introspection, Human Input, Network Monitoring, Model Builder, Set Creator, Disseminator. Edge labels: model, set, probe, type, fragments]
Achieving Low MTTR: Global Heartbeats
• Trigger repair when level of redundancy is too low
• Continuous sweep (slowly over time)
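One pass of such a sweep might look like the sketch below. It assumes one fragment per server, that any m fragments can regenerate the rest, and that new holders are chosen arbitrarily (here, in sorted order); the function name `sweep` and its parameters are hypothetical.

```python
def sweep(placement, alive, m, threshold, n):
    """One sweep over all objects.

    placement: object name -> set of servers holding one fragment each
    alive:     set of servers currently responding
    m:         fragments needed to reconstruct an object
    threshold: repair when live fragments drop below this count
    n:         target number of fragments after repair
    """
    repaired, lost = [], []
    for obj, holders in placement.items():
        live = holders & alive
        if len(live) < m:
            lost.append(obj)                      # too few fragments to rebuild
        elif len(live) < threshold:
            spares = sorted(alive - live)         # arbitrary choice of new holders
            live |= set(spares[: n - len(live)])
            repaired.append(obj)
        placement[obj] = live
    return repaired, lost


placement = {
    "a": {1, 2, 3, 4},
    "b": {1, 5, 6, 7},
    "c": {7, 8, 11, 12},
}
alive = {1, 2, 5, 6, 9, 10}
repaired, lost = sweep(placement, alive, m=2, threshold=3, n=4)
```

Setting the threshold well above m is what keeps repair time short relative to the rate at which fragments disappear.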
Status: Archival Infrastructure
• Archival Fragments generated by Inner Ring
  – Multi-stage-based implementation at inner ring
  – Storage servers hold fragments
  – Caching servers (2nd-tier replicas) hold data objects
• Independence Analysis (mostly there)
  – Node discovery technique exists
  – Analysis of long-running reliability data
  – Dissemination-set creator: initial versions
• Storage servers (Naïve but functional):
  – Initial implementation: cache + object store
  – Ongoing tuning efforts
  – Redesign in the works
Location Independent Routing
• Paradigm: Routing
  – Route messages to objects by GUID regardless of location
• Fast, probabilistic search for “routing cache”:
  – Built from attenuated Bloom filters
  – Approximation to gradient search
• Redundant Plaxton Mesh used for underlying routing infrastructure:
  – Randomized data structure with locality properties
  – Redundant, insensitive to faults, and repairable
  – Amenable to continuous adaptation to adjust for:
    • Changing network behavior
    • Faulty servers
    • Denial of service attacks
• Tomorrow: 3 talks on Routing
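An attenuated Bloom filter can be sketched as an array of ordinary Bloom filters kept per outgoing link, where level d summarizes the objects reachable d hops away through that link. The filter size, hash construction, depth, and all names below (`Bloom`, `AttenuatedBloom`, `next_hop`) are assumptions made for illustration.

```python
import hashlib


class Bloom:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.field |= 1 << pos

    def __contains__(self, key):
        return all((self.field >> pos) & 1 for pos in self._positions(key))


class AttenuatedBloom:
    """One per outgoing link: levels[d - 1] summarizes objects d hops away."""

    def __init__(self, depth=3):
        self.levels = [Bloom() for _ in range(depth)]

    def advertise(self, key, hops):
        self.levels[hops - 1].add(key)

    def distance(self, key):
        """Smallest hop count at which the key may be present, or None."""
        for d, level in enumerate(self.levels, start=1):
            if key in level:
                return d
        return None


def next_hop(links, key):
    """Forward toward the neighbor that claims the key at the smallest distance."""
    best = None
    for neighbor, abf in links.items():
        d = abf.distance(key)
        if d is not None and (best is None or d < best[1]):
            best = (neighbor, d)
    return best[0] if best else None


links = {"north": AttenuatedBloom(), "south": AttenuatedBloom()}
links["north"].advertise("guid-42", hops=3)
links["south"].advertise("guid-42", hops=1)
choice = next_hop(links, "guid-42")
```

Following the smallest claimed distance at each hop is what approximates a gradient search. Bloom filters can report false positives, so a miss or a dead end would fall back to the underlying mesh.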
Status: Location Independent Routing
• Basic Tapestry infrastructure is operational
  – Single-path static routing: works
  – Multi-path adaptive routing: mostly there
  – Dynamic integration of new nodes: implemented
• Network adaptation almost there (Patchwork)
  – Framework for measurement of network properties
  – Periodic beacons measure loss and network latency
• Exploitation of differences in nodes:
  – Brocade backbone supplement to Tapestry: improves routing
  – Differentiation in service experiments ongoing
• Theoretical results on Tapestry
  – Construction/analysis of dynamic integration algorithms
  – Voluntary/involuntary node deletion algorithms
  – View of Tapestry as data structure for solving nearest neighbor
• Attenuated Bloom Filters are operational
  – Implemented and functional
  – Optimizes short-distance routing infrastructure!
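Single-path static routing in a Plaxton-style mesh resolves the destination ID one digit per hop. The sketch below is a simplification under stated assumptions: it matches digits as a prefix, builds tables from global knowledge of the node list, ignores network locality when picking among candidates, and routes only to IDs of existing nodes. The node IDs and function names are hypothetical.

```python
def shared_prefix(a, b):
    """Number of leading digits two equal-length IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_table(node, all_nodes):
    """table[(level, digit)] -> some node sharing `level` digits with `node`
    and having `digit` as its next digit."""
    table = {}
    for other in sorted(all_nodes):
        if other == node:
            continue
        level = shared_prefix(node, other)
        table.setdefault((level, other[level]), other)   # first candidate wins
    return table


def route(src, dest, tables):
    """Each hop fixes at least one more digit of the destination ID."""
    path = [src]
    current = src
    while current != dest:
        level = shared_prefix(current, dest)
        current = tables[current][(level, dest[level])]
        path.append(current)
    return path


nodes = ["1D76", "27AB", "4227", "4228", "42A2", "43FE"]
tables = {n: build_table(n, nodes) for n in nodes}
path = route("27AB", "43FE", tables)
```

The hop count is bounded by the number of digits in an ID. A real mesh would keep several candidates per table entry, ordered by network distance, which is where the redundancy and locality properties come from.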
Introspection: The New Architectural Creed
• Using Moore’s law gains for something other than performance
• Examples:
  – Online algorithmic validation
  – Model building for data rearrangement
    • Availability
    • Better prefetching
  – Extreme Durability (1000-year time scale?)
    • Use of erasure coding and continuous repair
  – Stability through Statistics
    • Use of redundancy to gain more predictable behavior
    • Systems version of Thermodynamics!
  – Continuous Dynamic Optimization of other sorts
[Figure: cycle of Compute, Adapt, Monitor]
Status: Introspection
• Development of OIL framework for introspection: this framework is operational
  – Collection facilities can observe all events in the system
  – Multiple aggregation models available
• Example 1: Clustering for prefetching
  – Currently builds Hidden Markov model of access patterns utilizing OIL framework
  – Almost there:
    • Use models to better prefetch objects
    • Placement of replicas assisted by Bloom filters (almost)
• Example 2: Observation of network behavior
  – Framework for observation of network latencies
  – Adaptation of network topology: almost there
• Example 3: Grammar building for prefetching
  – Experiment of introspection at processor level
  – Talk later today about this (Mark Whitney)
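To illustrate how an access model can drive prefetching, the sketch below uses a plain first-order Markov chain over observed accesses. This is a deliberately simpler technique than the Hidden Markov model mentioned above, and it does not use the OIL framework; the `AccessModel` class and the trace are hypothetical.

```python
from collections import Counter, defaultdict


class AccessModel:
    """Counts which object tends to be accessed right after which."""

    def __init__(self):
        self.transitions = defaultdict(Counter)   # previous GUID -> next GUID counts
        self.last = None

    def observe(self, guid):
        if self.last is not None:
            self.transitions[self.last][guid] += 1
        self.last = guid

    def predict(self, guid, k=1):
        """Return up to k GUIDs most often seen immediately after `guid`."""
        counts = self.transitions.get(guid, Counter())
        return [g for g, _ in counts.most_common(k)]


model = AccessModel()
for guid in ["a", "b", "c", "a", "b", "d", "a", "b", "c"]:
    model.observe(guid)

prefetch_after_b = model.predict("b")
```

A cache that has just served "b" would prefetch the predicted object; the monitoring side only needs to feed every observed access into `observe`.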
Status: Medium Scale Test and Emulation
• Two medium clusters from IBM SUR Grant
  – Each cluster 21 servers:
    • Each with two 1 GHz processors
    • One GByte of RAM, 73 GB of Disk
  – 1 GB Switch per cluster
  – MIRNET switch
• Plan to have continuous OceanStore components running in approximately 1 month
• Emulation technology: currently works
  – Able to simulate large-scale network by simulating network latencies
  – Multiple OceanStore nodes emulated/node
Reality: Web Caching through OceanStore
Day Dreams? (Becoming real)
• NFS File system built in OceanStore (Exists)
  – Still have to integrate ACLs
  – Update to latest prototype
• Windows Installable File system (Planning)
  – “USB Keys” hold cryptographic keys and personal identity
  – Automatic downloading and verification of filesystem
• IMAP OceanStore gateway (Planning)
• Lotus Notes Domino Server
  – Exploring use of work flow on top of OceanStore
OceanStore Conclusions
• OceanStore: everyone’s data, one big utility
  – Global Utility model for persistent data storage
• Very Soon: Working OceanStore cluster!!!!
  – Event-driven programming in Java
  – You will hear about components today and tomorrow
• OceanStore assumptions:
  – Untrusted infrastructure with a responsible party
  – Mostly connected with conflict resolution
  – Continuous on-line optimization
For more info:
• OceanStore vision paper for ASPLOS 2000: “OceanStore: An Architecture for Global-Scale Persistent Storage”
• OceanStore paper on Maintenance (IEEE IC): “Maintenance-Free Global Data Storage”
• Both available on OceanStore web site: http://oceanstore.cs.berkeley.edu/