Societal Scale Information Systems
Jim Demmel, Chief Scientist, EECS and Math Depts.
www.citris.berkeley.edu
UC Santa Cruz
Outline
• Scope
• Problems to be solved
• Overview of projects
Societal-Scale Information System (SIS)
[Diagram: MEMS sensors and information appliances ("clients") connect over Gigabit Ethernet to clusters and a massive cluster ("servers") providing scalable, reliable, secure services]
Desirable SIS Features - Problems to Solve
• Integrates diverse components seamlessly
• Easy to build new services from existing ones
• Adapts to interfaces/users
• Non-stop, always connected
• Secure
Projects
• Sahara: Service Architecture for Heterogeneous Access, Resources, and Applications (Katz, Joseph, Stoica)
• OceanStore and Tapestry (Kubiatowicz, Joseph)
• ROC and iStore: Recovery Oriented Computing and Intelligent Store (Patterson, Yelick, Fox (Stanford))
• Millennium and PlanetLab (Culler, Kubiatowicz, Stoica, Shenker)
Many faculty away on retreats:
  www.cs.berkeley.edu/~bmiller/saharaRetreat.htm
  www.cs.berkeley.edu/~bmiller/ROCretreat.htm
The “Sahara” Project
Service Architecture for Heterogeneous Access, Resources, and Applications
www.cs.berkeley.edu/~bmiller/saharaRetreat.html
Scenario: Service Composition
[Diagram: a user in Salt Lake City, connected via Sprint, reaches a JAL Restaurant Guide Service in Tokyo via NTT DoCoMo; the composed service combines a Zagat Guide, a Babblefish translator, and a UI]
Sahara Research FocusSahara Research Focus
New mechanisms, techniques for end-to-end services w/ New mechanisms, techniques for end-to-end services w/ desirable, predictable, enforceable properties desirable, predictable, enforceable properties spanning spanning potentially distrusting service providerspotentially distrusting service providers Tech architecture for service composition & inter-operation across Tech architecture for service composition & inter-operation across
separate admin domains, supporting peering & brokering, and diverse separate admin domains, supporting peering & brokering, and diverse business, value-exchange, access-control modelsbusiness, value-exchange, access-control models
Functional elementsFunctional elements Service discovery Service-level agreements Service composition under constraints Redirection to a service instance Performance measurement infrastructure Constraints based on performance, access control,
accounting/billing/settlements Service modeling and verification
Technical Challenges
• Trust management and behavior verification
  – Meet promised functionality, performance, availability
• Adapting to network dynamics
  – Actively respond to shifting server-side workloads and network congestion, based on pervasive monitoring & measurement
  – Awareness of network topology to drive service selection
• Adapting to user dynamics
  – Resource allocation responsive to client-side workload variations
• Resource provisioning and management
  – Service allocation and service placement
• Interoperability across multiple service providers
  – Interworking across similar services deployed by different providers
Service Composition Models
• Cooperative: individual component service providers interact in distributed fashion, with distributed responsibility, to provide an end-to-end composed service
• Brokered: a single provider, the Broker, uses functionality provided by underlying service providers and encapsulates it to compose an end-to-end service
Examples
• Cooperative: roaming among separate mobile networks
• Brokered: JAL restaurant guide
Mechanisms for Service Composition (1)
Measurement-based Adaptation. Examples:
• Host distance monitoring and estimation service
• Universal In-box: exchange network and server load
• Content Distribution Networks: redirect client to closest service instance
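The CDN example above can be sketched in a few lines: probe each candidate service instance, then redirect the client to the one with the lowest estimated latency. This is an illustrative sketch, not Sahara code; the replica names and RTT figures are hypothetical.

```python
# Minimal sketch of measurement-based redirection: pick the service
# instance with the lowest estimated round-trip time.

def pick_closest(rtt_estimates):
    """Return the instance name with the smallest estimated RTT (ms)."""
    return min(rtt_estimates, key=rtt_estimates.get)

# Hypothetical probe results for three CDN replicas (milliseconds).
rtts = {"replica-tokyo": 180.0, "replica-sfo": 22.5, "replica-nyc": 71.0}
print(pick_closest(rtts))  # -> replica-sfo
```

In a real deployment the RTT map would come from a host-distance monitoring service like the one listed above, refreshed continuously rather than measured once.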
Mechanisms for Service Composition (2)
Utility-based Resource Allocation Mechanisms. Examples:
• Auctions to dynamically allocate resources; applied to spectrum/bandwidth resource assignments
• Congestion pricing (same idea for power)
• Voice port allocation to user-initiated calls in H.323 gateway / Voice over IP service management
• Wireless LAN bandwidth allocation and management
• H.323 gateway selection, redirection, and load balancing for Voice over IP services
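As a toy illustration of auction-based allocation (not the project's actual mechanism; the bidders and bids are made up), a sealed-bid second-price auction awards a resource slot, say a bandwidth reservation, to the highest bidder at the second-highest price:

```python
# Minimal second-price (Vickrey) auction sketch for allocating one
# scarce resource slot among competing users.

def vickrey_auction(bids):
    """Return (winner, price): highest bidder wins, pays 2nd-highest bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

# Hypothetical bids (dollars) for one bandwidth slot.
winner, price = vickrey_auction({"alice": 10.0, "bob": 7.5, "carol": 4.0})
print(winner, price)  # -> alice 7.5
```

The second-price rule makes truthful bidding a dominant strategy, which is one reason auctions are attractive for dynamic spectrum/bandwidth assignment.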
Mechanisms for Service Composition (3)
Trust Management / Verification of Service & Usage
• Authentication, Authorization, Accounting (AAA) services
  – Credential transformations to enable cross-domain service invocation
  – Federated administrative domains with credential-transformation rules based on established agreements
  – AAA server makes authorization decisions
• Service Level Agreement verification
  – Verification and usage monitoring to ensure properties specified in the SLA are being honored
  – Border routers monitor control traffic from different providers to detect malicious route advertisements
OceanStore: Global-Scale Persistent Storage
OceanStore Context: Ubiquitous Computing
• Computing everywhere: desktop, laptop, palmtop; cars, cellphones; shoes? clothing? walls?
• Connectivity everywhere: rapid growth of bandwidth in the interior of the net; broadband to the home and office; wireless technologies such as CDMA, satellite, laser
Questions about information:
• Where is persistent information stored?
  – Want: geographic independence for availability, durability, and freedom to adapt to circumstances
• How is it protected?
  – Want: encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity
• Can we make it indestructible?
  – Want: redundancy with continuous repair and redistribution for long-term durability
• Is it hard to manage?
  – Want: automatic optimization, diagnosis, and repair
• Who owns the aggregate resources?
  – Want: utility infrastructure!
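The redundancy argument above is easy to quantify with a back-of-envelope sketch (the failure probability and replica counts below are illustrative, not from the talk): if n replicas fail independently with probability p per repair interval, data is lost only when all n fail, so continuous repair that restores the replica count after each interval keeps the per-interval loss probability at p^n.

```python
# Back-of-envelope durability sketch: redundancy plus continuous
# repair turns unreliable components into a durable store.

def loss_probability(p, n):
    """P(all n replicas fail in one repair interval), assuming
    independent failures with per-replica probability p."""
    return p ** n

# Illustrative numbers: even p = 0.1 per interval gives rapidly
# shrinking loss probability as the replica count n grows.
for n in (1, 2, 4):
    print(n, loss_probability(0.1, n))
```

Real systems use erasure codes (e.g., the Reed-Solomon fragments mentioned later) rather than whole replicas, which buys the same durability at much lower storage overhead.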
Utility-based Infrastructure
[Diagram: a federation of providers (Pac Bell, Sprint, IBM, AT&T, a Canadian OceanStore) jointly hosting the store]
• Transparent data service provided by a federation of companies
• Monthly fee paid to one service provider
• Companies buy and sell capacity from each other
OceanStore Assumptions
• Untrusted infrastructure
  – The OceanStore is comprised of untrusted components
  – Only ciphertext within the infrastructure
• Responsible party
  – Some organization (i.e., a service provider) guarantees that your data is consistent and durable
  – Not trusted with the content of data, merely its integrity
• Mostly well-connected
  – Data producers and consumers are connected to a high-bandwidth network most of the time
  – Exploit multicast for quicker consistency when possible
• Promiscuous caching
  – Data may be cached anywhere, anytime
• Optimistic concurrency via conflict resolution
  – Avoid locking in the wide area
  – Applications use an object-based interface for updates
First Implementation [Java]
• Event-driven state-machine model
• Included components
  – Initial floating-replica design: conflict resolution and Byzantine agreement
  – Routing facility (Tapestry): Bloom filter location algorithm; Plaxton-based locate-and-route data structures
  – Introspective gathering of tacit info and adaptation: language for introspective handler construction; clustering, prefetching, adaptation of network routing
  – Initial archival facilities: interleaved Reed-Solomon codes for fragmentation; methods for signing and validating fragments
• Target applications
  – Unix file-system interface under Linux (“legacy apps”)
  – Email application, proxy for web caches, streaming multimedia applications
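The Tapestry component above names a Bloom filter location algorithm. A Bloom filter answers set-membership queries with no false negatives and a small, tunable false-positive rate, which lets a node cheaply ask "might this neighbor hold the object?" before routing. The following is a minimal Python sketch of the data structure itself; the actual implementation was in Java and integrated with Plaxton routing, and the sizes and object names here are hypothetical.

```python
# Minimal Bloom filter sketch: k hash positions per item set bits in
# an m-bit array; membership checks all k bits.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # a Python int used as an m-bit array

    def _positions(self, item):
        # Derive k independent positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("object-1234")
print("object-1234" in bf)  # -> True
```

Because a filter can be unioned and shipped around as a compact bit array, neighbors can summarize "what I (probably) store" far more cheaply than exchanging full object lists.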
OceanStore Conclusions
• OceanStore: everyone’s data, one big utility
• Global utility model for persistent data storage
• OceanStore assumptions
  – Untrusted infrastructure with a responsible party
  – Mostly connected, with conflict resolution
  – Continuous on-line optimization
• OceanStore properties
  – Provides security, privacy, and integrity
  – Provides extreme durability
  – Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis and repair
  – Large-scale system has good statistical properties
OceanStore Prototype: running with 5 other sites worldwide
Recovery-Oriented Computing Philosophy
• People/HW/SW failures are facts to cope with, not problems to solve (“Peres’s Law”)
• Improving recovery/repair improves availability
  – Unavailability ≈ MTTR / MTTF (assuming MTTR << MTTF)
  – 1/10th the MTTR is just as valuable as 10X the MTBF
• Recovery/repair is how we cope with the above facts
• Since a major system-administration job is recovery after failure, ROC also helps with maintenance/TCO; Total Cost of Ownership is 5-10X the HW/SW cost
• www.cs.berkeley.edu/~bmiller/ROCretreat.htm
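The MTTR arithmetic on this slide can be checked directly (the hour figures below are illustrative, not from the talk): since unavailability = MTTR / (MTTR + MTTF), dividing MTTR by 10 yields exactly the same availability as multiplying MTTF by 10.

```python
# Sketch of the ROC availability arithmetic: faster repair is worth
# as much as longer time-to-failure.

def unavailability(mttr_hours, mttf_hours):
    """Fraction of time the system is down."""
    return mttr_hours / (mttr_hours + mttf_hours)

base = unavailability(1.0, 1000.0)            # 1 h to repair, ~42 days between failures
faster_repair = unavailability(0.1, 1000.0)   # 10x faster recovery
longer_uptime = unavailability(1.0, 10000.0)  # 10x longer MTTF
print(base, faster_repair, longer_uptime)
```

Both improved cases come out identical (1/10001), which is the slide's point: investing in recovery speed is as effective as investing in failure avoidance.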
ROC approach
1. Collect data to see why services fail
   • Operators cause > 50% of failures
2. Create benchmarks to measure dependability
   • Benchmarks inspire and enable researchers, and name names to spur commercial improvements
3. Margin of safety (from civil engineering)
   • Overprovision to handle the unexpected
4. Create and evaluate techniques to help
   • Undo for system administrators in the field
   • Partitioning to isolate errors and upgrade in the field
   • Fault insertion to test emergency systems in the field
Availability benchmarking 101
• Availability benchmarks quantify system behavior under failures, maintenance, and recovery
• They require
  – A realistic (fault) workload for the system
  – Fault injection to simulate failures
  – Human operators to perform repairs
• The new winner is the fastest to recover, vs. simply the fastest
[Graph: QoS holds at normal behavior (99% conf.) until a failure, is degraded during the repair time, then returns to normal]
Source: A. Brown and D. Patterson, “Towards availability benchmarks: a case study of software RAID systems,” Proc. USENIX, 18-23 June 2000
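A toy version of such a benchmark run can be sketched as follows (the workload, intervals, and QoS numbers are entirely hypothetical): inject a failure at a known point, serve degraded QoS until repair completes, and report the delivered-QoS timeline that a real availability benchmark would plot.

```python
# Toy availability-benchmark sketch: report delivered QoS (fraction of
# requests served) per interval across an injected failure and repair.

def run_benchmark(intervals=10, fail_at=4, repair_time=3):
    qos = []
    for t in range(intervals):
        degraded = fail_at <= t < fail_at + repair_time
        qos.append(0.5 if degraded else 1.0)  # half service while repairing
    return qos

print(run_benchmark())  # dip from t=4 through t=6, then recovery
```

Shrinking `repair_time` directly shrinks the dip's width, which is exactly what a recovery-oriented "winner" optimizes.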
ISTORE Hardware Techniques for Availability
• Cluster of Storage-Oriented Nodes (SON)
  – Scalable, tolerates partial failures, automatic redundancy
• Heavily instrumented hardware
  – Sensors for temperature, vibration, humidity, power, intrusion
• Independent diagnostic processor on each node
  – Remote control of power; collects environmental data
  – Diagnostic processors connected via an independent network
• On-demand network partitioning/isolation
  – Allows testing and repair of an online system
  – Managed by the diagnostic processor
• Built-in fault-injection capabilities
  – Used for hardware introspection
  – Important for AME benchmarking
ISTORE Software Techniques for Availability
• Reactive introspection: “mining” available system data
• Proactive introspection: isolation + fault insertion => test recovery code
• Semantic redundancy: use of coding and application-specific checkpoints
• Self-scrubbing data structures: check (and repair?) complex distributed structures
• Load adaptation for performance faults: dynamic load balancing for “regular” computations
• Benchmarking: define quantitative evaluations for AME
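As a miniature illustration of the self-scrubbing idea (a hypothetical scheme, not ISTORE's actual code): keep a checksum beside each record, scan periodically, and restore any record whose checksum no longer matches from a replica copy.

```python
# Self-scrubbing sketch: detect silent corruption via checksums and
# repair from a replica during a periodic scan.
import zlib

def checksum(data: bytes) -> int:
    return zlib.crc32(data)

def scrub(store, replica):
    """Verify each record; restore corrupted ones from the replica.
    Returns the list of repaired keys."""
    repaired = []
    for key, (data, crc) in list(store.items()):
        if checksum(data) != crc:
            good = replica[key]
            store[key] = (good, checksum(good))
            repaired.append(key)
    return repaired

replica = {"r1": b"hello", "r2": b"world"}
store = {k: (v, checksum(v)) for k, v in replica.items()}
store["r2"] = (b"w0rld", store["r2"][1])  # simulate silent corruption
print(scrub(store, replica))  # -> ['r2']
```

A production scrubber would verify cryptographic signatures rather than CRCs and repair from erasure-coded fragments, but the scan/check/repair loop is the same shape.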
ISTORE Status
• Hardware
  – All 80 nodes (boards) manufactured
  – PCB backplane: in layout
  – 32-node system running May 2002
• Software
  – 2-node system running; boots OS
  – Diagnostic processor SW and device driver done
  – Network striping done; fault adaptation ongoing
  – Load balancing for performance heterogeneity done
• Benchmarking
  – Availability benchmark example complete
  – Initial maintainability benchmark complete; revised strategy underway
ISTORE Prototype
UCB Clusters
• Millennium Central Cluster
  – 99 Dell 2300/6400/6450 Xeon Dual/Quad: 332 processors
  – Total: 211 GB memory, 3 TB disk
  – Myrinet 2000 + 1000 Mb fiber Ethernet
• OceanStore/ROC cluster, Astro cluster, Math cluster, Cory cluster, more
• CITRIS Cluster 1: 3/2002 deployment (Intel donation)
  – 4 Dell Precision 730 Itanium Duals: 8 processors
  – Total: 8 GB memory, 128 GB disk
  – Myrinet 2000 + 1000 Mb copper Ethernet
• CITRIS Cluster 2: 9/2002 deployment (Intel donation)
  – ~128 Dell McKinley-class Duals: 256 processors
  – Total: ~512 GB memory, ~8 TB disk
  – Myrinet 2000 + 1000 Mb copper Ethernet
• Network expansion needed (and underway): UCB shift from Nortel plus expansion
• Ganglia cluster management in the SuSE distribution; hundreds of companies using it
CITRIS Network Rollout
Millennium Top Users
• 800 users total on the central cluster, many of whom are CITRIS users
• 75 major users for 2/2002: average 65% total CPU utilization
• Independent component analysis: machine-learning algorithms (fbach)
• ns-2, a packet-level network simulator (machi)
• Parallel AI algorithms for controlling a 4-legged robot (ang)
• Image recognition (lwalker): 2 hours on the cluster vs. 2 weeks on local resources
• Network simulations for infrastructure to track moving objects over a wide area (mdenny)
• Analyzing trends in BGP routing tables (sagarwal)
• Boundary extraction and segmentation of natural images (dmartin)
• Optical simulation and high-quality rendering (adamb)
• Titanium: compiler and runtime-system design for high-performance parallel programming languages (bonachea)
• AMANDA: neutrino detection from polar ice core samples (amanda)
http://ganglia.millennium.berkeley.edu
Planet-Lab Motivation
• A new class of services & applications is emerging that spreads over a sizable fraction of the web
  – CDNs as the first examples
  – Peer-to-peer, ...
• Architectural components are beginning to emerge
  – Distributable hash tables to provide scalable translation
  – Distributed storage, caching, instrumentation, mapping, ...
• The next internet will be created as an overlay on the current one
  – as was the last one
  – It will be defined by its services, not its transport: translation, storage, caching, event notification, management
• There is NO vehicle to try out the next n great ideas in this area
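A one-screen sketch of the core idea behind such hash tables is consistent hashing: nodes and keys share one hash space, and each key belongs to the first node clockwise from its hash on a ring. This is illustrative only; real systems like Tapestry use Plaxton-style prefix routing rather than a simple ring, and the node names below are made up.

```python
# Consistent-hashing sketch: map keys to nodes via positions on a
# shared hash ring, so adding/removing a node moves few keys.
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ConsistentRing:
    def __init__(self, nodes):
        # Ring of (hash, node) pairs sorted by hash position.
        self.ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        """A key maps to the first node clockwise from its hash."""
        hashes = [pos for pos, _ in self.ring]
        i = bisect.bisect_right(hashes, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentRing(["node-a", "node-b", "node-c"])
print(ring.lookup("some-object"))
```

The property that matters for a planetary-scale overlay is stability: the same key always resolves to the same live node, and membership changes only remap the keys adjacent to the changed node.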
Structure of the PlanetLab
• >1000 viewpoints on the internet
• 10-100 resource-rich sites at network crossroads
• Typical use involves a slice across a substantial subset of nodes
• Dual role by design
  – Research testbed: a large set of geographically distributed machines; diverse & realistic network conditions; classic ‘controlled’ SIGCOMM/INFOCOM studies
  – Deployment platform: services (design evaluation, client base -> composite services); nodes (proxy path, physical path); make it useful to people
Initial Researchers (Mar 02)
• Washington: Tom Anderson, Steven Gribble, David Wetherall
• MIT: Frans Kaashoek, Hari Balakrishnan, Robert Morris, David Anderson
• Berkeley: Ion Stoica, Joe Hellerstein, Eric Brewer, John Kubi
• Intel Research: David Culler, Timothy Roscoe, Sylvia Ratnasamy, Gaetano Borriello, Satya, Milan Milenkovic
• Duke: Amin Vahdat, Jeff Chase
• Princeton: Larry Peterson, Randy Wang, Vivek Pai
• Rice: Peter Druschel
• Utah: Jay Lepreau
• CMU: Srini Seshan, Hui Zhang
• UCSD: Stefan Savage
• Columbia: Andrew Campbell
• ICIR: Scott Shenker, Mark Handley, Eddie Kohler
see http://www.cs.berkeley.edu/~culler/planetlab
Initial Planet-Lab Candidate Sites
[Map: Intel Berkeley, ICIR, MIT, Princeton, Cornell, Duke, UT, Columbia, UCSB, UCB, UCSD, UCLA, UW, Intel Seattle, KY, Melbourne, Cambridge, Harvard, GIT, Uppsala, Copenhagen, CMU, UPenn, WI, Chicago, Utah, Intel OR, UBC, WashU, ISI, Intel, Rice, Beijing, Tokyo, Barcelona, Amsterdam, Karlsruhe, St. Louis]
Planned as of July 2002