Societal Scale Information Systems
Jim Demmel, Chief Scientist, EECS and Math Depts.
www.citris.berkeley.edu
UC Santa Cruz
Outline
• Scope
• Problems to be solved
• Overview of projects
Societal-Scale Information System (SIS)
[Diagram: MEMS sensors and information appliances ("clients") connect over Gigabit Ethernet to clusters and a massive cluster ("servers") providing scalable, reliable, secure services]
Desirable SIS Features - Problems to Solve
• Integrates diverse components seamlessly
• Easy to build new services from existing ones
• Adapts to interfaces/users
• Non-stop, always connected
• Secure
Projects
• Sahara: Service Architecture for Heterogeneous Access, Resources, and Applications (Katz, Joseph, Stoica)
• OceanStore and Tapestry (Kubiatowicz, Joseph)
• ROC and iStore: Recovery Oriented Computing and Intelligent Store (Patterson, Yelick, Fox (Stanford))
• Millennium and PlanetLab (Culler, Kubiatowicz, Stoica, Shenker)
Many faculty away on retreats:
  www.cs.berkeley.edu/~bmiller/saharaRetreat.htm
  www.cs.berkeley.edu/~bmiller/ROCretreat.htm
The “Sahara” Project
Service Architecture for Heterogeneous Access, Resources, and Applications
www.cs.berkeley.edu/~bmiller/saharaRetreat.html
Scenario: Service Composition
[Diagram: a user in Salt Lake City, connected via Sprint, reaches a JAL Restaurant Guide Service in Tokyo via NTT DoCoMo; the composed service combines a Zagat Guide, a Babblefish translator, and a UI]
Sahara Research FocusSahara Research Focus
New mechanisms, techniques for end-to-end services w/ New mechanisms, techniques for end-to-end services w/ desirable, predictable, enforceable properties desirable, predictable, enforceable properties spanning spanning potentially distrusting service providerspotentially distrusting service providers Tech architecture for service composition & inter-operation across Tech architecture for service composition & inter-operation across
separate admin domains, supporting peering & brokering, and diverse separate admin domains, supporting peering & brokering, and diverse business, value-exchange, access-control modelsbusiness, value-exchange, access-control models
Functional elementsFunctional elements Service discovery Service-level agreements Service composition under constraints Redirection to a service instance Performance measurement infrastructure Constraints based on performance, access control,
accounting/billing/settlements Service modeling and verification
Technical Challenges
• Trust management and behavior verification
  – Meet promised functionality, performance, availability
• Adapting to network dynamics
  – Actively respond to shifting server-side workloads and network congestion, based on pervasive monitoring & measurement
  – Awareness of network topology to drive service selection
• Adapting to user dynamics
  – Resource allocation responsive to client-side workload variations
• Resource provisioning and management
  – Service allocation and service placement
• Interoperability across multiple service providers
  – Interworking across similar services deployed by different providers
Service Composition Models
• Cooperative: individual component service providers interact in distributed fashion, with distributed responsibility, to provide an end-to-end composed service
• Brokered: a single provider, the Broker, uses functionality provided by underlying service providers and encapsulates it to compose an end-to-end service
Examples
• Cooperative: roaming among separate mobile networks
• Brokered: JAL restaurant guide
Mechanisms for Service Composition (1)
Measurement-based Adaptation. Examples:
• Host distance monitoring and estimation service
• Universal In-box: exchange network and server load
• Content Distribution Networks: redirect client to closest service instance
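The CDN example above can be sketched in a few lines: probe each candidate service instance, then redirect the client to the one with the lowest estimated latency. This is an illustrative sketch, not Sahara code; the replica names and RTT figures are hypothetical.

```python
# Minimal sketch of measurement-based redirection: pick the service
# instance with the lowest estimated round-trip time.

def pick_closest(rtt_estimates):
    """Return the instance name with the smallest estimated RTT (ms)."""
    return min(rtt_estimates, key=rtt_estimates.get)

# Hypothetical probe results for three CDN replicas (milliseconds).
rtts = {"replica-tokyo": 180.0, "replica-sfo": 22.5, "replica-nyc": 71.0}
print(pick_closest(rtts))  # -> replica-sfo
```

In a real deployment the RTT map would come from a host-distance monitoring service like the one listed above, refreshed continuously rather than measured once.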
Mechanisms for Service Composition (2)
Utility-based Resource Allocation Mechanisms. Examples:
• Auctions to dynamically allocate resources; applied to spectrum/bandwidth resource assignments
• Congestion pricing (same idea for power)
• Voice port allocation to user-initiated calls in H.323 gateway / Voice over IP service management
• Wireless LAN bandwidth allocation and management
• H.323 gateway selection, redirection, and load balancing for Voice over IP services
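As a toy illustration of auction-based allocation (not the project's actual mechanism; the bidders and bids are made up), a sealed-bid second-price auction awards a resource slot, say a bandwidth reservation, to the highest bidder at the second-highest price:

```python
# Minimal second-price (Vickrey) auction sketch for allocating one
# scarce resource slot among competing users.

def vickrey_auction(bids):
    """Return (winner, price): highest bidder wins, pays 2nd-highest bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

# Hypothetical bids (dollars) for one bandwidth slot.
winner, price = vickrey_auction({"alice": 10.0, "bob": 7.5, "carol": 4.0})
print(winner, price)  # -> alice 7.5
```

The second-price rule makes truthful bidding a dominant strategy, which is one reason auctions are attractive for dynamic spectrum/bandwidth assignment.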
Mechanisms for Service Composition (3)
Trust Management / Verification of Service & Usage
• Authentication, Authorization, Accounting (AAA) services
  – Credential transformations to enable cross-domain service invocation
  – Federated administrative domains with credential-transformation rules based on established agreements
  – AAA server makes authorization decisions
• Service Level Agreement verification
  – Verification and usage monitoring to ensure properties specified in the SLA are being honored
  – Border routers monitor control traffic from different providers to detect malicious route advertisements
OceanStore: Global-Scale Persistent Storage
OceanStore Context: Ubiquitous Computing
• Computing everywhere: desktop, laptop, palmtop; cars, cellphones; shoes? clothing? walls?
• Connectivity everywhere: rapid growth of bandwidth in the interior of the net; broadband to the home and office; wireless technologies such as CDMA, satellite, laser
Questions about information:
• Where is persistent information stored?
  – Want: geographic independence for availability, durability, and freedom to adapt to circumstances
• How is it protected?
  – Want: encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity
• Can we make it indestructible?
  – Want: redundancy with continuous repair and redistribution for long-term durability
• Is it hard to manage?
  – Want: automatic optimization, diagnosis, and repair
• Who owns the aggregate resources?
  – Want: utility infrastructure!
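The redundancy argument above is easy to quantify with a back-of-envelope sketch (the failure probability and replica counts below are illustrative, not from the talk): if n replicas fail independently with probability p per repair interval, data is lost only when all n fail, so continuous repair that restores the replica count after each interval keeps the per-interval loss probability at p^n.

```python
# Back-of-envelope durability sketch: redundancy plus continuous
# repair turns unreliable components into a durable store.

def loss_probability(p, n):
    """P(all n replicas fail in one repair interval), assuming
    independent failures with per-replica probability p."""
    return p ** n

# Illustrative numbers: even p = 0.1 per interval gives rapidly
# shrinking loss probability as the replica count n grows.
for n in (1, 2, 4):
    print(n, loss_probability(0.1, n))
```

Real systems use erasure codes (e.g., the Reed-Solomon fragments mentioned later) rather than whole replicas, which buys the same durability at much lower storage overhead.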
Utility-based Infrastructure
[Diagram: a federation of providers (Pac Bell, Sprint, IBM, AT&T, a Canadian OceanStore) jointly hosting the store]
• Transparent data service provided by a federation of companies
• Monthly fee paid to one service provider
• Companies buy and sell capacity from each other
OceanStore Assumptions
• Untrusted infrastructure
  – The OceanStore is comprised of untrusted components
  – Only ciphertext within the infrastructure
• Responsible party
  – Some organization (i.e., a service provider) guarantees that your data is consistent and durable
  – Not trusted with the content of data, merely its integrity
• Mostly well-connected
  – Data producers and consumers are connected to a high-bandwidth network most of the time
  – Exploit multicast for quicker consistency when possible
• Promiscuous caching
  – Data may be cached anywhere, anytime
• Optimistic concurrency via conflict resolution
  – Avoid locking in the wide area
  – Applications use an object-based interface for updates
First Implementation [Java]
• Event-driven state-machine model
• Included components
  – Initial floating-replica design: conflict resolution and Byzantine agreement
  – Routing facility (Tapestry): Bloom filter location algorithm; Plaxton-based locate-and-route data structures
  – Introspective gathering of tacit info and adaptation: language for introspective handler construction; clustering, prefetching, adaptation of network routing
  – Initial archival facilities: interleaved Reed-Solomon codes for fragmentation; methods for signing and validating fragments
• Target applications
  – Unix file-system interface under Linux (“legacy apps”)
  – Email application, proxy for web caches, streaming multimedia applications
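The Tapestry component above names a Bloom filter location algorithm. A Bloom filter answers set-membership queries with no false negatives and a small, tunable false-positive rate, which lets a node cheaply ask "might this neighbor hold the object?" before routing. The following is a minimal Python sketch of the data structure itself; the actual implementation was in Java and integrated with Plaxton routing, and the sizes and object names here are hypothetical.

```python
# Minimal Bloom filter sketch: k hash positions per item set bits in
# an m-bit array; membership checks all k bits.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # a Python int used as an m-bit array

    def _positions(self, item):
        # Derive k independent positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("object-1234")
print("object-1234" in bf)  # -> True
```

Because a filter can be unioned and shipped around as a compact bit array, neighbors can summarize "what I (probably) store" far more cheaply than exchanging full object lists.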
OceanStore Conclusions
• OceanStore: everyone’s data, one big utility
• Global utility model for persistent data storage
• OceanStore assumptions
  – Untrusted infrastructure with a responsible party
  – Mostly connected, with conflict resolution
  – Continuous on-line optimization
• OceanStore properties
  – Provides security, privacy, and integrity
  – Provides extreme durability
  – Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis and repair
  – Large-scale system has good statistical properties
OceanStore Prototype: running with 5 other sites worldwide
Recovery-Oriented Computing Philosophy
• People/HW/SW failures are facts to cope with, not problems to solve (“Peres’s Law”)
• Improving recovery/repair improves availability
  – Unavailability ≈ MTTR / MTTF (assuming MTTR << MTTF)
  – 1/10th the MTTR is just as valuable as 10X the MTBF
• Recovery/repair is how we cope with the above facts
• Since a major system-administration job is recovery after failure, ROC also helps with maintenance/TCO; Total Cost of Ownership is 5-10X the HW/SW cost
• www.cs.berkeley.edu/~bmiller/ROCretreat.htm
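The MTTR arithmetic on this slide can be checked directly (the hour figures below are illustrative, not from the talk): since unavailability = MTTR / (MTTR + MTTF), dividing MTTR by 10 yields exactly the same availability as multiplying MTTF by 10.

```python
# Sketch of the ROC availability arithmetic: faster repair is worth
# as much as longer time-to-failure.

def unavailability(mttr_hours, mttf_hours):
    """Fraction of time the system is down."""
    return mttr_hours / (mttr_hours + mttf_hours)

base = unavailability(1.0, 1000.0)            # 1 h to repair, ~42 days between failures
faster_repair = unavailability(0.1, 1000.0)   # 10x faster recovery
longer_uptime = unavailability(1.0, 10000.0)  # 10x longer MTTF
print(base, faster_repair, longer_uptime)
```

Both improved cases come out identical (1/10001), which is the slide's point: investing in recovery speed is as effective as investing in failure avoidance.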
ROC approach
1. Collect data to see why services fail
   • Operators cause > 50% of failures
2. Create benchmarks to measure dependability
   • Benchmarks inspire and enable researchers, and name names to spur commercial improvements
3. Margin of safety (from civil engineering)
   • Overprovision to handle the unexpected
4. Create and evaluate techniques to help
   • Undo for system administrators in the field
   • Partitioning to isolate errors and upgrade in the field
   • Fault insertion to test emergency systems in the field
Availability benchmarking 101
• Availability benchmarks quantify system behavior under failures, maintenance, and recovery
• They require
  – A realistic (fault) workload for the system
  – Fault injection to simulate failures
  – Human operators to perform repairs
• The new winner is the fastest to recover, vs. simply the fastest
[Graph: QoS holds at normal behavior (99% conf.) until a failure, is degraded during the repair time, then returns to normal]
Source: A. Brown and D. Patterson, “Towards availability benchmarks: a case study of software RAID systems,” Proc. USENIX, 18-23 June 2000
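A toy version of such a benchmark run can be sketched as follows (the workload, intervals, and QoS numbers are entirely hypothetical): inject a failure at a known point, serve degraded QoS until repair completes, and report the delivered-QoS timeline that a real availability benchmark would plot.

```python
# Toy availability-benchmark sketch: report delivered QoS (fraction of
# requests served) per interval across an injected failure and repair.

def run_benchmark(intervals=10, fail_at=4, repair_time=3):
    qos = []
    for t in range(intervals):
        degraded = fail_at <= t < fail_at + repair_time
        qos.append(0.5 if degraded else 1.0)  # half service while repairing
    return qos

print(run_benchmark())  # dip from t=4 through t=6, then recovery
```

Shrinking `repair_time` directly shrinks the dip's width, which is exactly what a recovery-oriented "winner" optimizes.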
ISTORE Hardware Techniques for Availability
• Cluster of Storage-Oriented Nodes (SON)
  – Scalable, tolerates partial failures, automatic redundancy
• Heavily instrumented hardware
  – Sensors for temperature, vibration, humidity, power, intrusion
• Independent diagnostic processor on each node
  – Remote control of power; collects environmental data
  – Diagnostic processors connected via an independent network
• On-demand network partitioning/isolation
  – Allows testing and repair of an online system
  – Managed by the diagnostic processor
• Built-in fault-injection capabilities
  – Used for hardware introspection
  – Important for AME benchmarking
ISTORE Software Techniques for Availability
• Reactive introspection: “mining” available system data
• Proactive introspection: isolation + fault insertion => test recovery code
• Semantic redundancy: use of coding and application-specific checkpoints
• Self-scrubbing data structures: check (and repair?) complex distributed structures
• Load adaptation for performance faults: dynamic load balancing for “regular” computations
• Benchmarking: define quantitative evaluations for AME
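As a miniature illustration of the self-scrubbing idea (a hypothetical scheme, not ISTORE's actual code): keep a checksum beside each record, scan periodically, and restore any record whose checksum no longer matches from a replica copy.

```python
# Self-scrubbing sketch: detect silent corruption via checksums and
# repair from a replica during a periodic scan.
import zlib

def checksum(data: bytes) -> int:
    return zlib.crc32(data)

def scrub(store, replica):
    """Verify each record; restore corrupted ones from the replica.
    Returns the list of repaired keys."""
    repaired = []
    for key, (data, crc) in list(store.items()):
        if checksum(data) != crc:
            good = replica[key]
            store[key] = (good, checksum(good))
            repaired.append(key)
    return repaired

replica = {"r1": b"hello", "r2": b"world"}
store = {k: (v, checksum(v)) for k, v in replica.items()}
store["r2"] = (b"w0rld", store["r2"][1])  # simulate silent corruption
print(scrub(store, replica))  # -> ['r2']
```

A production scrubber would verify cryptographic signatures rather than CRCs and repair from erasure-coded fragments, but the scan/check/repair loop is the same shape.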
ISTORE Status
• Hardware
  – All 80 nodes (boards) manufactured
  – PCB backplane: in layout
  – 32-node system running May 2002
• Software
  – 2-node system running; boots OS
  – Diagnostic processor SW and device driver done
  – Network striping done; fault adaptation ongoing
  – Load balancing for performance heterogeneity done
• Benchmarking
  – Availability benchmark example complete
  – Initial maintainability benchmark complete; revised strategy underway
ISTORE Prototype
UCB Clusters
• Millennium Central Cluster
  – 99 Dell 2300/6400/6450 Xeon Dual/Quad: 332 processors
  – Total: 211 GB memory, 3 TB disk
  – Myrinet 2000 + 1000 Mb fiber Ethernet
• OceanStore/ROC cluster, Astro cluster, Math cluster, Cory cluster, more
• CITRIS Cluster 1: 3/2002 deployment (Intel donation)
  – 4 Dell Precision 730 Itanium Duals: 8 processors
  – Total: 8 GB memory, 128 GB disk
  – Myrinet 2000 + 1000 Mb copper Ethernet
• CITRIS Cluster 2: 9/2002 deployment (Intel donation)
  – ~128 Dell McKinley-class Duals: 256 processors
  – Total: ~512 GB memory, ~8 TB disk
  – Myrinet 2000 + 1000 Mb copper Ethernet
• Network expansion needed (and underway): UCB shift from Nortel plus expansion
• Ganglia cluster management in the SuSE distribution; hundreds of companies using it
CITRIS Network Rollout
Millennium Top Users
• 800 users total on the central cluster, many of whom are CITRIS users
• 75 major users for 2/2002: average 65% total CPU utilization
• Independent component analysis: machine-learning algorithms (fbach)
• ns-2, a packet-level network simulator (machi)
• Parallel AI algorithms for controlling a 4-legged robot (ang)
• Image recognition (lwalker): 2 hours on the cluster vs. 2 weeks on local resources
• Network simulations for infrastructure to track moving objects over a wide area (mdenny)
• Analyzing trends in BGP routing tables (sagarwal)
• Boundary extraction and segmentation of natural images (dmartin)
• Optical simulation and high-quality rendering (adamb)
• Titanium: compiler and runtime-system design for high-performance parallel programming languages (bonachea)
• AMANDA: neutrino detection from polar ice core samples (amanda)
http://ganglia.millennium.berkeley.edu
Planet-Lab Motivation
• A new class of services & applications is emerging that spreads over a sizable fraction of the web
  – CDNs as the first examples
  – Peer-to-peer, ...
• Architectural components are beginning to emerge
  – Distributable hash tables to provide scalable translation
  – Distributed storage, caching, instrumentation, mapping, ...
• The next internet will be created as an overlay on the current one
  – as was the last one
  – It will be defined by its services, not its transport: translation, storage, caching, event notification, management
• There is NO vehicle to try out the next n great ideas in this area
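A one-screen sketch of the core idea behind such hash tables is consistent hashing: nodes and keys share one hash space, and each key belongs to the first node clockwise from its hash on a ring. This is illustrative only; real systems like Tapestry use Plaxton-style prefix routing rather than a simple ring, and the node names below are made up.

```python
# Consistent-hashing sketch: map keys to nodes via positions on a
# shared hash ring, so adding/removing a node moves few keys.
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ConsistentRing:
    def __init__(self, nodes):
        # Ring of (hash, node) pairs sorted by hash position.
        self.ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        """A key maps to the first node clockwise from its hash."""
        hashes = [pos for pos, _ in self.ring]
        i = bisect.bisect_right(hashes, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentRing(["node-a", "node-b", "node-c"])
print(ring.lookup("some-object"))
```

The property that matters for a planetary-scale overlay is stability: the same key always resolves to the same live node, and membership changes only remap the keys adjacent to the changed node.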
Structure of the PlanetLab
• >1000 viewpoints on the internet
• 10-100 resource-rich sites at network crossroads
• Typical use involves a slice across a substantial subset of nodes
• Dual role by design
  – Research testbed: a large set of geographically distributed machines; diverse & realistic network conditions; classic ‘controlled’ SIGCOMM/INFOCOM studies
  – Deployment platform: services (design evaluation, client base -> composite services); nodes (proxy path, physical path); make it useful to people
Initial Researchers (Mar 02)
• Washington: Tom Anderson, Steven Gribble, David Wetherall
• MIT: Frans Kaashoek, Hari Balakrishnan, Robert Morris, David Anderson
• Berkeley: Ion Stoica, Joe Hellerstein, Eric Brewer, John Kubi
• Intel Research: David Culler, Timothy Roscoe, Sylvia Ratnasamy, Gaetano Borriello, Satya, Milan Milenkovic
• Duke: Amin Vahdat, Jeff Chase
• Princeton: Larry Peterson, Randy Wang, Vivek Pai
• Rice: Peter Druschel
• Utah: Jay Lepreau
• CMU: Srini Seshan, Hui Zhang
• UCSD: Stefan Savage
• Columbia: Andrew Campbell
• ICIR: Scott Shenker, Mark Handley, Eddie Kohler
see http://www.cs.berkeley.edu/~culler/planetlab
Initial Planet-Lab Candidate Sites
[Map: Intel Berkeley, ICIR, MIT, Princeton, Cornell, Duke, UT, Columbia, UCSB, UCB, UCSD, UCLA, UW, Intel Seattle, KY, Melbourne, Cambridge, Harvard, GIT, Uppsala, Copenhagen, CMU, UPenn, WI, Chicago, Utah, Intel OR, UBC, WashU, ISI, Intel, Rice, Beijing, Tokyo, Barcelona, Amsterdam, Karlsruhe, St. Louis]
Planned as of July 2002