Atlas: An Infrastructure for Global Computing
People
Eric Baldeschwieler (UC Berkeley)
Bobby Blumofe (UT Austin)
Eric Brewer (UC Berkeley)
Outline
Introduction
Programming model
Architecture
Examples
Discussion
Limitations & Conclusion
Introduction
Properties of an Internet computing infrastructure
Scalability: to 10^6 nodes
Heterogeneity: of machines & OSs
Fault tolerance: completion probability comparable to that of a sequential program
Adaptive parallelism: dynamic set of resources
Properties ...
Safety: Hosts must be secure
Anonymity: Secure the privacy of the client's data & program
Hierarchy: Locality of communication (local bandwidth typically is higher)
Ease of use: Minimize “costs” of participating.
Reasonable performance: Low overhead, to benefit even from a small set of machines.
Introduction ...
Atlas combines mechanisms from:
– Cilk
– Java
– with new mechanisms.
Java “ensures”:
– heterogeneity
– safety
Introduction ...
Atlas:
extends Cilk’s work-stealing scheduler to a hierarchical Internet setting
uses Cilk-NOW’s mechanisms for:
– adaptive parallelism
– fault tolerance
Programming Model
Applications are written in Java
When a native library is used, heterogeneity is limited to platforms that support it.
Programming model is:
– a Java-based implementation of Cilk: non-blocking, explicit continuation-passing threads
– a Unix-like URL-based file system & local caching with coherence.
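The continuation-passing style named above can be illustrated with a minimal sketch. It is not Atlas's actual API, which these slides do not show: the class `CpsFib`, the `Sum` closure, and `spawnFib` are hypothetical names, and a single-threaded loop stands in for the scheduler. Each thread runs to completion without blocking and hands its result to an explicit continuation instead of returning it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

/** Hypothetical sketch: non-blocking threads with explicit continuations. */
class CpsFib {
    /** Ready threads; each one runs to completion and never blocks. */
    private final Deque<Runnable> ready = new ArrayDeque<>();

    /** Successor closure: waits for two values, then sends their sum onward. */
    private final class Sum {
        private final IntConsumer cont;
        private int missing = 2;   // join counter: arguments still outstanding
        private int total = 0;

        Sum(IntConsumer cont) { this.cont = cont; }

        void send(int value) {
            total += value;
            if (--missing == 0) {
                // All arguments have arrived, so the successor becomes ready.
                ready.push(() -> cont.accept(total));
            }
        }
    }

    /** Spawn a thread for fib(n); the result goes to cont, it is not returned. */
    void spawnFib(int n, IntConsumer cont) {
        ready.push(() -> {
            if (n < 2) {
                cont.accept(n);
            } else {
                Sum sum = new Sum(cont);
                spawnFib(n - 1, sum::send);
                spawnFib(n - 2, sum::send);
            }
        });
    }

    /** Run ready threads until none remain, then return the final value. */
    int run(int n) {
        int[] result = new int[1];
        spawnFib(n, v -> result[0] = v);
        while (!ready.isEmpty()) {
            ready.pop().run();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(new CpsFib().run(24)); // prints 46368
    }
}
```

Because no thread ever waits on a child, any entry in the ready queue could in principle be executed on another machine, which is what makes this style a natural fit for work stealing.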
Architecture
[Figure: Basic architecture. A client and a manager with six compute servers. Each software stack shown: Application (Java), Runtime library, Java interpreter, Native libraries (C or C++).]
Architecture ...
Client is a Java application
– connects to compute servers on machines other than its manager’s.
Idle servers steal work from busy ones.
Architecture
Compute server:
– relinquishes control when there is non-Atlas work (a screensaver?)
– Runs as a daemon: when not working, pings manager & siblings for work to steal
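The stealing behaviour described above can be sketched with a double-ended queue, the usual structure in Cilk-style schedulers: the owner works at one end and thieves take from the other. This is an illustration under assumptions, not Atlas code: `WorkQueue` and `findWork` are hypothetical names, and victims are tried in a fixed order here only to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical work queue for one compute server; not Atlas's actual API. */
class WorkQueue<T> {
    private final Deque<T> tasks = new ArrayDeque<>();

    /** The owner adds newly spawned work at the bottom. */
    synchronized void push(T task) { tasks.addLast(task); }

    /** The owner takes the newest task from the bottom; null if empty. */
    synchronized T pop() { return tasks.pollLast(); }

    /** A thief takes the oldest task from the top; null if empty. */
    synchronized T steal() { return tasks.pollFirst(); }

    /** Daemon step: use local work first; when idle, ask each victim in turn. */
    static <T> T findWork(WorkQueue<T> own, List<WorkQueue<T>> victims) {
        T task = own.pop();
        if (task != null) {
            return task;
        }
        for (WorkQueue<T> victim : victims) {
            task = victim.steal();
            if (task != null) {
                return task;
            }
        }
        return null; // nothing to do anywhere; the caller would retry later
    }
}
```

Stealing the oldest task tends to move the largest remaining piece of work, so a thief should need to steal less often.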
Architecture: Porting Atlas
A Java runtime system
Port:
– natively written URL-based file system
– some support routines.
Hierarchical Work Stealing
[Figure: a tree of five Manager nodes and six ComputeServer nodes.]
Hierarchical Work Stealing ...
Manager keeps track of when its subtree is idle
If manager’s subtree is idle, manager steals work from its siblings
If a subtree has “too much” work, it “allows” work stealing from above
What is the definition & implementation of “too much”?
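One way to picture these rules is the sketch below. It rests on stated assumptions: the slides do not define “too much”, so a plain task-count `threshold` stands in for it, and `HierNode`, `subtreeWork`, and `chooseVictim` are hypothetical names rather than anything taken from Atlas.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node in the manager/compute-server tree; illustrative only. */
class HierNode {
    final String name;
    final HierNode parent;
    final List<HierNode> children = new ArrayList<>();
    int localWork; // queued tasks held directly by this node

    HierNode(String name, HierNode parent, int localWork) {
        this.name = name;
        this.parent = parent;
        this.localWork = localWork;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    /** Total queued work in the subtree rooted at this node. */
    int subtreeWork() {
        int total = localWork;
        for (HierNode child : children) {
            total += child.subtreeWork();
        }
        return total;
    }

    boolean subtreeIdle() {
        return subtreeWork() == 0;
    }

    /**
     * If this subtree is idle, pick the sibling subtree with the most work,
     * provided it exceeds the (assumed) "too much" threshold. Null otherwise.
     */
    HierNode chooseVictim(int threshold) {
        if (parent == null || !subtreeIdle()) {
            return null;
        }
        HierNode best = null;
        for (HierNode sibling : parent.children) {
            if (sibling == this) {
                continue;
            }
            int work = sibling.subtreeWork();
            if (work > threshold && (best == null || work > best.subtreeWork())) {
                best = sibling;
            }
        }
        return best;
    }
}
```

Because a manager only looks at its siblings, a steal crosses a subtree boundary only when everything beneath the thief is idle, which keeps most communication local.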
Hierarchical Work Stealing
The authors claim that proven properties of Cilk hold in this hierarchical setting.
Goals:
– Localize communication
– Sub-trees map to domain hierarchy
Administrators can control thread migration:
– Outflow: Privacy
– Inflow: Host security
Examples
Fib: fine grained threads
POV-Ray: coarse grained threads
            Base     1 Node    3 Nodes     8 Nodes
Fib (24)    1.3      80        40 (2.0)    31 (2.6)
POV-Ray     20700    21000     -           2700 (7.8)
Numbers in ( ) are speedups over 1-node case.
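The parenthesized speedups can be recomputed from the table as the 1-node time divided by the P-node time. The short sketch below does that and also derives parallel efficiency (speedup divided by node count), which the slide does not list; the class and method names are illustrative.

```java
/** Recomputes the speedups in the table from its raw timings. */
class Speedup {
    /** Speedup over the 1-node case. */
    static double speedup(double oneNodeTime, double pNodeTime) {
        return oneNodeTime / pNodeTime;
    }

    /** Parallel efficiency: speedup divided by the number of nodes. */
    static double efficiency(double oneNodeTime, double pNodeTime, int nodes) {
        return speedup(oneNodeTime, pNodeTime) / nodes;
    }

    public static void main(String[] args) {
        System.out.printf("Fib, 3 nodes:     %.1f%n", speedup(80, 40));
        System.out.printf("Fib, 8 nodes:     %.1f%n", speedup(80, 31));
        System.out.printf("POV-Ray, 8 nodes: %.1f%n", speedup(21000, 2700));
        System.out.printf("POV-Ray, 8 nodes efficiency: %.2f%n",
                efficiency(21000, 2700, 8));
    }
}
```

The same arithmetic applied to the Base and 1 Node columns shows how differently the two programs fare: Fib slows from 1.3 to 80, while POV-Ray goes from 20700 to 21000.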
Examples ...
POV-Ray is not written in Java
Partitioning is done in Java
8 nodes: only 2% overhead.
What about larger P?
Discussion
Scalable: Yes.
Heterogeneity: Incomplete until Atlas divorces itself from all native libraries.
Safety:
– Java: OK.
– Native libraries: ?
Discussion ...
Fault tolerance: A timed out thread is recomputed from a checkpoint maintained by subtree (manager?)
– What is the effect of checkpointing on performance?
Subtree rooted at a thread is its subcomputation.
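The timeout-and-recompute idea can be sketched as a table of outstanding threads. Everything here is an assumption made for illustration: the slides leave open who keeps the checkpoint, so a hypothetical `CheckpointTable` plays that role, a string stands in for the saved thread state, and time is a logical counter passed in by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checkpoint table; illustrative only, not Atlas's mechanism. */
class CheckpointTable {
    private static final class Entry {
        final String savedState; // enough information to restart the thread
        long deadline;           // logical time after which it counts as lost

        Entry(String savedState, long deadline) {
            this.savedState = savedState;
            this.deadline = deadline;
        }
    }

    private final Map<Integer, Entry> outstanding = new LinkedHashMap<>();
    private final long timeout;

    CheckpointTable(long timeout) {
        this.timeout = timeout;
    }

    /** Save a checkpoint when a thread is handed to another server. */
    void handedOut(int threadId, String savedState, long now) {
        outstanding.put(threadId, new Entry(savedState, now + timeout));
    }

    /** A result arrived, so the checkpoint is no longer needed. */
    void completed(int threadId) {
        outstanding.remove(threadId);
    }

    /** Return the saved state of every timed-out thread for recomputation. */
    List<String> timedOut(long now) {
        List<String> toRecompute = new ArrayList<>();
        for (Entry entry : outstanding.values()) {
            if (now >= entry.deadline) {
                toRecompute.add(entry.savedState);
                entry.deadline = now + timeout; // the re-issued copy gets a fresh deadline
            }
        }
        return toRecompute;
    }
}
```

In this sketch the cost of checkpointing is one table entry per handed-out thread, which gives a concrete starting point for the performance question raised above.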
Fault Tolerance ...
Subcomputations are transactions:
Authors claim: side effects can be undone
How does this relate to hierarchical work stealing?
Discussion ...
Anonymity: A host executing a stolen subtree cannot determine client.
– Managers are assumed to be trustworthy
Hierarchy: Yes, via manager hierarchy.
Ease of use: Interface incomplete.
– clients submit jobs via a special “shell”
Discussion ...
Adaptive parallelism:
– “Owner” (?) of compute server sets a policy that defines when server is idle.
– How?
– When compute server becomes unavailable for Atlas work, all its sub-computations are moved to another compute server.
Adaptive Parallelism ...
Moving a subcomputation requires updating information linking subcomputation to its:
– parent
– children
– How long does it take to retreat?
– Is sub-computation restarted? From checkpoint?
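The link updates listed above can be made concrete with a small sketch. It assumes, purely for illustration, that each subcomputation records the host of its parent and of each child as a string; `Subcomputation` and `migrate` are hypothetical names, and the sketch moves only the bookkeeping, not any running state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical record of one subcomputation and its links; illustrative only. */
class Subcomputation {
    final String id;
    final String parentId;                                   // null for the root
    final Map<String, String> childHosts = new HashMap<>();  // child id -> host
    String host;                                             // server currently running it
    String parentHost;                                       // where results are sent

    Subcomputation(String id, String host, Subcomputation parent) {
        this.id = id;
        this.host = host;
        this.parentId = parent == null ? null : parent.id;
        this.parentHost = parent == null ? null : parent.host;
        if (parent != null) {
            parent.childHosts.put(id, host);
        }
    }

    /** Move sub to newHost and repair the links held by its parent and children. */
    static void migrate(Subcomputation sub, String newHost,
                        Map<String, Subcomputation> all) {
        sub.host = newHost;
        if (sub.parentId != null) {
            // The parent must know where to find this child from now on.
            all.get(sub.parentId).childHosts.put(sub.id, newHost);
        }
        for (String childId : sub.childHosts.keySet()) {
            // Each child must send its result to the new location.
            all.get(childId).parentHost = newHost;
        }
    }
}
```

The number of updates is one per parent plus one per child, so in this model the time to retreat grows with the fan-out of the subcomputations being moved.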
Limitations
Atlas inherits tree-structured program limitation from Cilk.
– But this is still a rich set!
Generalizing to non-tree-structured programs seems hard.
No shared variables among threads.
Global file system is read-only.
Conclusion
Jicos design goals = those for Atlas.
Use JXTA to give Jicos a “file system”
– Then, Jicos becomes Atlas’s heir.