PICS - a Performance-analysis-based Introspective Control ... · PICS - a...
Transcript of PICS - a Performance-analysis-based Introspective Control ... · PICS - a...
PICS - a Performance-analysis-based IntrospectiveControl System to Steer Parallel Applications
Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kale
Parallel Programming LaboratoryUniversity of Illinois at Urbana-Champaign
June 10, 2014
Yanhua Sun Parallel Programming Laboratory, UIUC 1/25
Motivation
1 Modern parallel computer systems are becoming extremely complexdue to network topologies, hierarchical storage systems,heterogeneous processing units, etc.
2 Obtaining the best performance is challenging.
3 Moreover, multiple configurations for the same application.
512
1 2 4 8 16 32 64
tim
e(us)
Number of messages
time of using different number of messages to send data
1M (f:0.03125)
Yanhua Sun Parallel Programming Laboratory, UIUC 2/25
Motivation
1 Modern parallel computer systems are becoming extremely complexdue to network topologies, hierarchical storage systems,heterogeneous processing units, etc.
2 Obtaining the best performance is challenging.
3 Moreover, multiple configurations for the same application.
512
1 2 4 8 16 32 64
tim
e(us)
Number of messages
time of using different number of messages to send data
1M (f:0.03125)
Yanhua Sun Parallel Programming Laboratory, UIUC 2/25
Introspection and Adaptivity
General Observation
Configurations of tunable parameters in the runtime system andapplications significantly affect the performance.
Top Ten Exascale Research Challenges in DOE Report
”Introspection and automatic adaptation is listed as significant researchtopic to achieve the performance goal on exascale computers.”
Statement
This work addresses the problem of how to improve both parallelprogramming productivity and performance by letting applications/runtimeexpose tunable parameters and letting the control system figure out theoptimal configurations of these parameters.
Yanhua Sun Parallel Programming Laboratory, UIUC 3/25
Introspection and Adaptivity
General Observation
Configurations of tunable parameters in the runtime system andapplications significantly affect the performance.
Top Ten Exascale Research Challenges in DOE Report
”Introspection and automatic adaptation is listed as significant researchtopic to achieve the performance goal on exascale computers.”
Statement
This work addresses the problem of how to improve both parallelprogramming productivity and performance by letting applications/runtimeexpose tunable parameters and letting the control system figure out theoptimal configurations of these parameters.
Yanhua Sun Parallel Programming Laboratory, UIUC 3/25
Related work
Autotuning frameworks : generate multiple implementations (FFTW)
Autopilot[Ribler et al.(1998)]: fuzzy logic rules, grid applications,resource managements
MATE [Morajko 2006] : fully automatic tuning, performance model
Active Harmony[Chung and Hollingsworth(2006)] : heuristicalgorithms
SEEC: A General and Extensible Framework for Self-AwareComputing[Henry Homann (2010,2011,2013)]
Yanhua Sun Parallel Programming Laboratory, UIUC 4/25
Our Approach
HPC applications on large scale
Not rely on performance models
Richer set of tunable parameters due to the powerful intelligentruntime system
Not only application configurations are tunned, but also the runtimesystem itself
Automatic performance analysis accelerates steering
Yanhua Sun Parallel Programming Laboratory, UIUC 5/25
Outline
Overview of PICS framework
Control points in the runtime system and applications
Automatic performance analysis to accelerate steering
APIs implemented in Charm++
Results of benchmarks and applications
Yanhua Sun Parallel Programming Laboratory, UIUC 6/25
Overview of PICS framework
Performance instrumenta/on
Automa/c performance analysis
Run/me control points
Run/me reconfigura/on
Controller
Mini apps Real‐world applica/ons
Applica/on control points
Applica/on reconfigura/on
PICS
Adap/ve run/me system
applica/ons
Performance data
Expert knowledge rules
Yanhua Sun Parallel Programming Laboratory, UIUC 7/25
Control Points
Control points
Control points are tunable parameters for application and runtime tointeract with control system. First proposed in Dooley’s research.
1 Name, Values : default, min, max
2 Movement unit: +1,×23 Effects, directions
Degree of parallelismGrainsizePriorityMemory usageGPU loadMessage sizeNumber of messagesother effects
Yanhua Sun Parallel Programming Laboratory, UIUC 8/25
Application and Runtime Control Points
Application
1 Application specific control points provided by users
2 Applications should be able to reconfigure to use new values
Runtime1 Traditionally, configurations for the runtime system do not change
2 Configurations for the runtime system itself should be tunable
1 Registered by runtime itself
2 Requires no change from applications
3 Affect all applications
Yanhua Sun Parallel Programming Laboratory, UIUC 9/25
Observe Program Behaviors
Record all events
Events : begin idle, end idleFunctions: name, begin execution, end executionCommunication : message creation, size, source/destinationHardware counters
Module link, no source code modification
Performance summary data
Yanhua Sun Parallel Programming Laboratory, UIUC 10/25
Automatically Analyze the Performance
Many control points are registered. How to reduce the search space?
Performance Analysis - Identify Program Problems
Decomposition
Mapping
Scheduling
Yanhua Sun Parallel Programming Laboratory, UIUC 11/25
Automatically Analyze the Performance
Many control points are registered. How to reduce the search space?Performance Analysis - Identify Program Problems
Decomposition
Mapping
Scheduling
Yanhua Sun Parallel Programming Laboratory, UIUC 11/25
Decomposition Characteristics
Decomposit ion
problem?
High cache miss rate
(1)too big
en t ry me thod
Bytes per
message low
too much
communicat ion
on one object
Decrease
grain size
(2)too big
single object
(3) too much
critical path
(4)too few objects
per PE
Increase
grain sizeReplicate the objects
Yanhua Sun Parallel Programming Laboratory, UIUC 12/25
Mapping Characteristics
Mapping problem?
load imbalance
too much
communicat ion
on one PE
Communica t ion t ime >>
LogP model time
too much externa l
communicat ion
Load
balancer
Remap
Topology aware mapping
Yanhua Sun Parallel Programming Laboratory, UIUC 13/25
Scheduling Characteristics
scheduling problem?
Critical tasks
a re de layed
Prior i t ize the tasks
Yanhua Sun Parallel Programming Laboratory, UIUC 14/25
Other Characteristics
other problems?
Bytes per
message low
Reduction
broadcas tLong latency
Aggregate
MessageCollectives
Compress
m e s s a g e
Yanhua Sun Parallel Programming Laboratory, UIUC 15/25
Correlate Performance with Control Points
Performance
summary
CPU Util ization > 90% O v e r h e a d > 1 0 % Idle >10%
Sequent ia l
performance?
Cache Miss
> 1 0 %
Decrease
grain size
Small
en t ry me thods
Small Bytes
pe r message
Increase
grain size
Decomposit ion
problem?Mapping problem? Scheduling problem?Others?
Longer
e n t r y
m e t h o d
Larger
single
object
Long
critical
p a t h
Few
objects
per PE
Large
communicat ion
on one object
Decrease
grain size
Load imbalance
Large
communicat ion
on one PE
Communicat ion
t i m e > >
model t ime
Large
ex te rna l
communicat ion
Load
balancer
RemapCompress
m e s s a g e
Critical
t a s k s
are de layed
Prior i t ize the tasks
Large Bytes
pe r message
Long reduction
broadcas t
Long latency
for big msgs
Increase
aggrega t ion
threshold
Decrease
aggrega t ion
threshold
Collectives Replicate objects Topology aware mapping
One box can have multiple children
One egg can have multiple parents
Yanhua Sun Parallel Programming Laboratory, UIUC 16/25
Correlate Performance with Control Points
Traverse the tree using the performance summary results
performance results ⇒ solutions
solution ⇒ effect of control points
What control points to tune, in which direction!
How much?
grainsize : MaxObjLoadAvgLoad
Feed results into the control points database
Yanhua Sun Parallel Programming Laboratory, UIUC 17/25
Control System APIs
Implemented in Charm++, over-decomposition, asynchronous,message-driven model. (http://charm.cs.uiuc.edu/)
t y p ed e f s t r u c t C o n t r o l P o i n t t{
cha r name [ 3 0 ] ;enum TP DATATYPE data type ;doub l e d e f a u l tV a l u e ;doub l e c u r r e n tVa l u e ;doub l e minValue ;doub l e maxValue ;doub l e be s tVa l u e ;doub l e moveUnit ;i n t moveOP ;i n t e f f e c t ;i n t e f f e c t D i r e c t i o n ;i n t s t r a t e g y ;i n t entryEP ;i n t ob j e c t ID ;
} Con t r o lPo i n t ;
Yanhua Sun Parallel Programming Laboratory, UIUC 18/25
APIs for applications
vo i d r e g i s t e r C o n t r o l P o i n t ( Con t r o lPo i n t ∗ tp ) ;
v o i d s t a r t S t e p ( ) ;v o i d endStep ( ) ;
doub l e getTunedParameter ( con s t cha r ∗name , boo l ∗ v a l i d ) ;
Yanhua Sun Parallel Programming Laboratory, UIUC 19/25
Experimental Results of Benchmarks and Applications
1 Control points
2 Performance problems
3 Bluegene/Q machine, Cray XE6 machine
Yanhua Sun Parallel Programming Laboratory, UIUC 20/25
Tuning Message Pipeline
Control point: number of pipeline messages
0
2
4
6
8
10
12
14
16
10 20 30 40 50 60 70 80 0
2
4
6
8
10
12
14
16tim
este
p(m
s/st
ep)
num
ber o
f pip
elin
e m
essa
ges
step
timestep(less work)timestep(more work)
pipeline(less work)pipeline(more work)
Figure: Tuning the number of pipeline messages
Yanhua Sun Parallel Programming Laboratory, UIUC 21/25
Communication Bottleneck in ChaNGa
Control points: number of mirrors
1
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
5 10 15 20 25
s/st
ep
steps
tune mirrors with PICSno mirrors
Figure: Time cost of calculating gravity for various mirrors and no mirror on 16kcores on Blue Gene/Q
Yanhua Sun Parallel Programming Laboratory, UIUC 22/25
Message Compression
Control points: compression algorithm for each type messageRuntime control points
0
500
1000
1500
2000
10 20 30 40 50 60
times
tep(
ms/
step
)
step
r1=0.1, r2=1.0
Figure: Steering the compression algorithm for all-to-all benchmark
Yanhua Sun Parallel Programming Laboratory, UIUC 23/25
Jacobi3d Performance Steering
Control Points: sub-block size in each dimensionThree control pointsCache miss rate, high idle suggest decreasing sub-block sizeOverhead
0
0.5
1
1.5
2
2.5
3
3.5
4
5 10 15 20 25 30 35 40
times
tep(
ms/
step
)
step
total timeidle time cpu time
runtime overhead
Figure: Jacobi3d performance steering on 64 cores for problem of1024*1024*1024
Yanhua Sun Parallel Programming Laboratory, UIUC 24/25
Conclusion
Introspective control system is required to improve productivity andperformance
Automatic performance analysis helps guide performance steering
Steering both runtime system and applications are important
Implemented the system based on Charm++ programming model
Acknowledgment
This work was supported in part by NIH Grant 9P41GM104601, Center forMacromolecular Modeling and Bioinformatics. It was also supported inpart by DOE DE-AC02-06CH11357 Argo Project. This research usedresources of the Argonne Leadership Computing Facility at ArgonneNational Laboratory.
Yanhua Sun Parallel Programming Laboratory, UIUC 25/25