4.1.2. Run time
• Technology drivers
– Scale
– Variance (uncertainty in the characterization of applications and resource availability)
– Heterogeneity of resources
– Hierarchical structure of systems and applications
– Latency
• Alternative R&D strategy
– Flat vs. hierarchical
• Recommended research agenda
– Heterogeneity: data transfer, scheduling
– Hierarchical (multiple levels) or flat? Hybrid (interaction between levels)?
– Asynchrony:
  • Run-time dependence analysis, JIT compilation, interaction with compiler
  • Scheduling: dynamic, predictive
– Basic mechanisms for thread handling and communication: reduce overhead and latency (interaction with architecture)
– Optimize usage of communication infrastructure: routes, mapping, overlap of communication and computation
– Scheduling for parallel efficiency: computation time, load balance, granularity control, malleability
– Scheduling for memory efficiency: locality handling
– Shared address space
– Memory management
– Application/area-specific run times
• Crosscutting considerations
– Resilience: run time to implement fine-grain mechanisms and fire coarse-grain mechanisms
– Power management: drive knobs, interact with job scheduler
– Performance: run-time instrumentation; report suggestions to user; interact with job scheduler
– Programmability:
Heterogeneity
Key challenges
Unified/transparent accelerator run time models
Address heterogeneity of nodes and interconnects in cluster.
Scheduling for latency tolerance and bandwidth minimization
Adaptive granularity
Support execution of same program on different heterogeneous platforms
Optimize resource utilization and execution time
Different granularities supported by platforms
Hide specificities of accelerator from programmer
Summary of research direction
Potential impact on software component
Potential impact on usability, capability, and breadth of community
Broaden the portability of programs
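One way to picture the adaptive granularity challenge on this slide is to size each device's share of the work in proportion to its measured throughput. The sketch below is only an illustration under that assumption: the function name `split_by_throughput`, the device labels, and the throughput figures are hypothetical and do not come from the slides.

```python
def split_by_throughput(n_items, throughput):
    """Split n_items among devices in proportion to measured items/second.

    `throughput` maps a (hypothetical) device label to its measured rate.
    """
    total = sum(throughput.values())
    # Round each proportional share down to a whole number of items.
    shares = {dev: int(n_items * rate / total) for dev, rate in throughput.items()}
    # Hand the items lost to rounding to the fastest devices first.
    leftover = n_items - sum(shares.values())
    fastest_first = sorted(throughput, key=throughput.get, reverse=True)
    for dev in fastest_first[:leftover]:
        shares[dev] += 1
    return shares


if __name__ == "__main__":
    # Assumed rates: the accelerator is three times faster than the CPU.
    print(split_by_throughput(100, {"cpu": 1.0, "gpu": 3.0}))
```

A run time could re-measure the rates periodically and call the function again, which is one simple reading of "adaptive" here; real systems would also have to account for data transfer cost.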
Load Balance
Key challenges
General purpose self tuned run times: Detect imbalance and reallocate resources (cores, storage, DVFS, BW,…) within/across level(s).
Application specific load balancing run times
Minimization of impact of temporary resource shortage (OS noise, external urgent needs, …)
Adapt to variability in time and space (processes) of applications and systems
Optimize resource utilization, reduce execution time
Drastically reduce the effort needed to ensure efficient resource utilization and thus let programmers focus on functionality.
Only use resources that can be profitably used. Maximize ratio of achieved performance to power
5 years
Summary of research direction
Potential impact on software component
Potential impact on usability, capability, and breadth of community
Self tuned runtimes
Crosscut: Perf. Analysis, Job scheduling
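The "detect imbalance and reallocate resources" item can be sketched in a few lines. This is a minimal illustration, not the method the slides propose: imbalance is measured as the ratio of the most loaded worker to the average, and reallocation uses the well-known greedy longest-processing-time-first heuristic. The function names, task labels, and costs are all hypothetical.

```python
import heapq
from statistics import mean


def imbalance(loads):
    """Most loaded worker divided by the average load (1.0 means balanced)."""
    avg = mean(loads)
    return max(loads) / avg if avg > 0 else 1.0


def rebalance(task_costs, n_workers):
    """Greedy LPT: give each task, largest first, to the least loaded worker."""
    heap = [(0.0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    return assignment


def worker_loads(assignment, task_costs):
    """Total cost assigned to each worker, in worker-id order."""
    return [sum(task_costs[t] for t in tasks) for tasks in assignment.values()]


if __name__ == "__main__":
    costs = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}  # assumed task costs
    plan = rebalance(costs, n_workers=2)
    print(plan, imbalance(worker_loads(plan, costs)))
```

A self-tuned run time would compute the imbalance from measurements taken during execution and trigger a reallocation only when the ratio exceeds some threshold, since moving work has its own cost.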
Flat model
Key challenges
Keep memory requirements small and constant.
Thread based MPI (rank per thread).
Introduction of high levels of asynchrony: MPI Collectives, APGAS, Data-flow,…
Adapt communication subsystems (routing, mapping, RDMA, …) to application characteristics
Improve the performance of basic process management and synchronization mechanisms
Resource requirements (computing power, memory, network) by runtime implementation
Overcome limitations deriving from global synchronizing calls (barriers, collectives, …)
Optimize usage of communication resources
MPI: Leverage current applications
2-5 years
Summary of research direction
Potential impact on software component
Potential impact on usability, capability, and breadth of community
Increased scalability
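The asynchrony items on this slide aim at overlapping communication with computation. The sketch below illustrates the idea with double buffering: while one block is being processed, the next one is already in flight. It is a sketch under stated assumptions, with a background thread standing in for a non-blocking transfer; `process_with_overlap`, `fetch`, and `compute` are hypothetical names, not an MPI or runtime API.

```python
from concurrent.futures import ThreadPoolExecutor


def process_with_overlap(fetch, compute, n_blocks):
    """Fetch block i+1 in the background while computing on block i."""
    results = []
    if n_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)  # start the first transfer
        for i in range(n_blocks):
            block = pending.result()  # wait for the in-flight transfer
            if i + 1 < n_blocks:
                pending = pool.submit(fetch, i + 1)  # start the next transfer
            results.append(compute(block))  # runs while the transfer proceeds
    return results


if __name__ == "__main__":
    # Stand-ins: "fetching" builds a small list, "computing" sums it.
    print(process_with_overlap(lambda i: [i, i + 1], sum, 3))
```

The only global wait left is the one on the block that is actually needed next, which is the property the slide is after when it asks to overcome the limitations of global synchronizing calls.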
Hierarchical/hybrid
Key challenges
Hierarchical integration of runtimes (MPI+PGAS, MPI+threaded+Accelerator, MPI+accelerator, PGAS+Accelerator,…)
Modularity, reusability. Library compatibility.
Dimensioning of processes/threads. Scheduling, mapping to nodes
Memory placement and thread affinity
Match between model semantics at the different levels
Match platform structure, efficient usage of resources
Constrain size of name spaces
Summary of research direction
Potential impact on software component
Potential impact on usability, capability, and breadth of community
Better match to hardware (i.e. shared memory within node)
Interaction with Load balance and Job scheduling
Enable smooth migration path
Improved performance
5 years
4.1.2. Run time
[Timeline figure, 2010 to 2019. Tracks: heterogeneity, power management, job scheduling, asynchrony/overlap, hierarchy, dynamic memory association (??), load balance, memory efficiency, scheduling for locality, resilience]
4.1.2. Run time: what & why
Assume responsibility for matching algorithm characteristics/demands to available resources, optimizing their usage
Run the “same” source on 2 different heterogeneous systems
Do it for a couple of kernels and real applications
Run times performing dynamic memory association (work arrays, renaming, …), tolerating functional noise
[Timeline axis: 2010 to 2019]
Dynamicity, decoupling algorithm from resources
General purpose Run time automatically achieving load balance, optimized network usage, power minimization, malleability, tolerance to performance noise, … on heterogeneous system
Demonstration that automatic locality-aware scheduling can gain a factor of 5x in highly NUMA memory hierarchies
Run times tolerating an injection rate of 10 errors/hour
Demonstrate that asynchrony can achieve 3x strong scalability for both flat and hybrid systems
By this time EVERYBODY will be fed up with writing the same application again and again
The alternative will be to use current machines
So that we can finally rest
Machines will fail more than we do
A target if we want to get there
Fighting variance is a lost battle, learn to live with it