4.1.2. Run time

• Technology drivers
– Scale, variance (uncertainty in the characterization of applications and resource availability), heterogeneity of resources, hierarchical structure of systems and applications, latency

• Alternative R&D strategy
– Flat vs. hierarchical

• Recommended research agenda
– Heterogeneity, data transfer, scheduling
– Hierarchical (multiple levels) or flat? Hybrid (interaction between levels)?
– Asynchrony:
  • Run time dependence analysis, JIT compilation, interaction with compiler
  • Scheduling: dynamic, predictive
– Basic mechanisms for thread handling and communication: reduce overhead and latency (interaction with architecture)
– Optimize usage of communication infrastructure: routes, mapping, overlap of communication and computation
– Scheduling for parallel efficiency: computation time, load balance, granularity control, malleability
– Scheduling for memory efficiency: locality handling
– Shared address space
– Memory management
– Application/area-specific run times

• Crosscutting considerations
– Resilience: run time to implement fine-grain mechanisms and trigger coarse-grain mechanisms
– Power management: drive knobs, interact with job scheduler
– Performance: run time instrumentation, report suggestions to user, interact with job scheduler
– Programmability:

Heterogeneity

Key challenges

Unified/transparent accelerator run time models

Address heterogeneity of nodes and interconnects in cluster.

Scheduling for latency tolerance and bandwidth minimization

Adaptive granularity

Support execution of same program on different heterogeneous platforms

Optimize resource utilization and execution time

Different granularities supported by platforms

Hide specificities of accelerator from programmer

Summary of research direction

Potential impact on software component

Potential impact on usability, capability, and breadth of community

Broaden the portability of programs
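As an illustration of the "adaptive granularity" item, the following is a minimal sketch, not a description of any existing run time. It assumes the run time has already measured a relative throughput for each device (the device names and the function `split_by_throughput` are hypothetical) and sizes each device's share of the work accordingly.

```python
def split_by_throughput(n_items, throughputs):
    """Split n_items across devices in proportion to measured throughput.

    throughputs maps a (hypothetical) device name to a positive relative
    speed. The returned counts always sum to n_items.
    """
    total = sum(throughputs.values())
    if total <= 0:
        raise ValueError("at least one device must have positive throughput")
    # Ideal fractional share of each device.
    shares = {dev: n_items * t / total for dev, t in throughputs.items()}
    # Round down, then hand the leftover items to the largest remainders.
    counts = {dev: int(share) for dev, share in shares.items()}
    leftover = n_items - sum(counts.values())
    by_remainder = sorted(shares, key=lambda d: shares[d] - counts[d], reverse=True)
    for dev in by_remainder[:leftover]:
        counts[dev] += 1
    return counts


if __name__ == "__main__":
    # A GPU measured as three times faster than the CPU gets three times the work.
    print(split_by_throughput(1000, {"cpu": 1.0, "gpu": 3.0}))
```

A real run time would presumably re-measure throughput during execution and re-split; the sketch shows only a single static decision.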

Load Balance

Key challenges

General-purpose self-tuned run times: detect imbalance and reallocate resources (cores, storage, DVFS, bandwidth, …) within and across levels.

Application specific load balancing run times

Minimization of impact of temporary resource shortage (OS noise, external urgent needs, …)

Adapt to variability in time and space (processes) of applications and systems

Optimize resource utilization, reduce execution time

Drastically reduce the effort needed to ensure efficient resource utilization and thus let programmers focus on functionality.

Only use resources that can be profitably used. Maximize ratio of achieved performance to power

5 years



Self-tuned run times

Crosscut: performance analysis, job scheduling
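The "detect imbalance and reallocate" idea can be sketched as follows. This is only an illustration under simplifying assumptions: task costs are known in advance, workers are identical, and reallocation is done with the classic greedy "longest processing time first" heuristic, which is a stand-in technique rather than anything the slides prescribe. The names `imbalance` and `rebalance` are hypothetical.

```python
import heapq


def imbalance(loads):
    """Ratio of the maximum load to the mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


def rebalance(task_costs, n_workers):
    """Greedy LPT: give each task, largest first, to the least loaded worker.

    task_costs maps a task name to its estimated cost.
    Returns (assignment, loads), both indexed by worker.
    """
    heap = [(0.0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for task, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    loads = [0.0] * n_workers
    for load, w in heap:
        loads[w] = load
    return assignment, loads


if __name__ == "__main__":
    tasks = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
    assignment, loads = rebalance(tasks, 2)
    print(assignment, loads, imbalance(loads))
```

A self-tuned run time would compute a metric such as `imbalance` from live measurements and trigger reallocation only when it exceeds some threshold, since moving work has its own cost.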

Flat model

Key challenges

Keep memory requirements small and constant.

Thread based MPI (rank per thread).

Introduction of high levels of asynchrony: MPI collectives, APGAS, data-flow, …

Adapt communication subsystems (routing, mapping, RDMA, …) to application characteristics

Improve the performance of basic process management and synchronization mechanisms

Resource requirements (computing power, memory, network) of the runtime implementation

Overcome limitations deriving from globally synchronizing calls (barriers, collectives, …)

Optimize usage of communication resources

MPI: Leverage current applications

2-5 years



Increased scalability
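The asynchrony and overlap of communication with computation listed for the flat model can be illustrated with a double-buffering sketch. It is not MPI code: `fetch` and `compute` are hypothetical callables standing in for a data transfer and a kernel, and a helper thread plays the role of the asynchronous transfer.

```python
from concurrent.futures import ThreadPoolExecutor


def process_with_overlap(n_blocks, fetch, compute):
    """Prefetch block i+1 on a helper thread while computing on block i.

    fetch(i) returns the data of block i (stands in for communication);
    compute(block) returns the result for one block.
    """
    results = []
    if n_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)
        for i in range(n_blocks):
            block = pending.result()  # wait until block i has arrived
            if i + 1 < n_blocks:
                pending = pool.submit(fetch, i + 1)  # start the next transfer
            results.append(compute(block))  # runs while the transfer proceeds
    return results


if __name__ == "__main__":
    print(process_with_overlap(4, lambda i: list(range(i, i + 3)), sum))
```

In CPython the overlap is real only when `fetch` blocks on I/O or otherwise releases the interpreter lock; the structure of the loop is the point here, not the achieved speedup.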

Hierarchical/hybrid

Key challenges

Hierarchical integration of runtimes (MPI+PGAS, MPI+threaded+accelerator, MPI+accelerator, PGAS+accelerator, …)

Modularity, reusability. Library compatibility.

Dimensioning of processes/threads. Scheduling, mapping to nodes

Memory placement and thread affinity

Match between model semantics at the different levels

Match platform structure, efficient usage of resources

Constrain size of name spaces



Better match to hardware (e.g. shared memory within node)

Interaction with Load balance and Job scheduling

Enable smooth migration path

Improved performance

5 years
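For the "dimensioning of processes/threads" item, a hybrid run time has to choose how many processes to place on a node and how many threads each one runs. The sketch below enumerates candidates under two assumed rules that are illustrative, not taken from the slides: every core is used exactly once, and the number of processes per node is a multiple of the number of NUMA domains so that processes spread evenly over them. The function name `dimension_node` is hypothetical.

```python
def dimension_node(cores_per_node, numa_domains):
    """List (ranks_per_node, threads_per_rank) pairs that fill a node exactly.

    Only rank counts that are multiples of numa_domains are kept, so the
    ranks can be distributed evenly over the NUMA domains.
    """
    if cores_per_node <= 0 or numa_domains <= 0:
        raise ValueError("cores_per_node and numa_domains must be positive")
    candidates = []
    for ranks in range(1, cores_per_node + 1):
        if cores_per_node % ranks == 0 and ranks % numa_domains == 0:
            candidates.append((ranks, cores_per_node // ranks))
    return candidates


if __name__ == "__main__":
    # 16 cores in 2 NUMA domains: from 2 ranks x 8 threads down to 16 ranks x 1 thread.
    print(dimension_node(16, 2))
```

Choosing among the candidates is the actual research question; it depends on the application's memory footprint, communication pattern and load balance.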

[Roadmap chart, 2010 to 2019. Tracks shown: heterogeneity, power management, job scheduling, asynchrony/overlap, hierarchy, dynamic memory association (marked "??"), load balance, memory efficiency, scheduling for locality, resilience.]

4.1.2. Run time: what & why

Assume responsibility for matching algorithm characteristics/demands to available resources, optimizing their usage

Run the “same” source on 2 different heterogeneous systems. Do it for a couple of kernels and real applications

Run times performing dynamic memory association (work arrays, renaming, …), tolerating functional noise

2010 2011 2012 2013 2014 2015 2016 2017 2018 2019

Dynamicity, decoupling algorithm from resources

General-purpose run time automatically achieving load balance, optimized network usage, power minimization, malleability, tolerance to performance noise, … on heterogeneous systems

Demonstration that automatic locality-aware scheduling can gain a factor of 5x in highly NUMA memory hierarchies

Run times tolerating an injection rate of 10 errors/hour

Demonstrate that asynchrony can deliver 3x strong scalability for both flat and hybrid systems

By this time EVERYBODY will be fed up with writing the same application again and again

The alternative will be to use current machines

So that we can finally rest

Machines will fail more than we do

A target if we want to get there

Fighting variance is a lost battle, learn to live with it
