Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin,...

Weighted Dynamic Scheduling with Many Parallelism Grains for

Offloading of Numerical Workloads to Multiple Varied

Accelerators

Azzam Haidar Yulu Jia

Piotr Luszczek Stanimire Tomov

Asim YarKhan Jack Dongarra

Problem Statement: Factorization

11/16/15, Austin, TX, USA ScalA 2015 2

Ax = b PA = LU Ly = P-1b Ux = y

GETF2 TRSM GEMM

Problem Statement: Algorithm

GETF2 TRSM GEMM

Problem Statement: Mapping to Hardware

GETF2 TRSM GEMM

From Code to DAG

GETRF(A[1:n,1:n]){fori=1,nb,2*nb,…{ GETF2(A[i:n,i:i+nb]) TRSM(A[i:i+nb,i:i+nb],A[i:i+nb,i:n]) GEMM(A[i:n,i:i+nb],A[i:i+nb,i:n],A[i+nb:n,i+nb:n])}

GETF2(A[1,1],A[2,1],…)TRSM(A[1,1],A[1,2])TRSM(A[1,1],A[1,3])GEMM(A[2,1],A[1,2],A[2,2])

Startwithcanonicalcode:

Unwindthefunc:oncalls: Trackdependences:

Thefunc:oncallsandtheirdependencesformaDAG.

Matrix-View of Dependences

XeonPhi

From Code to DAG

Asynchronous Algorithm

Panelfactoriza:on

Pivo:ng(rowswaps)

Triangularsolve

Matrixmul:ply(Schurcomplement)

Scheduletasksaccordingto:-datasizes-hardwareweights-affinity-performlookahead

Main Features of the Scheduler

•  Resource capability weight for the task – w(kernel, device)

•  Adaptive scheduling with weights •  Dynamic lookahead •  Enhanced task priorities – Regulates lookahead (DAG exploration order)

•  Transparent data movement – Uses (and tracks) asynchronous data transfers

•  Dynamic data redistribution – Data is moved as needed and only if needed

Runtime Scheduler Loop

defmain_thread_loop(user_code,queues,threshold):#ifthereareenoughtasksforcoresifqueues.total_length()>threshold: #resumeuser’scodeforsubmittingtasks task=user_code.get_next_task() q=queues.find_closes_queue(task.devices()) q.insert(task,task.priority())else: task=queues.steal_task(main_cpu) #iftasksavailableforstealing ifnottask.is_empty(): task.execute()#executesingletask else: queues.wait_for_tasks()

Effect of Dynamic Scheduling

Performance: Kepler K20

Performance: Xeon Phi

Performance: Xeon Phi + Kepler K40

Summary of Contributions

•  Fine- and coarse-grained tasks for scheduling •  Capability weights for hardware description •  Unified scheduling across CPUs, GPUs, and

coprocessors •  Synchronous memory-transfer model with

transparent asynchronous progress

Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin,...

Documents

Transcript of Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin,...

CPU Scheduling - Interdisciplinary Scheduling 2 Roadmap ... • Multilevel-feedback-queue scheduler defined by the following parameters: – number of queues – scheduling algorithms

Title (Arial Bold Italic 20pt) - Hot Chips ALU LAGU SAGU Div Mul Ld/St Queues BU FP Decode Rename VALU VALU FPAdd FPMul VIMul St Conv. 32KB DCache Int Rename Scheduler Scheduler FP

9. Alternative Konzepte: Parallele funktionale Programmierung Implicit Parallelism Controlled Parallelism Explicit Parallelism Data Parallelism Control.

Improving Utilization and Parallelism of Hadoop Cluster by Elastic … · 2019-12-18 · Fig. 4. Scheduling elastic containers by the default scheduler and by an ideal scheduler.

(kpreemptive) - Forsiden special hardware into a virtual clusion ... ("pseudoparallelism") 'Multiprocessor: Overlapping ("true parallelism") Terminate (call scheduler)

Stacks and Queues. Abstract Data Types Stacks Queues Priority Queues September 2004John Edgar2.

Queues & Priority Queues Radix Sort Heap Sort. Outline n Queues queue operationsqueue operations algorithms with queuesalgorithms with queues radix sortradix.

Enabling Task Parallelism in the CUDA Schedulerecosimulation.com/chrisgregg/Publications/TaskParallelismCuda.pdfEnabling Task Parallelism in the CUDA Scheduler ... since OpenGL and

Background on Concurrency & Parallelism in Java (Part 2)schmidt/cs891f/2018-PDFs/... · e.g., Java executor framework, synchronizers, blocking queues, atomics, & concurrent collections

Refactoring Conventional Task Schedulers to Exploit ...€¦ · Task parallelism Data parallelism. Contribution Asymmetry-oblivious scheduler Asymmetry-aware + DLA library Task parallelism

Summarizing presentation of Scheduler Activations – A different approach to parallelism

Data Sheet - Cutter Networks - Call:727-398-5252 for … · • Carrier Ethernet demarcation device delivering ... Every flow has its own queues and scheduler. ETX-201A supports up

Maximo Scheduler & Scheduler Plus Overview

Lecture 8: OpenMP. Parallel Programming Models Parallel Programming Models: Data parallelism / Task parallelism Explicit parallelism / Implicit parallelism.

Code scheduling for multiple instruction stream architectures · Instruction Level Parallelism on MIMD Architectures 245 to VLIW, and the queues provide the same dynamic scheduling

Maximo Scheduler & Scheduler Plus

CMSC 421 Spring 2004 Section 0202 Contents …dchakr1/421/slides/pdf/Ch6.pdfMultilevel-feedback-queue scheduler is defined by the following parameters)number of queues)scheduling algorithms

Parallelism and Faulty Parallelism Game quiz

Hardware Parallelism vs. Software Parallelism · 2019-02-25 · Hardware shared library ... Hardware Parallelism vs. Software Parallelism USENIX Workshop on Hot Topics in Parallelism

Concurrency, Parallelism, Events, Asynchronicity, Oh My threads, processes, queues & workers, multiprocessing, gevent, twisted, yo momma SoCal Piggies.