Page 1

Starling: A Scheduler Architecture for High Performance Cloud Computing

Hang Qu, Omid Mashayekhi, David Terei, Philip Levis
Stanford Platform Lab Seminar

May 17, 2016

Page 2

High Performance Computing (HPC)

• Arithmetically dense codes (CPU-bound)
• Operate on in-memory data

• A lot of machine learning is HPC

(Images: SpaceX, NOAA, Pixar)

Page 3

HPC on the Cloud

• HPC has struggled to use cloud resources
• HPC software has certain expectations
  – Cores are statically allocated and never change
  – Cores do not fail
  – Every core has identical performance

• The cloud doesn’t meet these expectations
  – Elastic resources, dynamically reprovisioned
  – Expect some failures
  – Variable performance

Page 4

HPC in a Cloud Framework?

• Strawman: run HPC codes inside a cloud framework (e.g., Spark, MapReduce, Naiad, etc.)
  – The framework handles all of the cloud’s challenges for you: scheduling, failure, resource adaptation

• Problem: too slow (by orders of magnitude)
  – Spark can schedule 2,500 tasks/second
  – A typical HPC compute task is 10ms
    ‣ Each core can execute 100 tasks/second
    ‣ A single 18-core machine can execute 1,800 tasks/second
  – Queueing theory and batch processing mean you want to operate well below the maximum scheduling throughput

Page 5

Starling

• A scheduling architecture for high performance cloud computing (HPCC)
• The controller decides data distribution; workers decide what tasks to execute
  – In steady state, no worker/controller communication except periodic performance updates
• Can schedule up to 120,000,000 tasks/s
  – Scales linearly with the number of cores

• HPC benchmarks run 2.4-3.3x faster

Page 6

Outline

• HPC background
• Starling scheduling architecture
• Evaluation
• Thoughts and questions

Page 7

High Performance Computing

• Arithmetically intense floating point problems
  – Simulations
  – Image processing
  – If you might want to do it on a GPU but it’s not graphics, it’s probably HPC

• Embedded in a geometry, data dependencies

Page 8

Example: Semi-Lagrangian Advection

A fluid is represented as a grid. Each cell has a velocity. Update value/velocity for the next time step.


Page 10

Example: Semi-Lagrangian Advection

A fluid is represented as a grid. Each cell has a velocity. Update value/velocity for the next time step.

Page 11

Example: Semi-Lagrangian Advection

A fluid is represented as a grid. Each cell has a velocity. Update value/velocity for the next time step.

Walk velocity vector backwards, interpolate

Page 12

Example: Semi-Lagrangian Advection

A fluid is represented as a grid. Each cell has a velocity. Update value/velocity for the next time step.

Walk velocity vector backwards, interpolate

What if partitions A+B are on different nodes?


Page 13

Example: Semi-Lagrangian Advection

A fluid is represented as a grid. Each cell has a velocity. Update value/velocity for the next time step.

Walk velocity vector backwards, interpolate

What if partitions A+B are on different nodes?

Each partition has partial data from other, adjacent partitions (“ghost cells”).
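To make the ghost-cell dependency concrete, here is a minimal, illustrative sketch of semi-Lagrangian advection on a 1D grid with one layer of ghost cells; the grid size, ghost width, and constant velocity are assumptions for the example, not taken from the talk.

#include <math.h>
#include <stdio.h>

#define N     8      /* interior cells owned by this partition */
#define GHOST 1      /* copies of the neighbors' boundary cells */
#define DX    1.0
#define DT    0.5

int main(void) {
    /* value[0] and value[N+1] are ghost cells filled from adjacent partitions. */
    double value[N + 2*GHOST], velocity[N + 2*GHOST], next[N + 2*GHOST];
    for (int i = 0; i < N + 2*GHOST; i++) { value[i] = i; velocity[i] = 1.0; }

    for (int i = GHOST; i < N + GHOST; i++) {
        /* Walk the velocity vector backwards to the departure point... */
        double x = i - velocity[i] * DT / DX;
        /* ...then interpolate the old field at that point; near the partition
           boundary this read lands in a ghost cell. */
        int    lo = (int)floor(x);
        double w  = x - lo;
        next[i] = (1.0 - w) * value[lo] + w * value[lo + 1];
    }
    for (int i = GHOST; i < N + GHOST; i++)
        printf("cell %d: %.3f\n", i, next[i]);
    return 0;
}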

Page 14

High Performance Computing

• Arithmetically intense floating point problems
  – Simulations
  – Image processing
  – If you might want to do it on a GPU but it’s not graphics, it’s probably HPC
• Embedded in a geometry, data dependencies
  – Many small data exchanges after computational steps
  – Ghost cells, reductions, etc.

Page 15

Most HPC Today (MPI)


while (time < duration) {
  // locally calculate, then global min
  dt = calculate_dt();
  // locally calculate, then exchange
  update_velocity(dt);
  // locally calculate, then exchange
  update_fluid(dt);
  // locally calculate, then exchange
  update_particles(dt);
  time += dt;
}

Control flow is explicit in each process.

Operate in lockstep. Partitioning and communication are static. If a single process fails, the program crashes. Runs as fast as the slowest process. Assumes every node runs at the same speed.

Problems in the cloud
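As a concrete illustration of the first step above ("locally calculate, then global min"), here is a minimal MPI sketch; local_timestep() is a stand-in helper, not code from the talk, and the rest of the loop would follow the same pattern with explicit exchanges.

#include <mpi.h>
#include <stdio.h>

/* Stand-in for the real CFL computation on this rank's partition. */
static double local_timestep(void) { return 0.01; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local_dt = local_timestep();
    double dt;
    /* Every rank blocks here until all ranks arrive: this lockstep collective
       is what makes stragglers and failures so costly in the cloud. */
    MPI_Allreduce(&local_dt, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("global dt = %f\n", dt);

    MPI_Finalize();
    return 0;
}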


Page 17

HPC inside a cloud framework?

(Diagram: on the left, HPC code runs directly on the cloud (EC2, GCP); on the right, the HPC code runs on top of a cloud framework that handles load balancing, failure recovery, elastic resources, and stragglers.)

Page 18

Frameworks too slow

• Depend on a central controller to schedule tasks to worker nodes
• A controller can schedule ~2,500 tasks/second
• Fast, optimized HPC tasks can be ~10ms long
  – An 18-core worker executes 1,800 tasks/second
  – Two workers overwhelm a scheduler
  – Prior work is fast or distributed, not both

(Diagram: the driver submits tasks to the central controller, which dispatches tasks to the workers and collects stats from them.)

Page 19

Outline

• HPC background
• Starling scheduling architecture
• Evaluation
• Thoughts and questions

Page 20

Best of both worlds?

• Depend on a central controller to
  – Balance load
  – Recover from failures
  – Adapt to changing resources
  – Handle stragglers
• Depend on worker nodes to
  – Schedule computations
    ‣ Don’t execute in lockstep
    ‣ Use the AMT/software processor model (track dependencies)
  – Exchange data

Page 21

Control Signal

• Need a different control signal between controller and workers to schedule tasks
• Intuition: tasks execute where data is resident
  – Data moves only for load balancing/recovery
• The controller determines how data is distributed
• Workers generate and schedule tasks based on what data they have

Page 22

Controller Architectures

(Diagram: Spark controller architecture. On the controller, the driver program and RDD lineage feed a scheduler that builds the task graph and dispatches tasks (T1-T4) to per-worker task queues, while a partition manager master records which worker holds each partition (P1, P2, P3 on W1/W2). Partition manager slaves on the workers report their local partitions back to the controller.)

Page 23

Controller Architectures

(Diagram: Spark controller vs. Starling controller. In Spark, the controller runs the driver program, RDD lineage, scheduler, task graph, and partition manager master, and pushes tasks to per-worker task queues while the slaves report data placement. In Starling, the controller’s partition manager only decides data placement, broadcasts it, and specifies a stage to snapshot at; each Canary worker runs its own driver program, scheduler, task graph, and task queue against a partition manager replica.)

Page 24

Partitioning Specification

• The driver program computes valid placements
  – Task accesses determine a set of constraints
    ‣ E.g., A1 and B1 must be on the same node, A2 and B2, etc.
• Micropartitions (overdecomposition)
  – If the expected maximum number of cores is n, create 2n-8n partitions (a placement sketch follows below)
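As a hedged illustration of what a placement that respects these constraints might look like (the group and worker counts are invented, and this is not Starling's placement algorithm), the controller can treat each set of partitions that must stay together as one unit and place units round-robin across workers:

#include <stdio.h>

#define NUM_GROUPS  64   /* e.g. 4x overdecomposition of 16 expected cores */
#define NUM_WORKERS 4

int main(void) {
    int placement[NUM_GROUPS];
    /* Round-robin keeps load roughly even; the controller can later move whole
       groups to rebalance without breaking the co-location constraints. */
    for (int g = 0; g < NUM_GROUPS; g++)
        placement[g] = g % NUM_WORKERS;

    for (int g = 0; g < NUM_GROUPS; g++)
        printf("group %d (e.g. A%d + B%d) -> worker %d\n", g, g, g, placement[g]);
    return 0;
}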

Page 25

Managing Migrations

• The controller does not know where workers are in the program
• Workers do not know where the other workers are in the program
• When a data partition moves, need to ensure that the destination picks up where the source left off

Page 26

Example Task Graph

(Figure: example task graph for an iterative gradient computation over two partitions. Per-partition stages gatherW, gradient, and scatterG read and write the local weight (LW), training data (TD), and local gradient (LG) of partitions 0 and 1; global stages scatterW, gatherG, sum, and again? read and write the global weight (GW) and global gradient (GG). Legend: LW = local weight, TD = training data, LG = local gradient, GW = global weight, GG = global gradient.)

Page 27

Execution Tracking

(Figure: the task graph annotated with stage IDs (e.g., scatterW at stage 71), alongside a per-worker metadata table that maps each partition to the last stage that accessed it and marks each worker’s current program position.)

Each worker executes the identical program in parallel. Control flow is identical (like GPUs). Tag each partition with the last stage that accessed it. Spawn all subsequent stages. Data exchanges implicitly synchronize.

Page 28

Implicitly Synchronize

(Figure: the same example task graph as on page 26. Legend: LW = local weight, TD = training data, LG = local gradient, GW = global weight, GG = global gradient.)

Stage gatherG0 cannot execute until both gradient0 and gradient1 complete

Stage again?0 cannot execute until sum0 completes

Page 29

Spawning and Executing Local Tasks

• The metadata for each data partition is the last stage that reads from or writes to it.
• After finishing a task, a worker:
  – Updates the metadata.
  – Examines candidate tasks that operate on data partitions generated by the completed task.
  – Puts a candidate task into the ready queue if all data partitions it operates on are (1) local, and (2) modified by the right tasks (see the sketch below).
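A minimal sketch of this rule, using simplified, illustrative types rather than Starling's actual data structures:

#include <stdbool.h>
#include <stddef.h>

struct partition {
    bool local;        /* is this partition resident on this worker? */
    int  last_stage;   /* id of the last stage that read or wrote it */
};

struct task {
    int  stage;                  /* stage this task belongs to */
    int  num_parts;
    struct partition *parts[8];  /* partitions the task reads or writes */
    int  expected_stage[8];      /* stage that must have last touched each one */
};

/* A candidate is ready only if every partition it touches is local and was
   last modified by the expected predecessor stage. */
bool task_ready(const struct task *t) {
    for (int i = 0; i < t->num_parts; i++)
        if (!t->parts[i]->local || t->parts[i]->last_stage != t->expected_stage[i])
            return false;
    return true;
}

/* After a task finishes: tag its partitions with its stage, then re-examine
   candidate tasks that consume those partitions and enqueue any that are ready. */
void on_task_complete(struct task *done,
                      struct task **candidates, size_t n,
                      void (*enqueue)(struct task *)) {
    for (int i = 0; i < done->num_parts; i++)
        done->parts[i]->last_stage = done->stage;
    for (size_t c = 0; c < n; c++)
        if (task_ready(candidates[c]))
            enqueue(candidates[c]);
}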

Page 30

Two Communication Primitives

• Scatter-gather
  – Similar to MapReduce, Spark GroupBy, etc.
  – Takes one or more datasets as input
  – Produces one or more datasets as output
  – Used for data exchange
• Signal
  – Delivers a boolean result to every node
  – Used for control flow (see the sketch below)
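A hedged sketch of how the signal primitive can drive control flow; the function names and single-node stand-ins below are invented for illustration and are not Starling's API:

#include <stdbool.h>
#include <stdio.h>

/* Single-node stand-ins so the sketch runs; in Starling these would be the
   distributed scatter-gather and signal primitives. */
static void scatter_gather_exchange(void) { /* e.g. exchange ghost cells */ }
static double residual = 1.0;
static double local_residual(void) { return residual /= 10.0; }
static bool signal_all(bool local_vote) { return local_vote; }

int main(void) {
    bool again = true;
    int steps = 0;
    while (again) {
        scatter_gather_exchange();   /* data exchange between compute steps */
        /* The signal delivers the same boolean to every node, so all workers
           take the same branch and exit the loop on the same iteration. */
        again = signal_all(local_residual() > 1e-6);
        steps++;
    }
    printf("converged after %d steps\n", steps);
    return 0;
}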

Page 31

EVALUATION

Page 32

Evaluation Questions

• How many tasks/second can Starling schedule?
• Where are the scheduling bottlenecks?
• Does this improve computational performance?
  – Logistic regression
  – K-means, PageRank
  – Lassen
  – PARSEC fluidanimate
• Can Starling adapt HPC applications to changes in available nodes?

Page 33

Scheduling Throughput

(Plot: CPU utilization (40-100%) versus average task length (0-70 µs); at 6.6 µs tasks, utilization is 90%.)

A single core can schedule 136,000 tasks/second while using 10% of CPU cycles. Scheduling puts a task on a local queue (no locks); no I/O needed.

Page 34

Scheduling Throughput

Scheduling throughput increases linearly with the number of cores: 120,000,000 tasks/second on 1,152 cores (10% CPU overhead).

Bottleneck is data transmission on partition migration.

Page 35

Overprovisioned Scheduling

Workload is 3,600 tasks/second. Queueing delay due to batched scheduling means much higher scheduling throughput is needed.

Page 36

Queueing Delay

(Diagram: four cores each run one 1-second task, but the controller dispatches only 4 tasks/second, so the tasks start at 0.00s, 0.25s, 0.50s, and 0.75s and the last one finishes at 1.75s.)

Load is 4 tasks/second. If the scheduler can handle only 4 tasks/second, tasks are dispatched one every 0.25s, so the last task does not start until 0.75s: queueing delay increases execution time from 1s to 1.75s (75%).

Page 37

Logistic Regression


Optimal micropartitioning on 32 workers is >18,000 tasks/second

Controller: c4.large instance (1 core). 32 workers: c4.8xlarge instances (18 cores each).

Page 38

Lassen (Eulerian)

Controller: c4.large instance (1 core). 32 workers: c4.8xlarge instances (18 cores each).

Micropartitioning + load balancing runs 3.3x faster than MPI.

Page 39

PARSEC fluidanimate (Lagrangian)

Controller: c4.large instance (1 core). 32 workers: c4.8xlarge instances (18 cores each).

Micropartitioning + load balancing runs 2.4x faster than MPI.

Page 40

Migration Costs

32 nodes, reprovisioned so the computation moves off of 16 nodes onto 16 new nodes. Data transfer takes several seconds at 9Gbps.

Migration times are huge compared to task execution times. Migration is not a bottleneck at the controller.

Page 41

Starling Evaluation Summary

• Can schedule 136,000 tasks/s/core
  – 120,000,000/s on 1,152 cores
  – Need to overprovision scheduling throughput
• Schedules HPC applications on >1,000 cores
• Micropartitions improve performance
  – Increase required throughput: 5-18,000 tasks/s on 32 nodes
• A central load balancer improves performance
  – 2.4-3.3x for HPC benchmarks
• Can adapt to changing resources

Page 42

Outline

• HPC background
• Starling scheduling architecture
• Evaluation
• Thoughts and questions

Page 43

Thoughts

• Many newer cloud computing workloads resemble high performance computing
  – I/O-bound workloads are slow
  – CPU-bound workloads are fast
• Next generation systems will draw from both
  – Cloud computing: variation, completion time, failure, programming models, system decomposition
  – HPC: scalability, performance

Page 44

Thank you!

Philip Levis
Hang Qu
Omid Mashayekhi
David Terei
Chinmayee Shah