Fair Share Scheduling

Fair Share Scheduling

Ethan Bolker

Mathematics & Computer Science UMass Boston

[email protected]

www.cs.umb.edu/~eb

Queen’s University

March 23, 2001

2

References• www.bmc.com/patrol/fairshare • www/cs.umb.edu/~eb/goalmode

• Yiping Ding• Jeff Buzen• Dan Keefe• Oliver Chen• Chris Thornley

Acknowledgements• Aaron Ball• Tom Larard• Anatoliy Rikun• Liying Song

3

Coming Attractions

• Queueing theory primer • Fair share semantics• Priority scheduling; conservation laws• Predicting response times from shares

– analytic formula– experimental validation– applet simulation

• Implementation geometry

4

Transaction Workload• Stream of jobs visiting a server

(ATM, time shared CPU, printer, …)• Jobs queue when server is busy• Input:

– Arrival rate: job/sec – Service demand: s sec/job

• Performance metrics:– server utilization: u = s (must be

1)– response time: r = ??? sec/job (average)– degradation: d = r/s

5

Response time computations• r, d measure queueing delay

r s (d 1), unless parallel processing possible

• Randomness really mattersr = s (d = 1) if arrivals scheduled (best case, no waiting)

r >> s for bulk arrivals (worst case, maximum delays)

• Theorem. If arrivals are Poisson and service is exponentially distributed (M/M/1) then d = 1/(1- u) r = s/(1- u)

• Think: virtual server with speed 1-u

6

M/M/1• Essential nonlinearity often counterintuitive

– at u = 95% average degradation is 1/(1-0.95) = 20,– but 1 customer in 20 has no wait at all (5% idle time)

• A useful guide even when hypotheses fail– accurate enough ( 30%) for real computer systems– d depends only on u: many small jobs have same impact as few

large jobs– faster system smaller s smaller u r = s/(1-u)

double win: less service, less wait– waiting costly, server cheap (telephones): want u 0– server costly (doctors): want u 1 but scheduled

7

• Customers want good response times

• Decreasing u is expensive

• High end Unix offerings from HP, IBM, Sun offer fair share scheduling packages that allow an administrator to allocate scarce resources (CPU, processes, bandwidth) among workloads

• How do these packages behave?

• Model as a black box, independent of internals

• Limit study to CPU shares on a uniprocessor

Scheduling for Performance

8

Multiple Job Streams

• Multiple workloads, utilizations u1, u2, …

• U = ui < 1

• If no workload prioritization then all degradations are equal: di = 1/(1-U)

• Share allocations are de facto prioritizations

• Study degradation vector V = (d1, d2, …)

9

Share Semantics

• Suppose workload w has CPU share fw

• Normalize shares so that w fw = 1

• w gets fraction fw of CPU time slices when at least one of its jobs is ready for service

• Can it use more if competing workloads idle?

No : think share = cap

Yes : think share = guarantee

10

Shares As Caps

dedicated system share f

u/f need f > u !

• Good for accounting (sell fraction of web server)• Available now from IBM, HP, soon from Sun • Straightforward (boring) - workloads are isolated• Each runs on a virtual processor with speed *= f

utilization u response time r r(1 u)/(f u)

11

Shares As Guarantees

• Good for performance + economy (use otherwise idle resources)

• Shares make a difference only when there are multiple workloads

• Large share resembles high priority: share may be less than utilization

• Workload interaction is subtle, often unintuitive, hard to explain

12

Modeling

PerformanceGoals

complex scheduling software

OS

qu

ery

up

dat

e

rep

ort

resp

onse

tim

e

workload

measure frequently

fastcomputation

analyticalgorithms

Model

13

Modeling• Real system

– Complex, dynamic, frequent state changes– Hard to tease out cause and effect

• Model– Static snapshot, deals in averages and probabilities– Fast enlightening answers to “what if ” questions

• Abstraction helps you understand real system• Start with a study of priority scheduling

Priority Scheduling• Priority state: order workloads by priority (ties OK)

– two workloads, 3 states: 12, 21, [12]– three workloads, 13 states:

• 123 (6 = 3! of these ordered states), • [12]3 (3 of these), • 1[23] (3 of these), • [123] (1 state with no priorities)

– n wkls, f(n) states, n! ordered (simplex lock combos)• p(s) = prob( state = s ) = fraction of time in state s• V(s) = degradation vector when state = s (measure this, or compute it

using queueing theory) • V = s p(s)V(s) (time avg is convex combination)• Achievable region is convex hull of vectors V(s)

15

Two workloads

d1

V(12) (wkl 1 high prio)

V(21)

V([12]) (no priorities)

achievable region

d2

d1 = d2

16

Two workloads

d1


V(21)


0.5 V

(12)

+ 0.

5V(2

1)

V([1

2])

d2

d1 = d2

17

Two workloads

d1


V(21)


d2

d1 = d2

note: u1 < u2 wkl 2 effect on wkl 1 large

18

Conservation• No Free Lunch Theorem. Weighted average degradation

is constant, independent of priority scheduling scheme:

i (ui /U) di = 1/(1-U)

• Provable from some hypotheses• Observable in some real systems• Sometimes false: shortest job first minimizes average

response time (printer queues, supermarket express checkout lines)

19

Conservation• For any proper set A of workloads

Imagine giving those workloads top priority. Then can pretend other wkls don’t exist. In that case

i A (ui /U(A)) di = 1/(1-U(A))When wkls in A have lower priorities they have higher degradations, so in general

i A (ui /U(A)) di 1/(1-U(A))

• These 2n -2 linear inequalities determine the convex achievable region R

• R is a permutahedron: only n! vertices

20

Two Workloads

u 1d1 + u 2d2 = 1/(1-U)

d1 : workload 1 degradation

conservation law:

(d1 , d2 ) lies on the line

d 2 : w

orkl

oad

2 de

grad

atio

n

21

d 1 1/(1- u1 )

constraint resultingfrom workload 1


d 2 : w

orkl

oad

2 de

grad

atio

nTwo Workloads

22

Workload 1 runs at high priority:

V(1,2) = (1 /(1- u1 ), 1 /(1- u1 )(1-U) )

d1 1 /(1- u1 )

constraint resultingfrom workload 1


d 2 : w

orkl

oad

2 de

grad

atio

nTwo Workloads

23

V(2,1)

d2 1 /(1- u2 )


d 2 : w

orkl

oad

2 de

grad

atio

nTwo Workloads

u 1d1 + u 2d2 = 1/(1-U)

24

V(2,1)

achievable region R


d 2 : w

orkl

oad

2 de

grad

atio

nTwo Workloads

u 1d1 + u 2d2 = 1/(1-U)

V(1,2) = (1 /(1- u1 ), 1 /(1- u1 )(1-U) )

d1 1 /(1- u1 )

d2 1 /(1- u2 )

25

Three Workloads• Degradation vector (d1,d2, d3) lies on plane u1 d1

+ u2 d2 + u3dr3 = C

• We know a constraint for each workload w: uw dw Cw

• Conservation applies to each pair of wkls as well: u1 d1 + u2 d2 C12

• Achievable region has one vertex for each priority ordering of workloads: 3! = 6 in all

• Hence its name: the permutahedron

26

3! = 6 vertices (priority orders)

23 - 2 = 6 edges(conservation constraints)

d2

d1

d3

V(2,1,3)

u1 r1 + u2 d2 + u3 d3 = C

V(1,2,3)

Three Workload Permutahedron

27

Experimental evidence

28

Four workload permutahedron4! = 24 vertices (ordered states)

24 - 2 = 14 facets (proper subsets)(conservation constraints)

74 faces (states)

Simplicial geometry and transportation polytopes,Trans. Amer. Math. Soc. 217 (1976) 138.

29

Map shares to degradations- two workloads -

• Suppose f1 and f2 > 0 , f1 + f2 = 1

• Model: System operates in state – 12 with probability f1

– 21 with probability f2

(independent of who is on queue)

• Average degradation vector:

V = f1 V(12) + f2 V(21)

Dec 13, 2000 Fair Share Scheduling 30

Predict Degradations From Shares(Two Workloads)

• Reasonable modeling assumption: f1 = 1, f2 = 0 means workload 1 runs at high priority

• For arbitrary shares: workload priority order is (1,2) with probability f1

(2,1) with probability f2 (probability = fraction of time)

• Compute average workload degradation: d1 = f1

(wkl 1 degradation at high priority) + f2 (wkl 1 degradation at low priority )

31

Model validation

32

Model validation

33

Map shares to degradations- three (n) workloads -

f1 f2 f3prob(123) = ------------------------------ (f1 + f2 + f3) (f2 + f3) (f3)

• Theorem: These n! probabilities sum to 1– interesting identity generalizing adding fractions– prove by induction, or by coupon collecting

• V = ordered states s prob(s) V(s) • O(n!), (n!), good enough for n 9 (12)

34

Model validation

35

Model validation

36

The Fair Share Applet• Screen captures on next slides are from

www.bmc.com/patrol/fairshare

• Experiment with “what if” fair share modeling

• Watch a simulation

• Random virtual job generator for the simulation is the same one used to generate random real jobs for our benchmark studies

37

1

2

3

• Three workloads, each with utilization 0.32 jobs/second 1.0 seconds/job = 0.32 = 32%

• CPU 96% busy, so average (conserved) response time is 1.0/(10.96) = 25 seconds

• Individual workload average response times depend on shares

Three Transaction Workloads

???

???

???

38

1

2

3

• Normalized f3 = 0.20 means 20% of the time workload 3 (development) would be dispatched at highest priority


• During that time, workload priority order is (3,1,2) for 32/80 of the time, (3,2,1) for 48/80

• Probability( priority order is 312 ) = 0.20(32/80) = 0.08

sum 80.032.0

48.0

20.0

39

• Formulas on previous slide

• Average predicted response time weighted by throughput 25 seconds (as expected)

• Hard to understand intuitively

• Software helps


40


note change from 32%

41

Simulation

jobs currently on run queue

42

When the Model Fails• Real CPU uses round robin scheduling to deliver time

slices

• Short jobs never wait for long jobs to complete

• That resembles shortest job first, so response time conservation law fails

• At high utilization, simulation shows smaller response times than predicted by model

• Response time conservation law yields conservative predictions

43

Scaling Degradation Predictions• V = ordered states s prob(s) V(s)

• Each s is a permutation of (1,2, … , n)

• Think of it as a vector in n-space

• Those n! vectors lie on of a sphere

• For n large they are pretty densely packed

• Think of prob(s) as a discrete approximation to a probability distribution on the sphere

• V is an integral

44

Monte Carlo

• loop sampleSize timeschoose a permutation s at random from the

distribution determined by the shares

compute degradation vector V(s)

accumulate V += prob(s)V(s)

• sampleSize = 40000 works well independent of n!

45

Map shares to degradations(geometry)

• Interpret shares as barycentric coordinates in the n-1 simplex

• Study the geometry of the map from the simplex to the n-1 dimensional permutahedron

• Easy when n=2: each is a line segment and map is linear

46

Mapping a triangle to a hexagon

f1 = 1 f 1 =

0

f3 = 0

f3 = 1

132

123

213

312

321

231

wkl 1 high priority

wkl 1 low priority

M

47


f1 = 1

{23}

f 1 =

0

48


49

What This Means

• Add a strong statement that summarizes how you feel or think about this topic

• Summarize key points you want your audience to remember

Fair Share Scheduling

Documents

Transcript of Fair Share Scheduling