1

Two Case Studies in Predictable Application Scheduling Using Rialto/NT

Michael B. Jones – Microsoft Research

John Regehr – University of Virginia

Stefan Saroiu – University of Washington

2

Application Case Studies

Two applications needing predictable execution on Windows 2000:
- Soft Modem Driver
- Digital Audio Player

The case studies:
- analyze behavior on normal Windows 2000
- study the improvements possible using the Rialto/NT CPU Reservation mechanism

3

Consumer Real-Time

General-purpose operating systems, such as Windows 2000:
- maximize aggregate throughput
- approximate fair sharing of the resources

Increasing use of time-dependent tasks:
- signal processing, audio, video

Need support for:
- predictable scheduling for independently developed applications
- low latency responses
- explicit resource allocation mechanisms

4

Rialto/NT Abstractions

Two real-time software abstractions:
- CPU Reservations: ongoing reservation for at least X time units out of every Y units for a thread
- Time Constraints: one-shot time reservation for a specified amount of work between a start time and a deadline

Case studies use only CPU Reservations
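As a minimal sketch of the "X out of every Y" idea, the snippet below models a reservation as an amount and a period and computes the CPU share it guarantees. The class name `CpuReservation` and its fields are hypothetical and are not the Rialto/NT API; the two example values are the 2ms/8ms modem reservation and the 1ms/16ms decoder reservation that appear on later slides.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class CpuReservation:
    """Hypothetical model of 'at least X time units out of every Y'."""
    amount_ms: int   # X: CPU time guaranteed in each period
    period_ms: int   # Y: length of the repeating period

    def __post_init__(self):
        if self.amount_ms <= 0 or self.period_ms <= 0:
            raise ValueError("amount and period must be positive")
        if self.amount_ms > self.period_ms:
            raise ValueError("amount cannot exceed the period")

    @property
    def cpu_fraction(self) -> Fraction:
        """Share of the CPU that the reservation guarantees."""
        return Fraction(self.amount_ms, self.period_ms)


modem = CpuReservation(amount_ms=2, period_ms=8)     # RES 2ms/8ms
decoder = CpuReservation(amount_ms=1, period_ms=16)  # 1ms every 16ms
print(f"modem:   {float(modem.cpu_fraction):.2%}")    # 25.00%
print(f"decoder: {float(decoder.cpu_fraction):.2%}")  # 6.25%
```

The exact `Fraction` keeps the share free of rounding error, which is convenient when comparing a reservation against a measured load.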

5

Rialto/NT Implementation

Rialto/NT developed on top of Windows 2000 priority scheduler

Limitations:
- CPU Reservations must be integer multiples of milliseconds
- Frequency of reservations must be power-of-two multiple of 1ms

6

First Case Study

Predictable Scheduling for a Soft Modem

7

Why Study Soft Modems?

Signal processing done on host CPU:
- requires predictable scheduling
- requires low latency responses

While coexisting with other system activities:
- Soft Modem is a background real-time task

Successful in home computer market:
- Low cost
- Easy to update: software upgrade

8

Methodology

Instrumented Windows 2000 performance kernel:
- Logs predefined and custom events
- Writes them to a memory buffer
- Dumps buffers to disk at end of trace

Driver software:
- No source for signal processing code

Measurement environment:
- All experiments run with a normal-priority spinning competitor thread
- System:
  - Windows 2000 Professional
  - Pentium II 450 MHz (uniprocessor)
  - 384 MB ECC SDRAM, 100 MB allocated to logging

9

Vendor Driver Version - Processing in Interrupt (INT)

Operation of the modem:
1. DMA transfers between A/D and D/A and physical memory
2. When enough data samples have accumulated, the modem raises an interrupt
3. Inside the ISR, process incoming data and provide outgoing samples before the buffers are exhausted

Uses input and output data buffers holding 512 16-bit samples (1024 bytes/buffer)

10

Three Additional Versions

DPC Version (DPC):
- The ISR queues a DPC
- DPC performs signal processing

Thread Version (THR):
- The ISR queues a DPC that signals a thread via a semaphore
- Thread performs signal processing
- Experimented with several different priorities

Rialto/NT Version (RES):
- Same as THR, but thread scheduled using Rialto/NT real-time periodic CPU Reservation

11

Interrupt Rate

3 different phases, interrupts very regular

[Figure: Rate of Interrupts (INT). x-axis: Time (seconds), 0 to 30; y-axis: Milliseconds, 0 to 35. Phase labels: On-hook, Dialing, Training, Connected.]

Falls within PC 99 recommended interrupt rates of 3-16 ms

12

Elapsed Times in ISR (INT)

PC 99 recommends that the maximum time during which a driver-based modem disables interrupts should not exceed 100 µs

Measured: 1.8 ms average, with a repeatable worst case of 3.3 ms

[Figure: Elapsed Times in Interrupt Handler (INT). x-axis: Time (seconds), 0 to 30; y-axis: Milliseconds, 0 to 3.5. Phase labels: On-hook, Dialing, Training, Connected.]

13

CPU Utilization

14.7% sustained load on 450MHz Pentium II

[Figure: CPU Load. x-axis: Time (seconds), 0 to 30; y-axis: CPU Load, 0% to 35%. Phase labels: On-hook, Dialing, Training, Connected.]

14

Elapsed Times in ISR (DPC)

ISR times now small, typically < 6 µs

[Figure: Elapsed Times In Interrupt Handler (DPC). x-axis: Time (seconds), 0 to 30; y-axis: Microseconds, 0 to 16. Phase labels: On-hook, Dialing, Training, Connected.]

15

Elapsed Times in Queued DPC

PC 99 recommends that the total execution time required for all queued DPCs should not exceed 500 µs

But now long DPC times: 1.8 ms average, 3.3 ms maximum (same as elapsed times in ISR for INT)

[Figure: Elapsed Times In Queued DPC (DPC). x-axis: Time (seconds), 0 to 30; y-axis: Milliseconds, 0 to 3.5. Phase labels: On-hook, Dialing, Training, Connected.]
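To make the size of the gap concrete, here is a small worked calculation comparing the measured signal-processing times against the two PC 99 figures quoted on the slides. It treats time spent in the ISR as time with interrupts disabled, which is an assumption made for illustration; the constant and function names are hypothetical.

```python
# PC 99 guidelines as quoted on the slides (microseconds).
PC99_MAX_INTERRUPTS_DISABLED_US = 100
PC99_MAX_QUEUED_DPC_US = 500

# Measured signal-processing times from the slides (microseconds).
MEASURED_US = {"average": 1800, "worst case": 3300}


def times_over(measured_us: float, limit_us: float) -> float:
    """How many times larger the measurement is than the guideline."""
    return measured_us / limit_us


for label, measured in MEASURED_US.items():
    isr = times_over(measured, PC99_MAX_INTERRUPTS_DISABLED_US)
    dpc = times_over(measured, PC99_MAX_QUEUED_DPC_US)
    print(f"{label}: {isr:.1f}x the 100 us guideline (INT), "
          f"{dpc:.1f}x the 500 us guideline (DPC)")
```

Under these assumptions the INT version exceeds its guideline by a factor of 18 on average and 33 in the worst case, while moving the same work to a DPC still exceeds the DPC guideline by a factor of about 3.6 to 6.6.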

16

Samples Pending to be Processed (INT & THR 24)

Small relative to 512 sample buffer size

[Figure: Samples Pending to be Processed (INT). x-axis: Time (seconds), 0 to 30; y-axis: Unprocessed Samples, 0 to 35. Phase labels: On-hook, Dialing, Training, Connected.]

[Figure: Samples Pending to be Processed (THR 24). x-axis: Time (seconds), 0 to 30; y-axis: Unprocessed Samples, 0 to 35. Phase labels: On-hook, Dialing, Training, Connected.]

17

Samples Pending to be Processed (THR 8)

Unsurprisingly, contention kills modem

[Figure: Samples Pending to be Processed (THR 8). x-axis: Time (seconds), 0 to 35; y-axis: Unprocessed Samples, 0 to 600. Phase labels: On-hook, Dialing, "Please hang up and try your call again".]

18

Latency Results

Set the multimedia timers to fire once every millisecond

Register a routine to be called every millisecond:
- Routine does very little work
- Stores cycle counter value and sleeps again

Histograms show differences between recorded times and ideal times
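The post-processing step can be sketched as follows: convert each stored cycle-counter value to microseconds, subtract the time at which that wakeup would ideally have occurred, and bin the differences. The function names are hypothetical, and defining the ideal time of wakeup i as "first wakeup plus i periods" is an assumption, since the slides do not say how the ideal times were derived.

```python
from collections import Counter


def wakeup_latencies_us(cycle_stamps, cpu_hz, period_us=1000):
    """Deviation of each recorded wakeup from its ideal time, in microseconds.

    Assumes wakeup i should ideally occur i periods after the first stamp.
    """
    start = cycle_stamps[0]
    latencies = []
    for i, stamp in enumerate(cycle_stamps):
        elapsed_us = (stamp - start) * 1_000_000 / cpu_hz
        latencies.append(elapsed_us - i * period_us)
    return latencies


def histogram(latencies_us, bin_us=50):
    """Count latencies per bin; each key is the lower edge of its bin."""
    return Counter(int(lat // bin_us) * bin_us for lat in latencies_us)


# Three wakeups on a 450 MHz machine: on time, on time, 100 us late.
stamps = [0, 450_000, 945_000]
lat = wakeup_latencies_us(stamps, cpu_hz=450_000_000)
print(lat)             # [0.0, 0.0, 100.0]
print(histogram(lat))  # two samples in bin 0, one in bin 100
```

Anchoring the ideal times to the first wakeup means the first latency is always zero; a different reference point would shift every value by a constant without changing the shape of the histogram.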

19

Coexisting Thread Latencies (Control Case - No Modem)

Maximum 1978 µs between wakeups

[Figure: Control Case - No Modem. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0% to 3.0%. One bar is labeled 96.8%.]

20

Coexisting Thread Latencies (INT)

Maximum 5313 µs between wakeups

[Figure: INT Version. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0% to 3.0%. One bar is labeled 83.1%.]

21

Coexisting Thread Latencies (DPC)

Maximum 4396 µs between wakeups

[Figure: DPC Version. x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0% to 3.0%. One bar is labeled 82.6%.]

22

Coexisting Thread Latencies (THR 24)

Maximum 2239 µs between wakeups

[Figure: THR Version (24). x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0% to 3.0%. One bar is labeled 93.8%.]

23

What Have We Learned So Far?

Signal processing in the context of the interrupt handler is:
- unnecessary
- detrimental to the latencies and predictability of coexisting activities

Vendor choice understandable:
- For any priority there is a potentially unbounded delay between the interrupt and the thread running

In practice:
- Delays are reasonable for well-configured systems [Intel OSDI ’99]
- Using interrupts is an extreme form of priority inflation

24

Two Possible Solutions

Rate Monotonic Analysis: determine the “right” priority assignments among all threads. Two problems:
- Assumes cooperative priority assignment among all threads, which is unrealistic
- Working priority assignment dependent upon timing requirements of all threads
  - Changes in application mix may require changes in priority assignments

Use a time-based real-time scheduler:
- Such as Rialto/NT

25

Samples Pending to be Processed (RES 2ms/8ms – 25%)

Fits well within 512-sample buffer size

[Figure: Samples Pending to be Processed (RES 2ms/8ms). x-axis: Time (seconds), 0 to 35; y-axis: Unprocessed Samples, 0 to 160. Phase labels: On-hook, Dialing, Training, Connected.]

26

Coexisting Thread Latencies (RES 2ms/8ms – 25%)

Maximum 1971 µs between wakeups

[Figure: RES Version (2ms/8ms). x-axis: Latency (microseconds); y-axis: Percentage of Callbacks, 0.0% to 7.0%. One bar is labeled 85.5%.]

27

File Transfer Times

Version         Min      Max       Mean      Std Dev   Passed
INT             36.334   36.398    36.367    0.029     10
DPC             36.272   36.447    36.396    0.048     10
THR Pri 24      36.319   36.475    36.384    0.056     10
RES 1ms/7ms     36.333   36.724    36.426    0.112     10
RES 2ms/13ms    36.288   36.975    36.547    0.232     10
RES 2ms/14ms    38.631   91.713    65.172    37.535    2
RES 3ms/15ms    36.275   36.586    36.387    0.108     10
RES 3ms/16ms    97.289   180.415   110.523   26.408    9
RES 4ms/16ms    36.255   37.116    36.415    0.256     10
RES 8ms/20ms    36.347   36.476    36.394    0.039     10

Results for 10 copies of 200,000 bytes each

For RES 1ms/8ms, 2ms/15ms, 3ms/17ms, 4ms/17ms, and 7ms/20ms, no test passed

28

Modem Reservation Ranges

Sensitivity to both percentage and gaps:
- If period < 12.5ms, must get 14.7% to work
- If period > 12.5ms, (period – amount) <= 12.5ms must also hold

[Figure: Modem Reservation Operating Ranges. x-axis: Reservation Period (ms), 0 to 22; y-axis: Reservation Amount (ms), 0 to 10. Legend: Sufficient, Marginal, Insufficient, Actual 14.7% of CPU, 12.5ms Gaps. Region labels: Sufficient CPU Percentage and Frequency; Gaps Too Long; Insufficient Percentage.]
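The two conditions can be written as a small predicate. This is a sketch rather than the authors' model: it reads the gap condition as (period − amount) ≤ 12.5 ms, which is what the "Gaps Too Long" region and the file transfer results indicate, and the function and constant names are hypothetical.

```python
MODEM_LOAD = 0.147   # sustained CPU load reported for the soft modem
MAX_GAP_MS = 12.5    # longest stretch per period without reserved CPU time


def reservation_sufficient(amount_ms: float, period_ms: float) -> bool:
    """True if a reservation meets both the percentage and the gap rule."""
    enough_cpu = amount_ms / period_ms >= MODEM_LOAD
    gap_ok = (period_ms - amount_ms) <= MAX_GAP_MS
    return enough_cpu and gap_ok


# Reservations from the file transfer results, as (amount_ms, period_ms).
for amount, period in [(2, 8), (4, 16), (8, 20), (1, 8), (3, 17), (7, 20)]:
    verdict = "sufficient" if reservation_sufficient(amount, period) else "insufficient"
    print(f"RES {amount}ms/{period}ms: {verdict}")
```

The rule is approximate near the boundaries. For example, 1ms/7ms is about 14.3% of the CPU, slightly under the 14.7% line, yet all ten of its transfers passed, while 2ms/14ms has the same percentage and behaved marginally.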

29

Soft Modem Conclusions

Signal processing in interrupt context is:
- Unnecessary
- Detrimental to the predictability and latencies of the coexisting activities

The DPC version has similar problems

Threads help alleviate these problems:
- Modem runs well with real-time priorities and non-real-time competition
- However, modem threads may interfere with other threads

Real-time scheduler allows:
- Control over modem’s degree of interference with other time-sensitive activities
- Performance isolation for threads using reservations

30

Industry Perspective

Vendor did try their own THR version:
- Worked fine during normal load
- However, modem was starved when:
  - Copying data between two IDE devices
  - Using USB scanner (Intel 440BX chipset) that turned off interrupts for 30-50 ms
- Therefore they shipped the INT version

Vendor is willing to be a “good citizen” only if ensured that others would be as well

Systematic latency timing verification of components is needed to enforce good behavior

31

Soft DSL is Coming

More demanding than soft modems:
- 4ms processing period

G.lite:
- 1.531Mbps downstream and 512Kbps upstream
- ~25% of a 600 MHz Pentium III

Full rate DSL:
- 3.062Mbps downstream and 512Kbps upstream
- Nearly 50% of a 600 MHz Pentium III

Soft Bluetooth period 312.5µs
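A rough worked calculation shows what these figures would mean as a per-period CPU budget. It assumes the 4ms processing period applies to both G.lite and full rate DSL and treats "~25%" and "nearly 50%" as exactly 0.25 and 0.50; the function name is hypothetical.

```python
def per_period_budget(cpu_fraction, period_ms, cpu_mhz):
    """Return (CPU milliseconds, CPU cycles) needed in each processing period."""
    busy_ms = cpu_fraction * period_ms
    cycles = busy_ms * cpu_mhz * 1000  # 1 MHz = 1000 cycles per millisecond
    return busy_ms, cycles


glite = per_period_budget(0.25, period_ms=4, cpu_mhz=600)
full_rate = per_period_budget(0.50, period_ms=4, cpu_mhz=600)
print(f"G.lite:    {glite[0]} ms, {glite[1]:,.0f} cycles per period")
print(f"Full rate: {full_rate[0]} ms, {full_rate[1]:,.0f} cycles per period")
```

Under these assumptions G.lite would need about 1 ms of CPU in every 4 ms and full rate DSL about 2 ms, a much tighter schedule than the soft modem's reservations.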

32

Further Soft Modem Studies

Software-based Digital Subscriber Line (SoftDSL) studies

Multiple Soft Modems within the same machine

Similar studies on multiprocessors

33

Second Case Study

Predictable Scheduling for Digital Audio

34

Methodology

Empirically reverse-engineer thread requirements in a complex, legacy soft real-time application without use of source code

Assign CPU reservations to threads without modifying the application

Measure application behavior during contention

35

Windows Media Player

Default player for mp3, wav, avi, mpeg

Experimental method:
- Modelled contention using spinning thread at various priorities
- Gave CPU Reservations to media player threads
- Played an mp3 song
- Listened for glitches
- Used instrumented kernel to detect buffer under-runs

36

Media Player Thread Structure (Simplified)

Thread Period (ms) Priority

Kernel Mixer (*) 10 24

MP3 Decoder (*) 100 9

User Interface 45 8

Disk Reader 2000 8

(*) Received CPU Reservations in some experiments.
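For comparison with the rate-monotonic approach mentioned in the soft modem study, the sketch below sorts the four threads by period, shortest first, which is the order in which a rate-monotonic assignment would rank them. It is an illustration only, not an analysis from the talk; the periods and priorities are copied from the table above.

```python
# (thread, period in ms, observed priority) from the table above.
threads = [
    ("Kernel Mixer", 10, 24),
    ("MP3 Decoder", 100, 9),
    ("User Interface", 45, 8),
    ("Disk Reader", 2000, 8),
]

# Rate-monotonic rule: the shorter the period, the higher the priority.
rm_order = [name for name, period, _ in sorted(threads, key=lambda t: t[1])]

# Observed ranking: highest priority first (ties keep table order).
observed_order = [name for name, _, prio in sorted(threads, key=lambda t: -t[2])]

print("rate-monotonic:", rm_order)
print("observed:      ", observed_order)
```

The two rankings differ only in the middle: the MP3 decoder has the longer period but the higher observed priority than the user interface thread.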

37

MP3 Playback w/o Contention

- Kmixer thread (top) runs every 10ms
- MP3 decoder (4th line) runs every 100ms
- Works fine

38

Starvation Caused by Competing Thread @ Priority 10

Media Player runs only when NT priority inversion avoidance logic kicks in

39

Media Player + Reservation

- 1ms every 16ms reserved for decoder thread
- Competing with priority 10 thread
- Works fine

40

Priority Inversion Caused by Competing Thread

Competitor thread (priority 9) preempts MP3 decoder while holding Kmixer buffer lock

- Kmixer misses next two time slots (marked x in the trace)
- Starves, causes audio glitch

Fix: raise decoder priority before grabbing lock

41

Media Player Deadlock

- Circular wait among Media Player threads
- Deadlock broken by a timeout
- Fix: file a bug report…

42

Media Player Results

Expected:
- In the presence of contention, the Windows priority scheduler allows real-time apps to starve
- This can be fixed by giving real-time threads a CPU Reservation

Unexpected:
- Competitor thread changes sequencing, exposes races in Media Player
- Hard to write correct programs with many threads & mutexes
- Fixed using priority ceiling emulation
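Priority ceiling emulation can be sketched as a lock wrapper that raises the acquiring thread to a ceiling priority before taking the lock and restores the old priority after releasing it. Python's `threading` module does not expose thread priorities, so the `SimThread` class below only does the bookkeeping; it and `CeilingLock` are hypothetical names, and a real implementation would call the platform's thread-priority API at the two marked points.

```python
import threading
from contextlib import contextmanager


class SimThread:
    """Hypothetical stand-in for a thread whose priority can be changed."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority


class CeilingLock:
    """Lock whose holder runs at the lock's ceiling priority or above."""

    def __init__(self, ceiling):
        self.ceiling = ceiling
        self._lock = threading.Lock()

    @contextmanager
    def held_by(self, thread):
        saved = thread.priority
        thread.priority = max(saved, self.ceiling)  # raise BEFORE acquiring
        self._lock.acquire()
        try:
            yield
        finally:
            self._lock.release()
            thread.priority = saved                 # restore AFTER releasing


decoder = SimThread("MP3 decoder", priority=9)
buffer_lock = CeilingLock(ceiling=24)  # priority of the Kmixer thread

with buffer_lock.held_by(decoder):
    # While holding the lock the decoder runs at priority 24, so a
    # priority 9 competitor cannot preempt it and delay the Kmixer.
    print(decoder.priority)  # 24
print(decoder.priority)      # 9
```

Setting the ceiling to the priority of the highest-priority thread that uses the lock is the usual choice; here that is the Kmixer thread at priority 24.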

43

Implications of Results

Periods of threads in complex legacy apps can be reverse engineered:
- Amounts are platform-dependent and are harder

Next step: store application requirements and use middleware to automatically assign reservations:
- No application support needed
- Potentially a way around the chicken/egg problem of using reservations in a world of legacy OSs and applications

44

Possible Continued Media Experiments

Study software DVD player:
- CPU intensive and time sensitive

45

Overall Conclusions

Status quo insufficient. Applications either:
- inflate their priorities, as did the soft modem driver
- or are at the mercy of applications that may be run at higher priorities, as is the case with the digital audio player

CPU Reservations solve this problem by allowing applications to reliably obtain the time they need while allowing other applications to do the same

46

For More Information

See Mike Jones ([email protected]): http://research.microsoft.com/~mbj/

or John Regehr ([email protected]): http://www.cs.utah.edu/~regehr/

or Stefan Saroiu ([email protected]): http://www.cs.washington.edu/homes/tzoompy/

Related papers at Mike’s web site