Stall-Time Fair Memory Access Scheduling

Onur Mutlu and Thomas Moscibroda

Computer Architecture Group

Microsoft Research

Multi-Core Systems

[Figure: a multi-core chip with four cores (Core 0 to Core 3), each with its own L2 cache, connected through a DRAM memory controller to a shared DRAM memory system with DRAM Bank 0 through DRAM Bank 7. The shared DRAM memory system is annotated with "unfairness".]

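DRAM Bank Operation

[Figure: a DRAM bank organized as rows and columns, with a row decoder, a column decoder, and a row buffer. Access sequence shown: (Row 0, Column 0) arrives while the row buffer is empty, and Row 0 is loaded into it; (Row 0, Column 1) and (Row 0, Column 9) are row-buffer HITs; (Row 1, Column 0) is a row-buffer CONFLICT, and Row 1 replaces Row 0 in the row buffer.]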
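The access sequence on the DRAM Bank Operation slide can be replayed with a toy model of a single bank. This is only a sketch: the `Bank` class and the outcome labels are illustrative names, not part of the talk, and timing is not modeled.

```python
from typing import Optional


class Bank:
    """Toy model of one DRAM bank with a single row buffer (illustrative)."""

    def __init__(self) -> None:
        self.open_row: Optional[int] = None  # None means the row buffer is empty

    def access(self, row: int, column: int) -> str:
        """Classify an access, then leave the accessed row in the row buffer.

        The column only selects data inside the row buffer, so it does not
        affect the classification.
        """
        if self.open_row is None:
            outcome = "empty"      # row must first be loaded into the buffer
        elif self.open_row == row:
            outcome = "hit"        # requested row is already in the buffer
        else:
            outcome = "conflict"   # a different row is open and must be replaced
        self.open_row = row
        return outcome


bank = Bank()
trace = [(0, 0), (0, 1), (0, 9), (1, 0)]  # (row, column) pairs from the slide
outcomes = [bank.access(row, column) for row, column in trace]
print(outcomes)
```

The two middle accesses are classified as hits and the last one as a conflict, matching the HIT and CONFLICT labels on the slide.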

DRAM Controllers

A row-conflict memory access takes significantly longer than a row-hit access.
Current controllers take advantage of the row buffer.
Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA’00]
  (1) Row-hit (column) first: Service row-hit memory accesses first
  (2) Oldest-first: Then service older accesses first
This scheduling policy aims to maximize DRAM throughput.
But it is unfair when multiple threads share the DRAM system.
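The two FR-FCFS rules can be read as a sort key: row hits before non-hits, then older before younger. The sketch below assumes a hypothetical `Request` record and an `open_rows` map from bank to the row currently in its row buffer; neither name comes from the talk, and real controllers also check DRAM timing constraints that are ignored here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    thread: int
    bank: int
    row: int
    arrival: int  # smaller value means an older request


def fr_fcfs_pick(requests: List[Request], open_rows: Dict[int, int]) -> Request:
    """Pick the next request: (1) row-hit first, (2) oldest first."""
    def key(req: Request):
        is_hit = open_rows.get(req.bank) == req.row
        return (not is_hit, req.arrival)  # False sorts before True
    return min(requests, key=key)


queue = [
    Request(thread=1, bank=0, row=16, arrival=0),  # oldest, but a row conflict
    Request(thread=0, bank=0, row=0, arrival=1),   # row hit
    Request(thread=0, bank=0, row=0, arrival=2),   # row hit, younger
]
chosen = fr_fcfs_pick(queue, open_rows={0: 0})
print(chosen)
```

With Row 0 open, the older request from thread 1 is bypassed by a younger row hit from thread 0, which is the behavior that becomes unfair when threads share the DRAM system.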

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

The Problem

Multiple threads share the DRAM controller.
DRAM controllers are designed to maximize DRAM throughput.
DRAM scheduling policies are thread-unaware and unfair
  Row-hit first: unfairly prioritizes threads with high row buffer locality
    Streaming threads
    Threads that keep on accessing the same row
  Oldest-first: unfairly prioritizes memory-intensive threads

The Problem

[Figure: a DRAM bank whose row buffer holds Row 0, with a request buffer containing many requests from T0, all to Row 0, interleaved with requests from T1 to Row 16, Row 111, and Row 5.]

T0: streaming thread
T1: non-streaming thread

Row size: 8KB, cache block size: 64B
128 requests of T0 serviced before T1
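The figure of 128 requests follows from the row and cache block sizes on the slide, assuming each request fetches exactly one cache block:

```python
ROW_SIZE_BYTES = 8 * 1024   # 8KB row, from the slide
CACHE_BLOCK_BYTES = 64      # 64B cache block, from the slide

# Assumption: each request fetches one cache block, so a single open row can
# satisfy this many consecutive row-hit requests from a streaming thread.
requests_per_row = ROW_SIZE_BYTES // CACHE_BLOCK_BYTES
print(requests_per_row)  # 128
```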

Consequences of Unfairness in DRAM

[Figure: bar chart with values 1.05, 1.85, 4.72, and 7.74, annotated "DRAM is the only shared resource".]

Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security’07]
System throughput loss
Priority inversion at the system/OS level
Poor performance predictability


Fairness in Shared DRAM Systems

A thread’s DRAM performance depends on its inherent
  Row-buffer locality
  Bank parallelism
Interference between threads can destroy either or both.
A fair DRAM scheduler should take into account all factors affecting each thread’s DRAM performance
  Not solely bandwidth or solely request latency
Observation: A thread’s performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time due to contention with other threads.

Stall-Time Fairness in Shared DRAM Systems

A DRAM system is fair if it slows down equal-priority threads equally
  Compared to when each thread is run alone on the same system
  Fairness notion similar to SMT [Cazorla, IEEE Micro’04][Luo, ISPASS’01], SoEMT [Gabor, Micro’06], and shared caches [Kim, PACT’04]
Tshared: DRAM-related stall-time when the thread is running with other threads
Talone: DRAM-related stall-time when the thread is running alone
Memory-slowdown = Tshared/Talone
The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance
  Considers inherent DRAM performance of each thread
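A small numeric sketch of the Memory-slowdown definition. The stall times below are made-up values chosen only to show that the thread with the larger absolute stall time is not necessarily the one slowed down the most.

```python
def memory_slowdown(t_shared: float, t_alone: float) -> float:
    """Memory-slowdown = Tshared / Talone."""
    if t_alone <= 0:
        raise ValueError("t_alone must be positive")
    return t_shared / t_alone


# Hypothetical stall times in cycles: (Tshared, Talone) per thread.
stall_times = {"A": (300.0, 100.0), "B": (600.0, 500.0)}
slowdowns = {name: memory_slowdown(shared, alone)
             for name, (shared, alone) in stall_times.items()}
print(slowdowns)
```

Thread B stalls for more cycles in total, but thread A is slowed down 3x relative to running alone while thread B is slowed down 1.2x, so a stall-time fair scheduler would favor thread A.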


STFM Scheduling Algorithm (1)

During each time interval, for each thread, the DRAM controller
  Tracks Tshared
  Estimates Talone
At the beginning of a scheduling cycle, the DRAM controller
  Computes Slowdown = Tshared/Talone for each thread with an outstanding legal request
  Computes unfairness = MAX Slowdown / MIN Slowdown
If unfairness < α (the unfairness threshold)
  Use DRAM throughput oriented baseline scheduling policy
    (1) row-hit first
    (2) oldest-first

STFM Scheduling Algorithm (2)

If unfairness ≥ α (the unfairness threshold)
  Use fairness-oriented scheduling policy
    (1) requests from thread with MAX Slowdown first
    (2) row-hit first
    (3) oldest-first
Maximizes DRAM throughput if it cannot improve fairness
Does NOT waste useful bandwidth to improve fairness
  If a request does not interfere with any other, it is scheduled
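The two-mode rule on the STFM Scheduling Algorithm slides can be sketched as a single selection function. This is a simplified illustration, not the hardware design: the `Request` record, the `slowdown` map, and the precomputed `row_hit` flag are assumptions, and per-bank scheduling and DRAM timing are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    thread: int
    row_hit: bool
    arrival: int  # smaller value means an older request


def stfm_pick(requests: List[Request],
              slowdown: Dict[int, float],
              alpha: float) -> Request:
    """Choose the next request using the unfairness threshold alpha."""
    threads = {r.thread for r in requests}  # threads with outstanding requests
    values = [slowdown[t] for t in threads]
    unfairness = max(values) / min(values)

    if unfairness < alpha:
        # Throughput-oriented baseline: row-hit first, then oldest first.
        return min(requests, key=lambda r: (not r.row_hit, r.arrival))

    # Fairness-oriented: most slowed-down thread first, then the baseline rules.
    slowest = max(threads, key=lambda t: slowdown[t])
    return min(requests,
               key=lambda r: (r.thread != slowest, not r.row_hit, r.arrival))


queue = [Request(thread=0, row_hit=True, arrival=5),
         Request(thread=1, row_hit=False, arrival=0)]
below = stfm_pick(queue, {0: 1.03, 1: 1.04}, alpha=1.10)
above = stfm_pick(queue, {0: 1.03, 1: 1.14}, alpha=1.10)
print(below.thread, above.thread)
```

When the slowdowns are close, the row hit from thread 0 is serviced first; once thread 1's slowdown pushes unfairness past the threshold, its request is serviced first even though it is not a row hit.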

How Does STFM Prevent Unfairness?

[Animated figure: the request buffer from the earlier example (T0 requests to Row 0; T1 requests to Row 16, Row 111, and Row 5) scheduled under STFM. The slide tracks T0 Slowdown, T1 Slowdown, and Unfairness, all starting at 1.00; the displayed values range from 1.00 to 1.14 as requests are serviced and the row buffer switches among Row 0, Row 16, and Row 111.]


Implementation

Tracking Tshared
  Relatively easy
  The processor increases a counter if the thread cannot commit instructions because the oldest instruction requires DRAM access
Estimating Talone
  More involved because thread is not running alone
  Difficult to estimate directly
  Observation: Talone = Tshared - Tinterference
  Estimate Tinterference: Extra stall-time due to interference

Estimating Tinterference (1)

When a DRAM request from thread C is scheduled
  Thread C can incur extra stall time:
    The request’s row buffer hit status might be affected by interference
    Estimate the row that would have been in the row buffer if the thread were running alone
    Estimate the extra bank access latency the request incurs

  Tinterference(C) += (Extra Bank Access Latency) / (# Banks Servicing C’s Requests)

  Extra latency amortized across outstanding accesses of thread C (memory level parallelism)
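A worked instance of the update for thread C itself. The function name and the numbers are hypothetical; the point is only that the extra latency is divided by the number of banks servicing the thread's requests, so a thread with more memory level parallelism is charged less per request.

```python
def own_interference_increment(extra_bank_latency: float,
                               banks_servicing_thread: int) -> float:
    """Extra Bank Access Latency / # Banks Servicing C's Requests."""
    if banks_servicing_thread < 1:
        raise ValueError("at least one bank must be servicing the thread")
    return extra_bank_latency / banks_servicing_thread


# Hypothetical numbers: interference adds 140 cycles to a request of thread C
# while 4 banks are servicing C's requests.
t_interference_c = 0.0
t_interference_c += own_interference_increment(140.0, 4)
print(t_interference_c)  # 35.0
```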

Estimating Tinterference (2)

When a DRAM request from thread C is scheduled
  Any other thread C’ with outstanding requests incurs extra stall time
  Interference in the DRAM data bus

    Tinterference(C’) += Bus Transfer Latency of Scheduled Request

  Interference in the DRAM bank (see paper)

    Tinterference(C’) += (Bank Access Latency of Scheduled Request) / (# Banks Needed by C’ Requests * K)
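The two updates for another thread C’ can be combined into one helper. This is a sketch under assumptions: the slide defers the bank-interference condition to the paper, so the `waits_for_same_bank` flag is a stand-in for it, K is treated as an opaque scaling constant, and all numbers are hypothetical.

```python
def other_interference_increment(bus_transfer_latency: float,
                                 bank_access_latency: float,
                                 banks_needed: int,
                                 k: float,
                                 waits_for_same_bank: bool) -> float:
    """Extra stall time charged to a thread C' when another thread's request
    is scheduled."""
    # Data bus interference: the bus is shared, so C' is always charged.
    increment = bus_transfer_latency
    # Bank interference: assumed to apply only if C' waits on the same bank.
    if waits_for_same_bank:
        increment += bank_access_latency / (banks_needed * k)
    return increment


# Hypothetical values: 4-cycle bus transfer, 140-cycle bank access,
# C' needs 2 banks, K = 0.5.
bus_only = other_interference_increment(4.0, 140.0, 2, 0.5, False)
bus_and_bank = other_interference_increment(4.0, 140.0, 2, 0.5, True)
print(bus_only, bus_and_bank)  # 4.0 144.0
```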

Hardware Cost

<2KB storage cost for 8-core system with 128-entry memory request buffer
Arithmetic operations approximated
  Fixed point arithmetic
  Divisions using lookup tables
Not on the critical path
  Scheduler makes a decision only every DRAM cycle
More details in paper


Support for System Software

Supporting system-level thread weights/priorities
  Thread weights communicated to the memory controller
  Larger-weight threads should be slowed down less
    Each thread’s slowdown is scaled by its weight
    Weighted slowdown used for scheduling
      Favors threads with larger weights
  OS can choose thread weights to satisfy QoS requirements
α: Maximum tolerable unfairness set by system software
  Don’t need fairness? Set α large.
  Need strict fairness? Set α close to 1.
  Other values of α: trade-off fairness and throughput
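The slide says each thread's slowdown is scaled by its weight but does not give the formula. The sketch below assumes one plausible form, scaling only the part of the slowdown above 1.0, so that a weight of 1 leaves the slowdown unchanged. Treat the formula as an assumption rather than the scheduler's definition.

```python
def weighted_slowdown(slowdown: float, weight: float) -> float:
    """Assumed scaling: 1 + (slowdown - 1) * weight."""
    return 1.0 + (slowdown - 1.0) * weight


# Two threads with the same measured slowdown but different weights.
measured = 1.5
normal = weighted_slowdown(measured, weight=1.0)
favored = weighted_slowdown(measured, weight=4.0)
print(normal, favored)  # 1.5 3.0
```

Because the scheduler prioritizes the thread with the largest (weighted) slowdown, the larger-weight thread appears more slowed down and is serviced first, which is how larger weights translate into smaller real slowdowns.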


Evaluation Methodology

2-, 4-, 8-, 16-core systems
  x86 processor model based on Intel Pentium M
  4 GHz processor, 128-entry instruction window
  512 Kbyte per core private L2 caches
Detailed DRAM model based on Micron DDR2-800
  128-entry memory request buffer
  8 banks, 2Kbyte row buffer
  Row-hit round-trip latency: 35ns (140 cycles)
  Row-conflict latency: 70ns (280 cycles)
Benchmarks
  SPEC CPU2006 and some Windows Desktop applications
  256, 32, 3 benchmark combinations for 4-, 8-, 16-core experiments
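The cycle counts in parentheses follow from the 4 GHz processor clock (4 cycles per nanosecond). A quick check, with `ns_to_cycles` as an illustrative helper name:

```python
CPU_FREQ_GHZ = 4.0  # 4 GHz processor from the slide, i.e. 4 cycles per ns


def ns_to_cycles(latency_ns: float, freq_ghz: float = CPU_FREQ_GHZ) -> int:
    """Convert a latency in nanoseconds to processor cycles."""
    return round(latency_ns * freq_ghz)


row_hit_cycles = ns_to_cycles(35)       # row-hit round-trip latency
row_conflict_cycles = ns_to_cycles(70)  # row-conflict latency
print(row_hit_cycles, row_conflict_cycles)  # 140 280
```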

Comparison with Related Work

Baseline FR-FCFS [Rixner et al., ISCA’00]
  Unfairly penalizes non-intensive threads with low-row-buffer locality
FCFS
  Low DRAM throughput
  Unfairly penalizes non-intensive threads
FR-FCFS+Cap
  Static cap on how many younger row-hits can bypass older accesses
  Unfairly penalizes non-intensive threads
Network Fair Queueing (NFQ) [Nesbit et al., Micro’06]
  Per-thread virtual-time based scheduling
    A thread’s private virtual-time increases when its request is scheduled
    Prioritizes requests from thread with the earliest virtual-time
    Equalizes bandwidth across equal-priority threads
    Does not consider inherent performance of each thread
  Unfairly prioritizes threads with bursty access patterns over non-bursty threads (idleness problem)
  Unfairly penalizes threads with unbalanced bank usage (in paper)

Idleness/Burstiness Problem in Fair Queueing

[Figure: timeline of four threads; "Serviced" markers show which thread is serviced in each interval.]

Thread 1’s virtual time increases even though no other thread needs DRAM.
Only Thread 2 serviced in interval [t1,t2] since its virtual time is smaller than Thread 1’s.
Only Thread 3 serviced in interval [t2,t3] since its virtual time is smaller than Thread 1’s.
Only Thread 4 serviced in interval [t3,t4] since its virtual time is smaller than Thread 1’s.

Non-bursty thread suffers large performance loss even though it fairly utilized DRAM when no other thread needed it.
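The idleness problem can be reproduced with a toy virtual-time scheduler. This is a deliberately simplified model, not the NFQ algorithm itself: one request is serviced per time slot, every thread's virtual time starts at 0 and grows by 1 per serviced request, and the interval boundaries are made-up values standing in for t1 to t4.

```python
from typing import Dict, List, Tuple


def simulate_fair_queueing(active: Dict[str, Tuple[int, int]],
                           slots: int) -> List[str]:
    """Each slot, service the ready thread with the smallest virtual time.

    A thread is ready (has a pending request) during its [start, end) interval.
    """
    virtual_time = {name: 0 for name in active}
    serviced = []
    for slot in range(slots):
        ready = [name for name, (start, end) in active.items()
                 if start <= slot < end]
        if not ready:
            serviced.append("idle")
            continue
        winner = min(ready, key=lambda name: (virtual_time[name], name))
        virtual_time[winner] += 1  # virtual time grows only when serviced
        serviced.append(winner)
    return serviced


# Thread 1 issues requests continuously; Threads 2 to 4 each arrive in a burst.
active = {"T1": (0, 40), "T2": (10, 20), "T3": (20, 30), "T4": (30, 40)}
order = simulate_fair_queueing(active, slots=40)
print(order.count("T1"))  # 10
```

Thread 1 is serviced only in the first interval, when it is alone. After that, each newly arriving bursty thread has a smaller virtual time and takes every slot of its interval, so Thread 1 makes no progress for the rest of the run.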

Unfairness on 4-, 8-, 16-core Systems

[Figure: bar chart of unfairness (y-axis, 1 to 6) for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-core, 8-core, and 16-core systems; annotations: 1.27X, 1.81X, 1.26X.]

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

System Performance

[Figure: bar chart of normalized weighted speedup (y-axis, 0 to 1.1); annotations: 5.8%, 4.1%, 4.6%.]

Hmean-speedup (Throughput-Fairness Balance)

[Figure: bar chart of normalized hmean speedup (y-axis, 0 to 1.4); annotations: 10.8%, 9.5%, 11.2%.]
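The slides do not define the two performance metrics. The sketch below assumes the definitions commonly used for multiprogrammed workloads: weighted speedup is the sum of each thread's IPC relative to running alone, and hmean speedup is the harmonic mean of those relative IPCs. The IPC values are made up, and the normalization to a baseline scheduler used in the charts is not shown.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float],
                     ipc_alone: Sequence[float]) -> float:
    """Sum of per-thread IPC relative to running alone (assumed definition)."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def hmean_speedup(ipc_shared: Sequence[float],
                  ipc_alone: Sequence[float]) -> float:
    """Harmonic mean of per-thread relative IPC (assumed definition)."""
    return len(ipc_shared) / sum(a / s for s, a in zip(ipc_shared, ipc_alone))


ipc_alone = [2.0, 1.0]
balanced = [1.0, 0.5]    # both threads run at 50% of their alone IPC
unbalanced = [1.6, 0.2]  # one thread at 80%, the other at 20%

print(weighted_speedup(balanced, ipc_alone), hmean_speedup(balanced, ipc_alone))
print(weighted_speedup(unbalanced, ipc_alone), hmean_speedup(unbalanced, ipc_alone))
```

Both workloads have the same weighted speedup, but the unbalanced one has a much lower hmean speedup. This is why the harmonic mean is described as balancing throughput and fairness.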


Conclusions

A new definition of DRAM fairness: stall-time fairness
  Equal-priority threads should experience equal memory-related slowdowns
  Takes into account inherent memory performance of threads
New DRAM scheduling algorithm enforces this definition
  Flexible and configurable fairness substrate
  Supports system-level thread priorities/weights for QoS policies
Results across a wide range of workloads and systems show:
  Improving DRAM fairness also improves system throughput
  STFM provides better fairness and system performance than previously-proposed DRAM schedulers

Thank you. Questions?


Backup


Structure of the STFM Controller

Comparison using NFQ QoS Metrics

Nesbit et al. [MICRO’06] proposed the following target for quality of service:
  A thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system
  Baseline with memory bandwidth scaled down by N
We compared different DRAM schedulers’ effectiveness using this metric
  Number of violations of the above QoS target
  Harmonic mean of IPC normalized to the above baseline

Violations of the NFQ QoS Target

[Figure: bar chart of the percentage of workloads where the QoS objective is NOT satisfied (y-axis, 0% to 60%).]

Hmean Normalized IPC using NFQ Baseline

[Figure: bar chart of the harmonic mean of normalized IPC, using Nesbit's baseline (y-axis, 0 to 1.3); annotations: 10.3%, 9.1%, 7.8%, 7.3%, 5.9%, 5.1%.]

Shortcomings of the NFQ QoS Target

Low baseline (easily achievable target) for equal-priority threads
  N equal-priority threads: a thread should do better than on a system with 1/Nth of the memory bandwidth
  This target is usually very easy to achieve
    Especially when N is large
Unachievable target in some cases
  Consider two threads always accessing the same bank in an interleaved fashion: too much interference
Baseline performance very difficult to determine in a real system
  Cannot scale memory frequency arbitrarily
  Not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)

A Case Study

[Figure: bar chart of normalized memory stall time, i.e. memory slowdown (y-axis, 0 to 8), for mcf, libquantum, GemsFDTD, and astar under each scheduler.]

Unfairness:
  FR-FCFS: 7.28
  FCFS: 2.07
  FR-FCFS+Cap: 2.08
  NFQ: 1.87
  STFM: 1.27

Windows Desktop Workloads

Enforcing Thread Weights

Effect of α

Effect of Banks and Row Buffer Size
