Stall-Time Fair Memory Access Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group
Microsoft Research

Multi-Core Systems
[Figure: a multi-core chip with four cores (CORE 0 to CORE 3), each with an L2 cache, connected through a DRAM memory controller to a shared DRAM memory system (DRAM Bank 0 through DRAM Bank 7). The shared DRAM memory system is labeled "unfairness".]

DRAM Bank Operation
[Figure: a DRAM bank organized as rows and columns, with a row decoder, a column decoder, and a row buffer. The animation services four accesses in order:
  Access Address (Row 0, Column 0): the row buffer is empty, so Row 0 is brought into it
  Access Address (Row 0, Column 1): HIT
  Access Address (Row 0, Column 9): HIT
  Access Address (Row 1, Column 0): CONFLICT; Row 1 replaces Row 0 in the row buffer]
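
The access sequence on this slide can be replayed with a minimal sketch of a single bank's row buffer. The `Bank` class and its outcome labels are illustrative names, not part of the original deck, and an open-row policy is assumed.

```python
from typing import Optional


class Bank:
    """One DRAM bank with a single row buffer (open-row policy assumed)."""

    def __init__(self) -> None:
        self.open_row: Optional[int] = None  # None means the row buffer is empty

    def access(self, row: int, column: int) -> str:
        """Classify an access and update the row buffer.

        The column only selects data inside the open row, so it does not
        affect the classification.
        """
        if self.open_row == row:
            return "hit"
        outcome = "empty" if self.open_row is None else "conflict"
        self.open_row = row  # the requested row is loaded into the row buffer
        return outcome


bank = Bank()
outcomes = [bank.access(r, c) for r, c in [(0, 0), (0, 1), (0, 9), (1, 0)]]
print(outcomes)
```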

DRAM Controllers
A row-conflict memory access takes significantly longer than a row-hit access
Current controllers take advantage of the row buffer
Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA’00]
  (1) Row-hit (column) first: Service row-hit memory accesses first
  (2) Oldest-first: Then service older accesses first
This scheduling policy aims to maximize DRAM throughput
But, it is unfair when multiple threads share the DRAM system
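
The two FR-FCFS rules can be sketched as a single sort key. This is a simplified illustration for one bank that ignores DRAM timing constraints; the `Request` fields and the function name `fr_fcfs_pick` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    thread: int
    row: int
    arrival: int  # smaller value means an older request


def fr_fcfs_pick(queue: List[Request], open_row: Optional[int]) -> Request:
    """Pick the next request: row hits first, then the oldest request."""
    # False sorts before True, so requests that hit the open row come first.
    return min(queue, key=lambda r: (r.row != open_row, r.arrival))


# T1's request is the oldest, but T0's requests hit the open row (Row 0).
queue = [Request(1, 16, 0), Request(0, 0, 1), Request(0, 0, 2)]
open_row: Optional[int] = 0
order = []
while queue:
    req = fr_fcfs_pick(queue, open_row)
    queue.remove(req)
    open_row = req.row  # servicing a request leaves its row open
    order.append((req.thread, req.row))
print(order)
```

In this run the older request from thread 1 is serviced last, which is the thread-unaware behavior the slide calls unfair.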

Outline
The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

The Problem
Multiple threads share the DRAM controller
DRAM controllers are designed to maximize DRAM throughput
DRAM scheduling policies are thread-unaware and unfair
  Row-hit first: unfairly prioritizes threads with high row buffer locality
    Streaming threads
    Threads that keep on accessing the same row
  Oldest-first: unfairly prioritizes memory-intensive threads

The Problem
[Figure: a DRAM bank whose row buffer holds Row 0. The request buffer holds a long run of requests from T0 to Row 0, interleaved with requests from T1 to Row 16, Row 111, and Row 5.]
T0: streaming thread
T1: non-streaming thread
Row size: 8KB, cache block size: 64B
128 requests of T0 serviced before T1
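
The figure of 128 follows from the row and cache block sizes on the slide: a streaming thread that touches every cache block of an open row generates that many row hits, and row-hit-first scheduling services all of them before T1's row-conflict request. A small check, with illustrative constant names:

```python
ROW_SIZE_BYTES = 8 * 1024   # 8KB row, from the slide
CACHE_BLOCK_BYTES = 64      # 64B cache block, from the slide

# One request per cache block in the row.
requests_per_row = ROW_SIZE_BYTES // CACHE_BLOCK_BYTES
print(requests_per_row)
```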

Consequences of Unfairness in DRAM
Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security’07]
System throughput loss
Priority inversion at the system/OS level
Poor performance predictability
[Figure: bar chart with data labels 1.05, 1.85, 4.72, 7.74; caption: DRAM is the only shared resource]

Fairness in Shared DRAM Systems
A thread’s DRAM performance dependent on its inherent
  Row-buffer locality
  Bank parallelism
Interference between threads can destroy either or both
A fair DRAM scheduler should take into account all factors affecting each thread’s DRAM performance
  Not solely bandwidth or solely request latency
Observation: A thread’s performance degradation due to interference in DRAM mainly characterized by the extra memory-related stall-time due to contention with other threads

Stall-Time Fairness in Shared DRAM Systems
A DRAM system is fair if it slows down equal-priority threads equally
  Compared to when each thread is run alone on the same system
  Fairness notion similar to SMT [Cazorla, IEEE Micro’04][Luo, ISPASS’01], SoEMT [Gabor, Micro’06], and shared caches [Kim, PACT’04]
Tshared: DRAM-related stall-time when the thread is running with other threads
Talone: DRAM-related stall-time when the thread is running alone
Memory-slowdown = Tshared/Talone
The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance
  Considers inherent DRAM performance of each thread
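
As a small worked example of the definition, the sketch below computes Memory-slowdown for two threads and the max-to-min ratio that the deck uses as its unfairness metric. The stall-time numbers and function names are made up for illustration.

```python
from typing import Iterable


def memory_slowdown(t_shared: float, t_alone: float) -> float:
    """Memory-slowdown = Tshared / Talone."""
    return t_shared / t_alone


def unfairness(slowdowns: Iterable[float]) -> float:
    """Ratio of the largest to the smallest memory slowdown."""
    values = list(slowdowns)
    return max(values) / min(values)


# Hypothetical stall times in cycles: (Tshared, Talone) per thread.
stall = {"T0": (150.0, 100.0), "T1": (300.0, 100.0)}
slowdowns = {t: memory_slowdown(s, a) for t, (s, a) in stall.items()}
print(slowdowns, unfairness(slowdowns.values()))
```

A perfectly stall-time-fair system would give every thread the same slowdown, so the ratio would be 1.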

STFM Scheduling Algorithm (1)
During each time interval, for each thread, DRAM controller
  Tracks Tshared
  Estimates Talone
At the beginning of a scheduling cycle, DRAM controller
  Computes Slowdown = Tshared/Talone for each thread with an outstanding legal request
  Computes unfairness = MAX Slowdown / MIN Slowdown
If unfairness < α
  Use DRAM throughput oriented baseline scheduling policy
    (1) row-hit first
    (2) oldest-first

STFM Scheduling Algorithm (2)
If unfairness ≥ α
  Use fairness-oriented scheduling policy
    (1) requests from thread with MAX Slowdown first
    (2) row-hit first
    (3) oldest-first
Maximizes DRAM throughput if it cannot improve fairness
Does NOT waste useful bandwidth to improve fairness
  If a request does not interfere with any other, it is scheduled
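
The two-mode policy on these slides can be sketched as one selection function. This is a simplified single-bank illustration, not the controller's actual logic: the unfairness threshold is passed in as `alpha`, slowdown estimates are supplied by the caller, DRAM timing legality is ignored, and all names and numbers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    thread: int
    row: int
    arrival: int  # smaller value means an older request


def stfm_pick(queue: List[Request], open_row: Optional[int],
              slowdown: Dict[int, float], alpha: float) -> Request:
    """Pick the next request using the throughput or fairness rule set."""
    threads = {r.thread for r in queue}  # threads with outstanding requests
    values = [slowdown[t] for t in threads]
    unfairness = max(values) / min(values)
    if unfairness < alpha:
        # Baseline policy: row-hit first, then oldest-first.
        return min(queue, key=lambda r: (r.row != open_row, r.arrival))
    # Fairness policy: most slowed-down thread first, then the baseline rules.
    slowest = max(threads, key=lambda t: slowdown[t])
    return min(queue, key=lambda r: (r.thread != slowest,
                                     r.row != open_row, r.arrival))


queue = [Request(1, 16, 0), Request(0, 0, 1)]
fair_enough = stfm_pick(queue, 0, {0: 1.03, 1: 1.06}, alpha=1.10)
too_unfair = stfm_pick(queue, 0, {0: 1.03, 1: 1.14}, alpha=1.10)
print(fair_enough.thread, too_unfair.thread)
```

With the smaller slowdown gap the row hit from thread 0 wins; once the ratio reaches the threshold, thread 1's row-conflict request is serviced first.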

How Does STFM Prevent Unfairness?
[Figure: animation of the request-buffer example (T0 streaming to Row 0; T1 accessing Row 16, Row 111, and Row 5) under STFM. T0 Slowdown, T1 Slowdown, and Unfairness all start at 1.00 and are updated as requests are serviced, with displayed values staying between 1.00 and 1.14. The row buffer is shown holding Row 0, Row 16, and Row 111 over the course of the animation.]

Implementation
Tracking Tshared
  Relatively easy
  The processor increases a counter if the thread cannot commit instructions because the oldest instruction requires DRAM access
Estimating Talone
  More involved because thread is not running alone
  Difficult to estimate directly
  Observation: Talone = Tshared - Tinterference
  Estimate Tinterference: Extra stall-time due to interference
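
A minimal sketch of the per-thread bookkeeping described on this slide: a stall counter for Tshared, an accumulated Tinterference, and the derived slowdown. The class name, the per-cycle `tick` interface, and the fallback value of 1.0 when the Talone estimate is not positive are assumptions, not details from the deck.

```python
class ThreadStallRegisters:
    """Per-thread state for estimating memory slowdown."""

    def __init__(self) -> None:
        self.t_shared = 0            # cycles stalled on DRAM while sharing
        self.t_interference = 0.0    # estimated extra stall due to other threads

    def tick(self, stalled_on_dram: bool) -> None:
        """Call once per cycle; counts cycles where commit is blocked on DRAM."""
        if stalled_on_dram:
            self.t_shared += 1

    def slowdown(self) -> float:
        """Slowdown = Tshared / Talone, with Talone = Tshared - Tinterference."""
        t_alone = self.t_shared - self.t_interference
        if t_alone <= 0:
            return 1.0  # assumed fallback when no usable estimate exists yet
        return self.t_shared / t_alone


regs = ThreadStallRegisters()
for cycle in range(1000):
    regs.tick(stalled_on_dram=(cycle < 400))  # stalled for 400 of 1000 cycles
regs.t_interference = 100.0                   # hypothetical estimate
print(regs.t_shared, regs.slowdown())
```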

Estimating Tinterference (1)
When a DRAM request from thread C is scheduled
  Thread C can incur extra stall time:
    The request’s row buffer hit status might be affected by interference
    Estimate the row that would have been in the row buffer if the thread were running alone
    Estimate the extra bank access latency the request incurs
  Tinterference(C) += (Extra Bank Access Latency) / (# Banks Servicing C’s Requests)
  Extra latency amortized across outstanding accesses of thread C (memory level parallelism)

Estimating Tinterference (2)
When a DRAM request from thread C is scheduled
  Any other thread C’ with outstanding requests incurs extra stall time
  Interference in the DRAM data bus
    Tinterference(C’) += Bus Transfer Latency of Scheduled Request
  Interference in the DRAM bank (see paper)
    Tinterference(C’) += (Bank Access Latency of Scheduled Request) / (# Banks Needed by C’ Requests * K)
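
The update rules from the two Tinterference slides can be written as two small functions. This is a sketch under stated assumptions: latencies are in arbitrary cycles, `k = 0.5` is a placeholder value for the constant K, and `bank_victims` is a hypothetical input naming the threads the caller considers delayed in the bank, since the deck defers that condition to the paper.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Set


def charge_own_thread(t_int: DefaultDict[int, float], c: int,
                      extra_bank_latency: float, banks_servicing_c: int) -> None:
    """Extra latency of C's own request, amortized over its bank parallelism."""
    t_int[c] += extra_bank_latency / banks_servicing_c


def charge_other_threads(t_int: DefaultDict[int, float], c: int,
                         bus_latency: float, bank_latency: float,
                         banks_needed: Dict[int, int],
                         bank_victims: Set[int], k: float) -> None:
    """Charge every other thread with outstanding requests for C's request."""
    for other, n_banks in banks_needed.items():
        if other == c or n_banks == 0:
            continue  # skip C itself and threads with nothing outstanding
        t_int[other] += bus_latency  # data bus interference
        if other in bank_victims:    # bank interference, amortized
            t_int[other] += bank_latency / (n_banks * k)


t_int: DefaultDict[int, float] = defaultdict(float)
charge_own_thread(t_int, c=0, extra_bank_latency=140, banks_servicing_c=2)
charge_other_threads(t_int, c=0, bus_latency=4, bank_latency=140,
                     banks_needed={0: 2, 1: 2}, bank_victims={1}, k=0.5)
print(dict(t_int))
```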

Hardware Cost
<2KB storage cost for 8-core system with 128-entry memory request buffer
Arithmetic operations approximated
  Fixed point arithmetic
  Divisions using lookup tables
Not on the critical path
  Scheduler makes a decision only every DRAM cycle
More details in paper
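
One way such an approximation could look is a fixed-point division that replaces the divide with a reciprocal lookup. The bit widths, table size, and scaling below are illustrative assumptions; the deck does not specify the hardware's actual formats.

```python
FRAC_BITS = 8    # result has 8 fractional bits (Q8), an assumed format
TABLE_BITS = 8   # reciprocal table indexed by an 8-bit divisor

# RECIP[d] holds 1/d with 16 fractional bits; index 0 is unused.
RECIP = [0] + [round((1 << (2 * FRAC_BITS)) / d)
               for d in range(1, 1 << TABLE_BITS)]


def slowdown_fixed(t_shared: int, t_alone: int) -> int:
    """Approximate t_shared / t_alone as a Q8 fixed-point integer."""
    if t_alone <= 0:
        raise ValueError("t_alone must be positive")
    # Scale both operands down together until the divisor fits the table.
    while t_alone >= (1 << TABLE_BITS):
        t_shared >>= 1
        t_alone >>= 1
    # Multiply by the Q16 reciprocal, then drop 8 bits to get a Q8 result.
    return (t_shared * RECIP[t_alone]) >> FRAC_BITS


approx = slowdown_fixed(400, 300) / (1 << FRAC_BITS)
print(approx)
```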

Support for System Software
Supporting system-level thread weights/priorities
  Thread weights communicated to the memory controller
  Larger-weight threads should be slowed down less
  Each thread’s slowdown is scaled by its weight
  Weighted slowdown used for scheduling
    Favors threads with larger weights
  OS can choose thread weights to satisfy QoS requirements
α: Maximum tolerable unfairness set by system software
  Don’t need fairness? Set α large.
  Need strict fairness? Set α close to 1.
  Other values of α: trade-off fairness and throughput
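
The slide does not give the scaling formula. One plausible form, assumed here for illustration, multiplies only the part of the slowdown above 1.0 by the thread's weight, so a larger-weight thread looks more slowed down to the scheduler and is prioritized sooner.

```python
def weighted_slowdown(slowdown: float, weight: float) -> float:
    """Assumed scaling: 1 + (slowdown - 1) * weight."""
    return 1.0 + (slowdown - 1.0) * weight


# Two threads with the same measured slowdown but different weights.
low_priority = weighted_slowdown(1.5, weight=1.0)
high_priority = weighted_slowdown(1.5, weight=4.0)
print(low_priority, high_priority)
```

Under this assumption a thread with weight 1 is unchanged, and a thread that is not slowed down at all stays at 1.0 regardless of its weight.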

Evaluation Methodology
2-, 4-, 8-, 16-core systems
  x86 processor model based on Intel Pentium M
  4 GHz processor, 128-entry instruction window
  512 Kbyte per core private L2 caches
Detailed DRAM model based on Micron DDR2-800
  128-entry memory request buffer
  8 banks, 2Kbyte row buffer
  Row-hit round-trip latency: 35ns (140 cycles)
  Row-conflict latency: 70ns (280 cycles)
Benchmarks
  SPEC CPU2006 and some Windows Desktop applications
  256, 32, 3 benchmark combinations for 4-, 8-, 16-core experiments

Comparison with Related Work
Baseline FR-FCFS [Rixner et al., ISCA’00]
  Unfairly penalizes non-intensive threads with low-row-buffer locality
FCFS
  Low DRAM throughput
  Unfairly penalizes non-intensive threads
FR-FCFS+Cap
  Static cap on how many younger row-hits can bypass older accesses
  Unfairly penalizes non-intensive threads
Network Fair Queueing (NFQ) [Nesbit et al., Micro’06]
  Per-thread virtual-time based scheduling
    A thread’s private virtual-time increases when its request is scheduled
    Prioritizes requests from thread with the earliest virtual-time
    Equalizes bandwidth across equal-priority threads
    Does not consider inherent performance of each thread
  Unfairly prioritizes threads with bursty access patterns (idleness problem)
  Unfairly penalizes threads with unbalanced bank usage (in paper)

Idleness/Burstiness Problem in Fair Queueing
[Figure: timeline of DRAM service for Threads 1 through 4 across the intervals bounded by t1, t2, t3, and t4, with the serviced thread marked in each interval]
Thread 1’s virtual time increases even though no other thread needs DRAM
Only Thread 2 serviced in interval [t1,t2] since its virtual time is smaller than Thread 1’s
Only Thread 3 serviced in interval [t2,t3] since its virtual time is smaller than Thread 1’s
Only Thread 4 serviced in interval [t3,t4] since its virtual time is smaller than Thread 1’s
Non-bursty thread suffers large performance loss even though it fairly utilized DRAM when no other thread needed it
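
The scenario on this slide can be reproduced with a deliberately simplified virtual-time model: one request is serviced per time slot, a thread's virtual time grows by one each time it is serviced, and the ready thread with the smallest virtual time wins. The slot counts and the `simulate` helper are assumptions for illustration and do not model NFQ in full.

```python
from typing import Dict, List, Optional, Tuple


def simulate(arrivals: Dict[int, Tuple[int, int]], slots: int) -> List[Optional[int]]:
    """arrivals maps thread -> (first active slot, number of requests)."""
    vt = {t: 0 for t in arrivals}                       # per-thread virtual time
    left = {t: n for t, (_, n) in arrivals.items()}     # requests remaining
    served: List[Optional[int]] = []
    for now in range(slots):
        ready = [t for t, (start, _) in arrivals.items()
                 if start <= now and left[t] > 0]
        if not ready:
            served.append(None)
            continue
        t = min(ready, key=lambda x: (vt[x], x))        # earliest virtual time
        vt[t] += 1
        left[t] -= 1
        served.append(t)
    return served


# Thread 1 is non-bursty; threads 2, 3, 4 each arrive later with a burst of 10.
arrivals = {1: (0, 40), 2: (10, 10), 3: (20, 10), 4: (30, 10)}
served = simulate(arrivals, 40)
print(served[:10], served[10:20], served[20:30], served[30:40])
```

Thread 1 is serviced only while it is alone; once the bursty threads arrive with smaller virtual times, it receives no service for the remaining 30 slots.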

Unfairness on 4-, 8-, 16-core Systems
[Figure: bar chart of unfairness (y-axis from 1 to 6) for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-core, 8-core, and 16-core systems; annotations: 1.27X, 1.81X, 1.26X]
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

System Performance
[Figure: bar chart of normalized weighted speedup (y-axis from 0 to 1.1) for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-core, 8-core, and 16-core systems; annotations: 5.8%, 4.1%, 4.6%]

Hmean-speedup (Throughput-Fairness Balance)
[Figure: bar chart of normalized hmean speedup (y-axis from 0 to 1.4) for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-core, 8-core, and 16-core systems; annotations: 10.8%, 9.5%, 11.2%]
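
The slides name the two metrics without defining them. The sketch below uses their commonly cited definitions, stated here as an assumption about what the charts plot: weighted speedup sums each thread's IPC relative to running alone, and hmean speedup is the harmonic mean of those per-thread ratios. The IPC values are made up.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum of per-thread speedups IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def hmean_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Harmonic mean of per-thread speedups IPC_shared / IPC_alone."""
    n = len(ipc_shared)
    return n / sum(a / s for s, a in zip(ipc_shared, ipc_alone))


ipc_alone = [2.0, 2.0]
unbalanced = [1.0, 0.5]    # one thread slowed much more than the other
balanced = [0.75, 0.75]    # same total throughput, equal slowdowns

print(weighted_speedup(unbalanced, ipc_alone), hmean_speedup(unbalanced, ipc_alone))
print(weighted_speedup(balanced, ipc_alone), hmean_speedup(balanced, ipc_alone))
```

Both cases have the same weighted speedup, but the balanced case has the higher hmean speedup, which is why the slide describes the metric as a throughput-fairness balance.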

Conclusions
A new definition of DRAM fairness: stall-time fairness
  Equal-priority threads should experience equal memory-related slowdowns
  Takes into account inherent memory performance of threads
New DRAM scheduling algorithm enforces this definition
  Flexible and configurable fairness substrate
  Supports system-level thread priorities/weights QoS policies
Results across a wide range of workloads and systems show:
  Improving DRAM fairness also improves system throughput
  STFM provides better fairness and system performance than previously-proposed DRAM schedulers
Thank you. Questions?
Backup
Structure of the STFM Controller

Comparison using NFQ QoS Metrics
Nesbit et al. [MICRO’06] proposed the following target for quality of service:
  A thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system
  Baseline with memory bandwidth scaled down by N
We compared different DRAM schedulers’ effectiveness using this metric
  Number of violations of the above QoS target
  Harmonic mean of IPC normalized to the above baseline

Violations of the NFQ QoS Target
[Figure: bar chart of the percentage of workloads where the QoS objective is NOT satisfied (y-axis from 0% to 60%) for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-core, 8-core, and 16-core systems]

Hmean Normalized IPC using NFQ Baseline
[Figure: bar chart of the hmean of normalized IPC, using Nesbit's baseline (y-axis from 0 to 1.3), for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-core, 8-core, and 16-core systems; annotations: 10.3%, 9.1%, 7.8% and 7.3%, 5.9%, 5.1%]

Shortcomings of the NFQ QoS Target
Low baseline (easily achievable target) for equal-priority threads
  N equal-priority threads: a thread should do better than on a system with 1/Nth of the memory bandwidth
  This target is usually very easy to achieve
    Especially when N is large
Unachievable target in some cases
  Consider two threads always accessing the same bank in an interleaved fashion: too much interference
Baseline performance very difficult to determine in a real system
  Cannot scale memory frequency arbitrarily
  Not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)

A Case Study
[Figure: bar chart of normalized memory stall time (memory slowdown; y-axis from 0 to 8) for mcf, libquantum, GemsFDTD, and astar under each scheduler]
Unfairness: FR-FCFS 7.28, FCFS 2.07, FR-FCFS+Cap 2.08, NFQ 1.87, STFM 1.27

Windows Desktop Workloads

Enforcing Thread Weights

Effect of α

Effect of Banks and Row Buffer Size