Tanima ISPASS2011 presentation (ispass.org/ispass2011/slides/2_4.pdf)
Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
ISPASS 2011

Motivation
The number of cores doubles every 18 months
Expected: performance proportional to the number of cores
One of the bottlenecks is shared-resource contention
For multi-threaded workloads, contention is unavoidable
To reduce contention, it is necessary to understand where and how the contention is created

Shared-Resource Contention in Chip-Multiprocessors
[Figure: Intel Quad Core Q9550. Threads of Application 1 and Application 2 run on cores C0-C3; each core has an L1 cache; two L2 caches; Front-Side Bus; Memory.]

Scenario 1: Multi-threaded applications with co-runner
[Figure: threads of Application 1 and Application 2 on cores C0-C3; each core has an L1 cache; two L2 caches; Memory.]

Scenario 2: Multi-threaded applications without co-runner
[Figure: threads of a single application on cores C0-C3; each core has an L1 cache; two L2 caches; Memory.]

Shared-Resource Contention
Intra-application contention
  Contention among threads from the same application (no co-runners)
Inter-application contention
  Contention among threads from the co-running application

Contributions
A general methodology to evaluate a multi-threaded application's performance
  Intra-application contention
  Inter-application contention
  Contention in the memory-hierarchy shared resources
Characterizing applications facilitates better understanding of the application's resource sensitivity
Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

Outline
Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary

Methodology
Designed to measure both intra- and inter-application contention for a targeted shared resource
  L1-cache, L2-cache
  Front Side Bus (FSB)
Each application is run in two configurations
  Baseline: threads do not share the targeted resource
  Contention: threads share the targeted resource
  Multiple number of targeted resources
Determine contention by comparing performance (gathering hardware performance counters' values)
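
The baseline and contention configurations can be pictured as a thread-placement choice. The following is a minimal sketch, not the authors' tooling. It assumes a hypothetical quad-core topology in which cores 0 and 1 share one L2 cache and cores 2 and 3 share the other; the slides do not give the actual core numbering. The names `L2_OF_CORE` and `placements` are illustrative.

```python
from itertools import combinations

# Hypothetical topology: core id -> id of the L2 cache that core uses.
# Assumption for illustration only; real numbering is machine-specific.
L2_OF_CORE = {0: 0, 1: 0, 2: 1, 3: 1}


def placements(resource_of_core, n_threads=2):
    """Return (baseline, contention) core tuples for n_threads threads.

    baseline:   every thread uses a different instance of the resource
    contention: all threads use the same instance of the resource
    """
    baseline = contention = None
    for cores in combinations(sorted(resource_of_core), n_threads):
        used = {resource_of_core[c] for c in cores}
        if baseline is None and len(used) == n_threads:
            baseline = cores
        if contention is None and len(used) == 1:
            contention = cores
    return baseline, contention


baseline, contention = placements(L2_OF_CORE)
print("baseline cores:", baseline)      # threads behind different L2 caches
print("contention cores:", contention)  # threads behind the same L2 cache
```

One way to enforce such a placement on Linux is to restrict the process to the chosen cores with `os.sched_setaffinity(0, cores)` before the benchmark starts. Whether the original experiments used this mechanism is not stated in the slides.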

Outline
Motivation
Contributions
Methodology
Measuring intra-application contention (See paper)
Measuring inter-application contention
Related Work
Summary

Measuring inter-application contention: L1-cache
[Figure: Baseline Configuration and Contention Configuration side by side. Each shows threads of Application 1 and Application 2 placed on cores C0-C3, each core with an L1 cache, two L2 caches, and Memory.]

Measuring inter-application contention: L2-cache
[Figure: Baseline Configuration and Contention Configuration side by side. Each shows threads of Application 1 and Application 2 placed on cores C0-C3, each core with an L1 cache, two L2 caches, and Memory.]

Measuring inter-application contention: FSB
[Figure: Baseline Configuration. Threads of Application 1 and Application 2 placed on cores C0-C7 (drawn as C0, C2, C4, C6 and C1, C3, C5, C7), each core with an L1 cache; four L2 caches; Memory.]

Measuring inter-application contention: FSB
[Figure: Contention Configuration. Threads of Application 1 and Application 2 placed on cores C0-C7 (drawn as C0, C2, C4, C6 and C1, C3, C5, C7), each core with an L1 cache; four L2 caches; Memory.]

PARSEC Benchmarks
Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

Experimental platform
Platform 1: Yorkfield
Intel Quad core Q9550
32 KB L1-D and L1-I cache
6MB L2-cache
2GB Memory
Common FSB
[Figure: cores C0-C3, each with an L1 cache and an L1 HW-PF; two L2 caches, each with an L2 HW-PF and an FSB interface; FSB; Memory Controller Hub (Northbridge); MB; Memory.]

Experimental platform
Platform 2: Harpertown
[Figure: cores C0-C7 (drawn as C0, C2, C4, C6 and C1, C3, C5, C7), each with an L1 cache and an L1 HW-PF; four L2 caches, each with an L2 HW-PF and an FSB interface; two FSBs; Memory Controller Hub (Northbridge); MB; Memory.]

Performance Analysis: Inter-application contention
For the i-th co-runner:
\[ \mathrm{PercentPerformanceDifference}_i = \frac{(\mathrm{PerformanceBase}_i - \mathrm{PerformanceContend}_i) \times 100}{\mathrm{PerformanceBase}_i} \]
Absolute performance difference sum:
\[ \mathrm{APDS} = \sum_i \left| \mathrm{PercentPerformanceDifference}_i \right| \]
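
The two metrics are simple to compute once the baseline and contention measurements are available. The sketch below follows the formulas on this slide. The function names are illustrative and the input numbers are made up for the example; they are not results from the talk.

```python
def percent_performance_difference(base, contend):
    """Percent difference between a baseline and a contention measurement."""
    return (base - contend) * 100.0 / base


def apds(pairs):
    """Absolute performance difference sum over (base, contend) pairs,
    one pair per co-runner."""
    return sum(abs(percent_performance_difference(b, c)) for b, c in pairs)


# Made-up measurements for three co-runners (not data from the talk).
runs = [(200.0, 190.0), (200.0, 210.0), (100.0, 100.0)]
diffs = [percent_performance_difference(b, c) for b, c in runs]
print(diffs)       # [5.0, -5.0, 0.0]
print(apds(runs))  # 10.0
```

Taking absolute values before summing keeps positive and negative differences from cancelling, so APDS reflects the total magnitude of the effect across co-runners.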

Inter-application contention: L1-cache, for Streamcluster
[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), from -8 to 8. X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264).]

Inter-application L1-cache contention: Streamcluster
[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), from -8 to 8. X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264).]

Inter-application contention: L1-cache

Inter-application contention: L2-cache

Inter-application contention: FSB

Characterization
Benchmarks      L1-cache   L2-cache       FSB
Blackscholes    none       none           none
Bodytrack       inter      inter          intra
Canneal         intra      inter          intra
Dedup           inter      intra, inter   intra, inter
Facesim         inter      inter          intra
Ferret          intra      intra, inter   intra
Fluidanimate    inter      inter          intra
Raytrace        none       none           intra
Streamcluster   inter      inter          intra
Swaptions       none       none           none
Vips            intra      inter          inter
X264            inter      intra, inter   intra
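
The characterization table can be held as a small lookup structure, for example when choosing benchmarks that stress a particular resource. The sketch below transcribes the table from this slide. The names `TABLE`, `RESOURCES`, and `benchmarks_with` are illustrative and not part of the original work.

```python
# Columns of the characterization table, in slide order.
RESOURCES = ("L1", "L2", "FSB")

# benchmark -> contention type(s) per resource; "" stands for "none".
TABLE = {
    "Blackscholes":  ("", "", ""),
    "Bodytrack":     ("inter", "inter", "intra"),
    "Canneal":       ("intra", "inter", "intra"),
    "Dedup":         ("inter", "intra inter", "intra inter"),
    "Facesim":       ("inter", "inter", "intra"),
    "Ferret":        ("intra", "intra inter", "intra"),
    "Fluidanimate":  ("inter", "inter", "intra"),
    "Raytrace":      ("", "", "intra"),
    "Streamcluster": ("inter", "inter", "intra"),
    "Swaptions":     ("", "", ""),
    "Vips":          ("intra", "inter", "inter"),
    "X264":          ("inter", "intra inter", "intra"),
}


def benchmarks_with(resource, kind):
    """Benchmarks showing the given contention kind ("intra" or "inter")
    for the given resource ("L1", "L2", or "FSB"), sorted by name."""
    col = RESOURCES.index(resource)
    return sorted(n for n, row in TABLE.items() if kind in row[col].split())


print(benchmarks_with("FSB", "inter"))  # ['Dedup', 'Vips']
print(benchmarks_with("L1", "intra"))   # ['Canneal', 'Ferret', 'Vips']
```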

Summary
The methodology generalizes contention analysis of multi-threaded applications
  New approach to characterize applications
  Useful for performance analysis of existing and future architecture or benchmarks
  Helpful for creating new workloads of diverse properties
  Provides insights for designing improved contention-aware scheduling methods

Related Work
Cache contention
  Knauerhase et al., IEEE Micro 2008
  Zhuravlev et al., ASPLOS 2010
  Xie et al., CMP-MSI 2008
  Mars et al., HiPEAC 2011
Characterizing parallel workload
  Jin et al., NASA Technical Report 2009
PARSEC benchmark suite
  Bienia et al., PACT 2008
  Bhadauria et al., IISWC 2009

Thank you!