Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso ,...

Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa

Department of Computer Science, University of Virginia

ISPASS 2011

Motivation

The number of cores doubles every 18 months
Expected: performance proportional to the number of cores
One of the bottlenecks is shared-resource contention

For multi-threaded workloads, contention is unavoidable

To reduce contention, it is necessary to understand where and how the contention is created

Shared Resource Contention in Chip-Multiprocessors

[Figure: Intel Quad Core Q9550. Cores C0-C3, each with its own L1 cache; two L2 caches; front-side bus; memory. The threads of Application 1 run on the cores.]

Scenario 1: Multi-threaded applications with co-runner

[Figure: cores C0-C3 with L1 caches, two L2 caches, and memory. The legend adds the threads of Application 2, the co-runner.]

Scenario 2: Multi-threaded applications without co-runner

[Figure: cores C0-C3 with L1 caches, two L2 caches, and memory. Only the threads of a single application are shown.]

Shared-Resource Contention

Intra-application contention
  Contention among threads from the same application (no co-runners)

Inter-application contention
  Contention among threads from the co-running application

Contributions

A general methodology to evaluate a multi-threaded application's performance
  Intra-application contention
  Inter-application contention
  Contention in the memory-hierarchy shared resources

Characterizing applications facilitates better understanding of the application's resource sensitivity

Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary

Methodology

Designed to measure both intra- and inter-application contention for a targeted shared resource
  L1-cache, L2-cache
  Front Side Bus (FSB)

Each application is run in two configurations
  Baseline: threads do not share the targeted resource
  Contention: threads share the targeted resource

Multiple instances of the targeted resource

Determine contention by comparing performance (gathering hardware performance counters' values)
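The two configurations can be illustrated with a small thread-placement sketch. It is not the authors' tooling: the core-to-L2 grouping below is an assumption that mirrors the quad-core diagram (C0/C1 behind one L2, C2/C3 behind the other), the helper names `l2_placements` and `pin_current_process` are hypothetical, and the operating system's core numbering may not match the diagram on a real machine.

```python
import os

# Assumed topology, mirroring the quad-core diagram: C0/C1 sit behind one
# L2 cache and C2/C3 behind the other. Real core numbering may differ.
L2_GROUPS = [(0, 1), (2, 3)]


def l2_placements(groups):
    """Return (baseline, contention) core sets for a two-thread run."""
    baseline = {groups[0][0], groups[1][0]}  # one thread per L2: no sharing
    contention = set(groups[0])              # both threads share one L2
    return baseline, contention


def pin_current_process(cores):
    """Restrict the calling process to the given cores (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)


baseline, contention = l2_placements(L2_GROUPS)
print("baseline cores:", sorted(baseline))
print("contention cores:", sorted(contention))
```

Running the same application once under each core set, and comparing the measured performance, gives the baseline-versus-contention comparison the slide describes.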

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention (see paper)
Measuring inter-application contention
Related Work
Summary

Measuring inter-application contention: L1-cache

[Figure: two panels, Baseline Configuration and Contention Configuration, each showing thread placement on cores C0-C3 with their L1 caches, the L2 caches, and memory.]

Measuring inter-application contention: L2-cache

[Figure: two panels showing thread placement on cores C0-C3 with L1 caches, L2 caches, and memory.]

Measuring inter-application contention: FSB

[Figure: thread placement on the eight-core platform: cores C0, C2, C4, C6 and C1, C3, C5, C7 with their L1 caches, four L2 caches, and memory.]

Measuring intra-application contention: FSB

[Figure: thread placement on the eight-core platform: cores C0, C2, C4, C6 and C1, C3, C5, C7 with their L1 caches, four L2 caches, and memory.]

PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

Experimental platform

Platform 1: Yorkfield
  Intel Quad core Q9550
  32 KB L1-D and L1-I cache
  6MB L2-cache
  2GB Memory
  Common FSB

[Figure: cores C0-C3, each with an L1 cache and an L1 hardware prefetcher (HW-PF); two L2 caches, each with an L2 HW-PF and an FSB interface; the FSB connects to the Memory Controller Hub (Northbridge), which connects to memory (link labeled MB).]

Experimental platform

Platform 2: Harpertown

[Figure: cores C0-C7, each with an L1 cache and an L1 HW-PF; four L2 caches, each with an L2 HW-PF and an FSB interface; two FSBs connect to the Memory Controller Hub (Northbridge), which connects to memory (link labeled MB).]

Performance Analysis: Inter-application contention

For the i-th co-runner:
  PercentPerformanceDifference_i = (PerformanceBase_i - PerformanceContend_i) * 100 / PerformanceBase_i

Absolute performance difference sum:
  APDS = Σ_i abs(PercentPerformanceDifference_i)
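As a worked illustration of the two formulas, the sketch below computes the percent performance difference for each co-runner and sums the absolute values into APDS. The function names and the sample numbers are made up for the example; the slides do not say which performance metric is used, so the inputs are treated as plain numbers.

```python
def percent_performance_difference(base, contend):
    """Percent difference of the contention run relative to the baseline run."""
    return (base - contend) * 100.0 / base


def apds(runs):
    """Absolute performance difference sum over (base, contend) pairs,
    one pair per co-runner."""
    return sum(abs(percent_performance_difference(b, c)) for b, c in runs)


# Hypothetical measurements for two co-runners (not data from the talk).
runs = [(200.0, 190.0), (100.0, 104.0)]
for base, contend in runs:
    print(percent_performance_difference(base, contend))
print("APDS:", apds(runs))
```

Because APDS sums absolute values, a co-runner that changes the measured value in either direction adds to the total rather than cancelling out another co-runner.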

Inter-application contention: L1-cache, for Streamcluster

[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), from -8 to 8. X-axis: Co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264).]

Inter-application L1-cache contention: Streamcluster

[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), from -8 to 8. X-axis: Co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264).]

Inter-application contention: L1-cache

Inter-application contention: L2-cache

Inter-application contention: FSB

Characterization

Benchmarks       L1-cache    L2-cache        FSB
Blackscholes     none        none            none
Bodytrack        inter       inter           intra
Canneal          intra       inter           intra
Dedup            inter       intra, inter    intra, inter
Facesim          inter       inter           intra
Ferret           intra       intra, inter    intra
Fluidanimate     inter       inter           intra
Raytrace         none        none            intra
Streamcluster    inter       inter           intra
Swaptions        none        none            none
Vips             intra       inter           inter
X264             inter       intra, inter    intra
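One way to put the characterization to use is to query it programmatically, for example to find which benchmarks are sensitive to a given kind of contention on a given resource. The sketch below transcribes the table as it appears on the slide; the names `RESOURCES`, `CHARACTERIZATION`, and `sensitive_to` are hypothetical and not part of the talk.

```python
RESOURCES = ("L1-cache", "L2-cache", "FSB")

# Rows transcribed from the characterization table (L1-cache, L2-cache, FSB).
CHARACTERIZATION = {
    "Blackscholes": ("none", "none", "none"),
    "Bodytrack": ("inter", "inter", "intra"),
    "Canneal": ("intra", "inter", "intra"),
    "Dedup": ("inter", "intra, inter", "intra, inter"),
    "Facesim": ("inter", "inter", "intra"),
    "Ferret": ("intra", "intra, inter", "intra"),
    "Fluidanimate": ("inter", "inter", "intra"),
    "Raytrace": ("none", "none", "intra"),
    "Streamcluster": ("inter", "inter", "intra"),
    "Swaptions": ("none", "none", "none"),
    "Vips": ("intra", "inter", "inter"),
    "X264": ("inter", "intra, inter", "intra"),
}


def sensitive_to(resource, kind):
    """List benchmarks whose entry for `resource` includes `kind`
    ("intra" or "inter"), in table order."""
    col = RESOURCES.index(resource)
    return [name for name, row in CHARACTERIZATION.items()
            if kind in row[col].split(", ")]


print(sensitive_to("FSB", "inter"))
print(sensitive_to("L1-cache", "intra"))
```

A contention-aware scheduler could use a lookup of this kind as one input, for instance to avoid pairing two benchmarks that both show inter-application contention on the same resource.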

Summary

The methodology generalizes contention analysis of multi-threaded applications
  New approach to characterize applications

Useful for performance analysis of existing and future architecture or benchmarks

Helpful for creating new workloads of diverse properties

Provides insights for designing improved contention-aware scheduling methods

Related Work

Cache contention
  Knauerhase et al., IEEE Micro 2008
  Zhuravlev et al., ASPLOS 2010
  Xie et al., CMP-MSI 2008
  Mars et al., HiPEAC 2011

Characterizing parallel workload
  Jin et al., NASA Technical Report 2009

PARSEC benchmark suite
  Bienia et al., PACT 2008
  Bhadauria et al., IISWC 2009

Thank you!
