Threads vs. Caches: Modeling the Behavior of Parallel Workloads
Zvika Guz (1), Oved Itzhak (1), Idit Keidar (1), Avinoam Kolodny (1), Avi Mendelson (2), and Uri C. Weiser (1)
(1) Technion – Israel Institute of Technology, (2) Microsoft Corporation
Chip-Multiprocessor Era
Challenges:
  Single-core performance trend is gloomy
    Exploit chip-multiprocessors with multithreaded applications
  The memory gap is paramount
    Latency, bandwidth, power
[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]
Two basic remedies:
  Cache: reduce the number of out-of-die memory accesses
  Multi-threading: hide memory accesses behind threads execution
How do they play together?
How do we make the most out of them?
Outline
The many-core span
  Cache-Machines ↔ MT-Machines
A high-level analytical model
Performance curves study
Few examples
Summary
Cache-Machines vs. MT-Machines
Many-Core: CMP with many, simple cores
  Tens to hundreds of Processing Elements (PEs)
[Figure: design space plotted as # of Threads vs. Cache/Thread, marking the Uni-Processor Region, Multi-Core Region, Cache Architecture Region, and MT Architecture Region; example machines shown: Intel's Larrabee, Nvidia's GT200, Nvidia's Fermi]
What are the basic tradeoffs?
How will workloads behave across the range?
Predicting performance
A Unified Machine Model
Use both cache and many threads to shield memory access
The uniform framework renders the comparison meaningful
We derive simple, parameterized equations for performance, power, BW, ...
[Figure: an array of cores (C) with threads architectural states, behind a cache that connects to memory]
Cache Machines
Many cores (each may have its private L1) behind a shared cache
[Figure: an array of cores (C) behind a shared cache that connects to memory]
[Plot: Performance vs. # Threads, with the Cache Non Effective point (CNE) marked]
Multi-Thread Machines
Memory latency shielded by multiple thread execution
[Figure: an array of cores (C) with threads architectural states, connected to memory; legend: execution, memory access]
[Plot: Performance vs. # Threads, rising to max performance and capped by bandwidth limitations]
Analysis (1/3)
Given a ratio of memory access instructions r_m (0 ≤ r_m ≤ 1):
  One in every 1/r_m instructions accesses memory
  A thread executes 1/r_m instructions, then stalls for t_avg cycles
  t_avg = Average Memory Access Time (AMAT) [cycles]
[Figure: timeline (t [cycles]) of one thread context: (1/r_m)·CPI_exe cycles of execution between loads (ld), each load followed by a stall of t_avg cycles]
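Analysis (2/3)
PE stays idle unless filled with instructions from other threads
Each thread occupies the PE for an additional (1/r_m)·CPI_exe cycles
Threads needed to fully utilize each PE:
\[
1 + \frac{t_{avg}}{CPI_{exe} / r_m} = 1 + \frac{r_m \cdot t_{avg}}{CPI_{exe}}
\]
[Figure: timeline (t [cycles]) in which loads (ld) from several threads are interleaved, each thread running for (1/r_m)·CPI_exe cycles while the others wait t_avg cycles]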
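As a worked illustration of the thread count from Analysis (2/3), take hypothetical values that are not from the slides: r_m = 0.2, CPI_exe = 1, and t_avg = 200 cycles. Each thread then runs for 1/0.2 = 5 cycles before stalling for 200 cycles, so 40 other threads are needed to cover the stall:

```latex
\[
1 + \frac{r_m \cdot t_{avg}}{CPI_{exe}} = 1 + \frac{0.2 \cdot 200}{1} = 41 \ \text{threads per PE}
\]
```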
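Analysis (3/3)
Machine utilization η:
\[
\eta = \min\left(1,\ \frac{n_{threads}}{N_{PE} \cdot \left(1 + t_{avg} \cdot \frac{r_m}{CPI_{exe}}\right)}\right)
\]
  n_threads: number of available threads
  1 + t_avg · r_m / CPI_exe: number of threads needed to utilize a single PE
Performance in Operations Per Second [OPS]:
\[
\text{Performance} = N_{PE} \cdot \frac{f}{CPI_{exe}} \cdot \eta \quad [\text{OPS}]
\]
  N_PE · f / CPI_exe: peak performance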
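The utilization and performance expressions from Analysis (3/3) can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: the function names and the example numbers (1024 PEs, 1 GHz, r_m = 0.2, t_avg = 200 cycles) are assumptions chosen for illustration, and t_avg is passed in as a constant instead of being derived from a cache model.

```python
def threads_per_pe(r_m, t_avg, cpi_exe):
    """Threads needed to keep one PE busy: 1 + t_avg * r_m / CPI_exe."""
    return 1.0 + t_avg * r_m / cpi_exe


def utilization(n_threads, n_pe, r_m, t_avg, cpi_exe):
    """Machine utilization: min(1, n_threads / (N_PE * threads needed per PE))."""
    return min(1.0, n_threads / (n_pe * threads_per_pe(r_m, t_avg, cpi_exe)))


def performance_ops(n_threads, n_pe, f_hz, r_m, t_avg, cpi_exe):
    """Performance [OPS] = peak performance (N_PE * f / CPI_exe) * utilization."""
    peak = n_pe * f_hz / cpi_exe
    return peak * utilization(n_threads, n_pe, r_m, t_avg, cpi_exe)


if __name__ == "__main__":
    # Hypothetical machine: 1024 PEs at 1 GHz, CPI_exe = 1.
    for n in (1024, 20992, 41984, 100000):
        print(n, performance_ops(n, 1024, 1e9, 0.2, 200.0, 1.0))
```

With these assumed numbers each PE needs 41 threads, so performance grows linearly with the thread count until 1024 × 41 threads are available and stays at peak afterwards.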
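Performance Model
Performance is the minimum of three limits (PE utilization, off-chip BW, power):
\[
\text{Performance} = \min\left\{
N_{PE} \cdot \frac{f}{CPI_{exe}} \cdot \eta,\ \
\frac{BW_{max}}{r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n_{threads})\right)},\ \
\frac{Power_{max}}{e_{ex} + r_m \cdot \left[P_{hit}(S_{\$}, n_{threads}) \cdot e_{\$} + \left(1 - P_{hit}(S_{\$}, n_{threads})\right) \cdot e_{mem}\right]}
\right\} \quad [\text{OPS}]
\]
Machine utilization:
\[
\eta = \min\left(1,\ \frac{n_{threads}}{N_{PE} \cdot \left(1 + t_{avg} \cdot \frac{r_m}{CPI_{exe}}\right)}\right)
\]
Average memory access time (AMAT):
\[
t_{avg} = P_{hit}(S_{\$}, n_{threads}) \cdot t_{\$} + \left(1 - P_{hit}(S_{\$}, n_{threads})\right) \cdot t_m \quad [\text{cycles}]
\]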
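The Performance Model slide combines a PE-utilization limit, an off-chip bandwidth limit, and a power limit. The sketch below evaluates that minimum in Python under loudly stated assumptions: the hit-rate function is hypothetical (the slides leave P_hit workload-dependent), the power budget `power_max` is read off the slide's third term, bandwidth is given in bytes per second rather than GB/sec, and every numeric value is made up for illustration.

```python
import math


def hit_rate(cache_bytes, n_threads, footprint_bytes=64 * 1024):
    """HYPOTHETICAL hit-rate function (not from the slides): the fraction of a
    per-thread footprint that fits when the cache is split evenly."""
    return min(1.0, cache_bytes / (n_threads * footprint_bytes))


def performance_ops(n_threads, n_pe, f_hz, cpi_exe, r_m, t_cache, t_mem,
                    cache_bytes, bw_max, b_reg, power_max,
                    e_ex, e_cache, e_mem):
    """Performance [OPS] as the minimum of the three limits."""
    p_hit = hit_rate(cache_bytes, n_threads)
    miss = 1.0 - p_hit
    # AMAT: t_avg = P_hit * t_$ + (1 - P_hit) * t_m  [cycles]
    t_avg = p_hit * t_cache + miss * t_mem
    # Machine utilization
    eta = min(1.0, n_threads / (n_pe * (1.0 + t_avg * r_m / cpi_exe)))
    pe_bound = n_pe * (f_hz / cpi_exe) * eta
    # Off-chip bandwidth limit (unbounded when nothing misses)
    bw_bound = math.inf if miss == 0.0 else bw_max / (r_m * b_reg * miss)
    # Power limit: budget divided by energy per operation
    energy_per_op = e_ex + r_m * (p_hit * e_cache + miss * e_mem)
    power_bound = power_max / energy_per_op
    return min(pe_bound, bw_bound, power_bound)


# Assumed example machine; none of these values come from the slides.
EXAMPLE = dict(n_pe=1024, f_hz=1e9, cpi_exe=1.0, r_m=0.2,
               t_cache=1.0, t_mem=200.0, cache_bytes=16 * 2**20,
               bw_max=200e9, b_reg=4, power_max=300.0,
               e_ex=1e-10, e_cache=2e-10, e_mem=2e-9)

if __name__ == "__main__":
    for n in (128, 256, 2048, 100000):
        print(n, performance_ops(n, **EXAMPLE))
```

With this particular hit-rate function, performance rises while all threads fit in the cache, drops once the hit rate collapses, and recovers at high thread counts until the bandwidth limit takes over, which is the valley shape the deck describes.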
Unified Machine Performance
3 regions: Cache efficiency region, The Valley, MT efficiency region
[Plot: Performance vs. # Threads, showing the Cache region, The Valley, and the MT region]
Cache Size Impact
An increase in cache size lets the cache suffice for more in-flight threads
  Extends the $ region
...and is also valuable in the MT region
  Caches reduce off-chip bandwidth, which delays the BW saturation point
[Plot: Performance for Different Cache Sizes (Limited BW); GOPS vs. Number Of Threads (0 to 20000); curves: no $, 16M, 32M, 64M, 128M, perfect $; annotation: increase in cache size]
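The bandwidth point above follows from the model's off-chip limit, BW_max / (r_m · b_reg · (1 − P_hit)): every hit removes traffic from the off-chip link, so the ceiling rises as the hit rate improves. A minimal sketch, with a hypothetical function name and made-up values (200 GB/sec expressed in bytes per second, r_m = 0.2, 4-byte operands):

```python
def bw_limited_ops(bw_max_bytes_per_s, r_m, b_reg_bytes, p_hit):
    """Off-chip bandwidth ceiling on performance [OPS] for a given hit rate."""
    miss = 1.0 - p_hit
    if miss <= 0.0:
        # A perfect cache generates no off-chip traffic, so no BW ceiling.
        return float("inf")
    return bw_max_bytes_per_s / (r_m * b_reg_bytes * miss)


if __name__ == "__main__":
    for p_hit in (0.0, 0.5, 0.9, 1.0):
        print(p_hit, bw_limited_ops(200e9, 0.2, 4, p_hit))
```

Halving the miss rate doubles the ceiling, which is why even a modest hit rate delays bandwidth saturation in the MT region.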
Hit Rate Function Impact
Simulation results from the PARSEC workloads kit
Swaptions: Perfect Valley
[Plot: Swaptions; Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads (0 to 2000); curves: Analytical Model, Simulation, Cache Hit Rate]
Hit Rate Function Impact
Simulation results from the PARSEC workloads kit
Raytrace: Monotonically-increasing performance
[Plot: Raytrace; Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads (0 to 2000); curves: Analytical Model, Simulation, Cache Hit Rate]
Hit Rate Dependency – 3 Classes
Three application families based on cache miss rate dependency:
  A "strong" function of number of threads: f(N^q) when q > 1
  A "weak" function of number of threads: f(N^q) when q ≤ 1
  Not a function of number of threads
[Plots: Performance vs. # Threads for the application families]
Workload Parallelism Impact
Simulation results from the PARSEC workloads kit
Canneal: Not enough parallelism available
[Plot: Canneal; Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads (0 to 2000); curves: Simulation, Analytical Model, Cache Hit Rate]
Summary
A high-level model for many-core engines
  A unified framework for machines and workloads from across the range
A vehicle to derive intuition
  Qualitative study of the tradeoffs
  A tool to understand parameters impact
  Identifies new behaviors and the applications that exhibit them
  Enables reasoning of complex phenomena
First step towards escaping the valley
Thank you
[email protected]
Backup
Model Parameters
Machine parameters:
Parameter   Description
N_PE        Number of PEs (in-order processing elements)
S_$         Cache size [Bytes]
N_max       Maximal number of thread contexts in the register file
CPI_exe     Average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]
f           Processor frequency [Hz]
t_$         Cache latency [cycles]
t_m         Memory latency [cycles]
BW_max      Maximal off-chip bandwidth [GB/sec]
b_reg       Operands size [Bytes]
Workload parameters:
Parameter    Description
n            Number of threads that execute or are in ready state (not blocked) concurrently
r_m          Fraction of instructions accessing memory out of the total number of instructions [0 ≤ r_m ≤ 1]
P_hit(s, n)  Cache hit rate for each thread, when n threads are using a cache of size s
Power parameters:
Parameter      Description
e_ex           Energy per operation [J]
e_$            Energy per cache access [J]
e_mem          Energy per memory access [J]
Power_leakage  Leakage power [W]
Model Validation, PARSEC Workloads
[Plots: Performance (GOPS) and Cache Hit Rate (%) vs. Number Of Threads for Raytrace, Dedup, Canneal, Bodytrack, Swaptions, and Blackscholes; curves in each panel: Analytical Model, Simulation, Cache Hit Rate]
Related Work
Similar approach of using high level models:
  Morad et al., CA-Letters 2005
  Hill and Marty, IEEE Computer 2008
  Eyerman and Eeckhout, ISCA-2010
Related Work
  Agrawal, TPDS-1992
  Saavedra-Barrera and Culler, Berkeley 1991
  Sorin et al., ISCA-1998
  Hong and Kim, ISCA-2009
  Baghsorkhi et al., PPoPP-2010