Performance Predictability Carole’s Group Talk on 5-13-2009.
Performance Predictability
Carole’s Group Talk on 5-13-2009
What if hmmer is the high-priority application?
[Figure: normalized IPC over the application running alone for 9 heterogeneous applications (bzip2, gcc, gobmk, hmmer, lbm, mcf, milc, perlbench, sjeng), un-managed vs. alone; y-axis from 0.5 to 1.]
Can we predict performance trends?
[Figure: performance vs. number of instructions, showing the performance upper bound of the high-priority (HP) application and the performance of the HP application while running with other non-HP applications.]
Can we predict performance trends?
[Figure: performance vs. number of instructions, showing the performance upper bound of the HP application and the performance of the HP application running with other non-HP applications, annotated with the performance improvement due to the new dynamic resource allocation setting.]
If we can accurately predict the performance trend of hmmer by determining the degree of inter-application interference at run time, we can act early!
Studies have shown that there exists a high correlation between application performance and the number of long-latency cache misses.
We can therefore use cache usage (e.g., in the last-level cache) to identify the degree of inter-application interference at run time, and from it predict the performance of applications of interest.
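As one hedged illustration of how that correlation could be exploited (the fitting approach and every number below are my own, not from the talk): fit measured IPC against long-latency miss counts over profiled intervals, then use the fit as a run-time predictor.

```python
# Hypothetical sketch (not the talk's mechanism): exploit the correlation
# between IPC and long-latency LLC misses with a least-squares fit.
# All profiled numbers below are invented for illustration.

def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# (LLC misses per interval, measured IPC) pairs from a profiling phase.
samples = [(1000, 0.95), (5000, 0.80), (9000, 0.62), (13000, 0.47)]
a, b = fit_linear([m for m, _ in samples], [i for _, i in samples])

def predict_ipc(llc_misses):
    """Predict IPC for an interval from its observed LLC miss count."""
    return a * llc_misses + b
```

With the invented samples the slope comes out negative, matching the intuition that more long-latency misses mean lower IPC.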
Outline
• Motivation
• Observation Cache & Observation Sets
• Experimental Setup
• Results
• Conclusion
How do we identify inter-application conflict misses?
[Figure: a cache set holds non-HP lines E, F, G, H and HP lines A, B, C, D (each half ordered LRU to MRU), alongside an observation cache; incoming memory reference: I. A, B, C, D, I are HP references; E, F, G, H are non-HP references.]
How do we identify inter-application conflict misses?
[Figure: the set now holds F, G, H, I and A, B, C, D; incoming memory reference: A. A, B, C, D, I are HP references; E, F, G, H are non-HP references.]
Is A an inter-application conflict miss? No!
How do we identify inter-application conflict misses?
[Figure: the set holds F, G, H, I and A, B, C, D, with a per-set counter of non-HP cache lines (here 3), indexed 0 to set_assoc-1, alongside the observation cache. A, B, C, D, I are HP references; E, F, G, H are non-HP references.]

if ((set_assoc - lru_hit_cnt) <= num_non_HP_lines) {
    inter_app_miss++;
    de_allocate(address);
} else {
    non_inter_app_miss++;
}
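The classification idea on these slides can be sketched in software. The following is a minimal simulation under my own simplifying assumptions (4-way sets for brevity, and the rule reduced to: an HP reference that misses in the shared set but hits in an HP-only observation set counts as an inter-application conflict miss, while a reference that misses in both, e.g. a cold miss, does not):

```python
# Simplified model (my own reading of the slides, not the exact hardware):
# the shared set sees all references, while the observation set sees only
# high-priority (HP) references.

class LRUSet:
    """One cache set modelled as an LRU stack (index 0 = MRU)."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.stack = []

    def access(self, tag):
        """Touch `tag`; return True on hit, False on miss (with fill)."""
        hit = tag in self.stack
        if hit:
            self.stack.remove(tag)
        elif len(self.stack) == self.assoc:
            self.stack.pop()          # evict the true LRU line
        self.stack.insert(0, tag)
        return hit

shared = LRUSet(4)       # sees both HP and non-HP references
observation = LRUSet(4)  # sees HP references only

inter_app_miss = non_inter_app_miss = 0

# A stream in the spirit of the slides: HP lines A-D are pushed out of
# the shared set by non-HP lines E-H, then the HP application re-touches A.
trace = [("A", True), ("B", True), ("C", True), ("D", True),
         ("E", False), ("F", False), ("G", False), ("H", False),
         ("A", True)]

for tag, is_hp in trace:
    if not is_hp:
        shared.access(tag)
        continue
    hit_shared = shared.access(tag)
    hit_observation = observation.access(tag)
    if not hit_shared:
        if hit_observation:
            inter_app_miss += 1       # HP alone would have hit
        else:
            non_inter_app_miss += 1   # misses even with the set to itself
# After the trace: inter_app_miss == 1 (the re-touch of A) and
# non_inter_app_miss == 4 (the four cold misses).
```

The re-touch of A is classified as an inter-application conflict miss precisely because A is still resident in the HP-only observation set.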
Do we really need all this crazy hardware?
• 4MB observation cache + 4096 32-bit counters (one per cache set)
Hopefully not!!
Approach 1: Dynamic profiling within way-partitioning infrastructure
Given a way-partitioned 16-way set-associative cache in which n ways (n < 8) are dedicated to the HP application, use the n ways under the HP application's exclusive access as the observation cache to measure the degree of inter-application interference, and the remaining (16 - n) ways as the normal shared cache.
Approach 1: Dynamic profiling within way-partitioning infrastructure
[Figure: the shared ways hold non-HP lines E, F, G, I while the HP-exclusive ways, acting as the observation cache, hold C, B, A (MRU to LRU); incoming memory reference: H.]
Approach 2: Set sampling
There are 4096 sets in a 16-way set-associative, 4MB cache with 64-byte lines. Spare a small part of the cache as "leader sets" and use the rest as "follower sets" [Utility-Based Cache Partitioning, Qureshi and Patt, MICRO '06].
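The sampling arithmetic can be sketched as follows (a minimal model, not the talk's hardware; the per-set miss counts are invented): monitor misses only in the leader sets and scale up by the sampling factor.

```python
# Minimal model of set sampling. Cache geometry follows the talk's
# 4MB / 16-way / 64B configuration; the SAMPLE_EVERY choice and the
# per-set miss counts are illustrative assumptions.

NUM_SETS = 4096
SAMPLE_EVERY = 32                                 # one leader set per 32
leader_sets = range(0, NUM_SETS, SAMPLE_EVERY)    # 128 leader sets

def estimate_total_misses(per_set_misses):
    """Extrapolate whole-cache misses from leader-set misses alone."""
    leader_misses = sum(per_set_misses[s] for s in leader_sets)
    return leader_misses * SAMPLE_EVERY

# Illustrative trace in which every set happened to see 10 misses:
per_set = [10] * NUM_SETS
estimate = estimate_total_misses(per_set)
# Here the estimate (128 * 10 * 32 = 40960) matches the true total
# exactly; for real, non-uniform traces it is an approximation.
```

The appeal is that only the 128 leader sets need miss counters, instead of one counter per set.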
Approach 2: Set sampling, version 1
[Figure: two copies of the same leader set (MRU to LRU) observe the memory reference stream E9, F4, G3, I2, A0, B5, H8, D9: a shared copy (E1, E2, G8, P1, E4, F6, G1, P5) with miss counter m1', and an HP-exclusive copy (P1, P5, P7, P9, P2, P3, P8, P0) with miss counter m1. Pi are HP references; Ei, Gi are non-HP references.]
The number of inter-application conflict misses for the high-priority application is (m1' - m1)!
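The m1/m1' bookkeeping can be sketched as follows, assuming plain LRU sets (the tags and the reference stream are illustrative, not the slide's exact data): replay the same leader set once HP-only and once shared, and subtract.

```python
# Sketch of version 1's bookkeeping: the HP-exclusive copy of a leader
# set counts m1 misses, the shared copy counts m1' HP misses, and
# m1' - m1 is the inter-application conflict-miss count.

def hp_misses(trace, assoc, hp_only):
    """Replay (tag, is_hp) pairs through one LRU set; count HP misses.
    If hp_only, non-HP references never reach the set."""
    stack, misses = [], 0
    for tag, is_hp in trace:
        if hp_only and not is_hp:
            continue
        if tag in stack:
            stack.remove(tag)         # hit: moved to MRU below
        else:
            if is_hp:
                misses += 1
            if len(stack) == assoc:
                stack.pop()           # evict the LRU line
        stack.insert(0, tag)
    return misses

# Illustrative stream: Pi are HP references, Ei are non-HP references.
trace = [("P1", True), ("P2", True), ("E1", False), ("E2", False),
         ("E3", False), ("P1", True), ("P2", True)]

m1 = hp_misses(trace, assoc=4, hp_only=True)         # 2 cold misses
m1_prime = hp_misses(trace, assoc=4, hp_only=False)  # 4 HP misses
inter_app_conflicts = m1_prime - m1                  # 2 conflict misses
```

In this stream, the non-HP lines E1-E3 push P1 and P2 out of the shared copy, so their re-touches miss there but not in the HP-exclusive copy.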
Approach 2: Set sampling, version 2
[Figure: a leader set (E1, E2, G8, P4, E4, F6, G1, P6, MRU to LRU) extended with an observation set (P1, P5, P7, P9, P2, P3, P8, P0) and a counter of non-HP cache lines (here 6); incoming memory references E, F, G, I and A, B, H, D. Pi are HP references; Ei, Gi are non-HP references.]
Outline
• Motivation
• Observation Cache & Observation Sets
• Experimental Setup
• Results
• Conclusion
Experimental Setup
• GEMS: Simics (in-order) + Ruby (memory module)
• 8-core CMP
• Private L1 cache per core (32KB, 4-way set-associative, 64-byte cache lines)
• Shared L2 cache (4MB, 16-way set-associative, 64-byte cache lines; 4096 cache sets)
• 4 SPEC2006 applications in the workload: bzip2, mcf, gobmk, and hmmer [10 billion cycles]
Miss Identification for Applications in the Workload
[Figure: four plots of number of misses vs. time (cycles) for the bmgh workload (bzip2, mcf, gobmk, hmmer), each showing InterApp_Misses and NonInterApp_Misses.]
How well do observation sets perform [10%]?
[Figure: number of misses and error rate (%) vs. time (cycles) for bzip2 and mcf, comparing InterApp_Misses with InterApp_Miss (Observation Sets).]
How well do observation sets perform [10%]?
[Figure: number of misses and error rate (%) vs. time (cycles) for gobmk and hmmer, comparing InterApp_Misses with InterApp_Miss (Observation Sets).]
# of observation sets vs. prediction accuracy
[Figure: error rate (%) of the observation-set prediction for bzip2, mcf, gobmk, hmmer, and their average, at sampling fractions of 10%, 40%, 64%, 67%, and 100%; y-axis from 0 to 10.]
10% of the 4MB cache as observation sets offers >99% prediction accuracy.
1% of the 4MB cache as observation sets offers >97% prediction accuracy.
Can we accurately predict the performance trend by determining the degree of inter-application interference at run time?
Hopefully, I have convinced you that it is possible!
Conclusion
Dynamic profiling is powerful and feasible: no pre-run (static profiling) ever!
Accurate run-time prediction of performance trends provides:
• Performance predictability
• Quality of service
• Better resource allocation decisions
With 10% of the 4MB cache as observation sets, we can achieve >99% prediction accuracy; with 1% of the 4MB cache, >97% prediction accuracy.
Still awake?
Thank you
Next…
Summer internship with Google on power usage prediction algorithms for Google's data centers. You are welcome to visit me in Mountain View, CA!!!
Cache Occupancy per Application
[Figure: cache occupancy (%) vs. time (cycles) for the bmgh workload, broken down by bzip2, mcf, gobmk, hmmer, and OS activities; y-axis from 0 to 100.]