Design Exploration of an Instruction-Based Shared Markov Table on CMPs
Karthik Ramachandran & Lixin Su
Outline
• Motivation: multiple cores on a single chip; commercial workloads
• Our study
  • Start from instruction sharing pattern analysis (our experiments)
  • Move on to instruction cache miss pattern analysis (our experiments)
• Conclusions
Motivation
• Technology push: CMPs, with lower access latency to other processors
• Application pull: commercial workloads (OS behavior, database applications)
• Opportunities for shared structures: a Markov-based sharing structure to address the large instruction footprint vs. small, fast I-caches
Instruction Sharing Analysis
• How may instruction sharing occur?
  • OS: multiple processes, scheduling
  • DB: concurrent transactions, repeated queries, multiple threads
• How can CMPs benefit from instruction sharing?
  • Snoop/grab instructions from other cores
  • Shared structures
• Let's investigate.
Methodology
Two-step approach:
• Experiment I: instruction trace analysis. How much sharing occurs?
• Experiment II: I-cache miss stream analysis. Examine the potential of a shared Markov structure.
Experiment I
• Add instrumentation code to analyze committed instructions
• Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors
• Histogram-based approach

Example: occurrences of the sequence {A,B} per processor:

    P1     P2     P3     P4
   {A,B}  {A,B}  {A,B}  {A,B}
   {A,B}  {A,B}  {A,B}
   {A,B}  {A,B}
   {A,B}

How do we count?
• P1: 3 times, P2: 1 time, P3: 0 times, P4: 2 times
• Total: 10 times
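The histogram-based counting above can be sketched in a few lines. This is a minimal illustration, not the authors' instrumentation: the traces, the `sequence_repeats` helper, and the rule that a sequence seen k times contributes k-1 repeats are assumptions for demonstration.

```python
from collections import Counter

def sequence_repeats(trace, n):
    """Count how often each n-instruction sequence repeats in one trace.

    A sequence seen k times contributes k - 1 repeats, since the first
    occurrence is not a repeat.
    """
    counts = Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))
    return {seq: k - 1 for seq, k in counts.items() if k > 1}

# Hypothetical per-processor traces of committed instruction addresses.
traces = {
    "P1": ["A", "B", "A", "B", "C", "A", "B"],
    "P2": ["A", "B", "C", "D"],
}

for proc, trace in traces.items():
    print(proc, sequence_repeats(trace, 2))
```

The same helper applies unchanged for sequence lengths 3, 4, and 5.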
Results - Experiment I
Q: Is there any instruction sharing?
A: Maybe. Observe the number of times sequences of 2-5 instructions repeat (~13,000-17,000).
Q: But why do the numbers for a 5-instruction sequence pattern not differ much from those for a 2-instruction pattern?
A: Spin loops! They account for about 50% in the non-warm-up case and about 30% in the warm-up case.
[Figure: Jbb, 100 transactions. Repeat counts of 2-, 3-, 4-, and 5-instruction sequences (seq2-seq5) across transactions; counts range from 0 to 20,000.]
Experiment II
• Focus on instruction cache misses
  • Is there sharing involved here too?
  • Upper bound on the performance benefit of a shared Markov table?
• Experiment setup
  • 16K-entry, fully associative shared Markov table
  • Each entry holds two consecutive misses from the same processor
  • Atomic lookup and hit/miss counter update when a processor has two consecutive I$ misses
  • On a miss, insert a new entry at the LRU head
  • On a hit, record the distance from the LRU head and move the hit entry to the LRU head
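The setup above can be modeled with a small simulation. This is a hedged sketch, not the authors' simulator: the class name, the pair-of-miss-addresses key, and the recorded hit distances are illustrative assumptions.

```python
from collections import OrderedDict

class SharedMarkovTable:
    """Sketch of the shared Markov table: each entry is a pair of consecutive
    I-cache miss addresses, kept in LRU order with global hit/miss counters."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.entries = OrderedDict()   # first key = LRU head, last key = LRU tail
        self.hits = 0
        self.misses = 0
        self.hit_distances = []        # distance from the LRU head on each hit

    def access(self, pair):
        """Look up a (prev_miss, cur_miss) pair; update counters and LRU order."""
        if pair in self.entries:
            self.hits += 1
            self.hit_distances.append(list(self.entries).index(pair))
            self.entries.move_to_end(pair, last=False)  # move hit entry to LRU head
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=True)         # evict the LRU tail
            self.entries[pair] = None
            self.entries.move_to_end(pair, last=False)  # insert at LRU head

# Feed a stream of (processor, miss address) events; consecutive misses
# from the same processor form the lookup key, so a pair first inserted by
# one processor can later hit for another (miss sharing).
table = SharedMarkovTable(capacity=4)
last_miss = {}
for proc, addr in [("P0", "A"), ("P0", "B"), ("P1", "A"), ("P1", "B")]:
    if proc in last_miss:
        table.access((last_miss[proc], addr))
    last_miss[proc] = addr
```

The `hit_distances` list also supports the later table-sizing question: the fraction of hits within a given distance of the LRU head.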
Design Block Diagram
[Diagram: each processor (P) with its own I$, all connected to a shared Markov table and the L2 $.]
• Small, fast shared Markov table
• Prefetch when an I$ miss occurs
[Figure: Table lookup hit ratio. Apache, 100 transactions with cache warm-up; number of lookups (hits and misses) in the shared Markov table for each of the 16 processors, up to ~30,000 lookups.]
Q1: Is there a lot of miss sharing?
Q2: Does a constructive interference pattern exist to help a CMP?
Q3: Do equal opportunities exist for all the processors?
Let's Answer the Questions
A1: Yes, of course.
A2: Definitely; a constructive interference pattern exists, as the figure shows.
A3: Yes. The hit/miss ratio remains fairly stable across processors in spite of the variance in the number of I-cache misses.
[Figure: CMP vs. uP. Hit rate in the Markov table (0 to 0.8) for Zeus, Jbb, and Apache, comparing the CMP and uniprocessor configurations.]
How Big Should the Table Be?
• About 60% of hits are within 4K entries of the LRU head.
• A shared Markov table can effectively exploit I-cache miss sharing.
• What about snooping and grabbing instructions from other I-caches?
[Figure: Size vs. content. Percentage of hits within 1K, 2K, 4K, and 16K entries of the LRU head for Zeus, Jbb, Oltp, and Apache.]
Real Design Issues
• Associativity and size of the table
• Choosing the right path when multiple paths exist
• Separating the address directory from the table's data entries, with multiple address directories
• What if a sequential prefetcher exists?
Conclusions
• Instruction sharing on CMPs exists; spin loops occur frequently in current workloads.
• A Markov-based structure for storing I-cache misses may be helpful on CMPs.
Questions?
Comparison with Real Markov Prefetching

Real Markov prefetching keeps, for each miss address, a list of successor addresses with a counter:
  A → B, C (Cnt 5)
  A → E (Cnt 2)
  A → D, F (Cnt 3)
• On a miss to A, the processor prefetches along A, B, and C.

Our shared table (LRU head at top, LRU tail at bottom), with Hit Cnt 2 and Miss Cnt 3:
  {A,B}
  {A,C}
  {A,D}
  {B,D}
• On misses to A and then C, the processor looks up {A,C} in the table.
• Update the hit/miss counters and change/record the LRU position.
Lookup Example I
Before (LRU head to tail): {A,B}, {A,C}, {A,D}, {B,D}; Hit Cnt 2, Miss Cnt 3.
The processor misses to A and then C and looks up {A,C}: hit.
After: {A,C} moves to the LRU head, giving {A,C}, {A,B}, {A,D}, {B,D}; Hit Cnt 3, Miss Cnt 3.
Lookup Example II
Before (LRU head to tail): {A,B}, {A,C}, {A,D}, {B,D}; Hit Cnt 2, Miss Cnt 3.
The processor misses to C and then D and looks up {C,D}: miss.
After: {C,D} is inserted at the LRU head and the tail entry {B,D} is evicted, giving {C,D}, {A,B}, {A,C}, {A,D}; Hit Cnt 2, Miss Cnt 4.
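The two lookup examples can be replayed with a few lines, assuming a plain list with index 0 as the LRU head; the `lookup` helper and the counter dictionary are hypothetical names, and each example starts from the same initial four-entry table.

```python
def lookup(table, pair, counters):
    """Replay one table lookup: a hit moves the entry to the LRU head;
    a miss evicts the LRU tail and inserts the new entry at the head."""
    if pair in table:
        counters["hit"] += 1
        table.remove(pair)
        table.insert(0, pair)      # move hit entry to LRU head
    else:
        counters["miss"] += 1
        table.pop()                # evict LRU tail (table is full)
        table.insert(0, pair)      # insert new entry at LRU head

initial = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")]

# Example I: misses to A then C, so lookup of {A,C} hits.
t1, c1 = list(initial), {"hit": 2, "miss": 3}
lookup(t1, ("A", "C"), c1)

# Example II: misses to C then D, so lookup of {C,D} misses.
t2, c2 = list(initial), {"hit": 2, "miss": 3}
lookup(t2, ("C", "D"), c2)
```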