Design Exploration of an Instruction-Based Shared Markov Table on CMPs
Karthik Ramachandran & Lixin Su
Outline
• Motivation: multiple cores on a single chip; commercial workloads
• Our study
  • Start from instruction sharing pattern analysis (our experiments)
  • Move on to instruction cache miss pattern analysis (our experiments)
• Conclusions
Motivation
• Technology push: CMPs, with lower access latency to other processors
• Application pull: commercial workloads (OS behavior, database applications)
• Opportunities for shared structures: a Markov-based sharing structure to address the large instruction footprint vs. small, fast I-caches
Instruction Sharing Analysis
• How may instruction sharing occur?
  • OS: multiple processes, scheduling
  • DB: concurrent transactions, repeated queries, multiple threads
• How can CMPs benefit from instruction sharing?
  • Snoop/grab instructions from other cores
  • Shared structures
• Let's investigate.
Methodology
Two-step approach:
• Experiment I: instruction trace analysis. How much sharing occurs?
• Experiment II: I-cache miss stream analysis. Examine the potential of a shared Markov structure.
Experiment I
• Add instrumentation code to analyze committed instructions
• Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors
• Histogram-based approach

Example: occurrences of the sequence {A,B} per processor:

    P1     P2     P3     P4
   {A,B}  {A,B}  {A,B}  {A,B}
   {A,B}  {A,B}  {A,B}
   {A,B}  {A,B}
   {A,B}

How do we count?
• P1: 3 times, P2: 1 time, P3: 0 times, P4: 2 times
• Total: 10 times
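The histogram-based counting above can be sketched in a few lines. This is a minimal illustration, not the authors' instrumentation: the traces, the `sequence_repeats` helper, and the rule that a sequence seen k times contributes k-1 repeats are assumptions for demonstration.

```python
from collections import Counter

def sequence_repeats(trace, n):
    """Count how often each n-instruction sequence repeats in one trace.

    A sequence seen k times contributes k - 1 repeats, since the first
    occurrence is not a repeat.
    """
    counts = Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))
    return {seq: k - 1 for seq, k in counts.items() if k > 1}

# Hypothetical per-processor traces of committed instruction addresses.
traces = {
    "P1": ["A", "B", "A", "B", "C", "A", "B"],
    "P2": ["A", "B", "C", "D"],
}

for proc, trace in traces.items():
    print(proc, sequence_repeats(trace, 2))
```

The same helper applies unchanged for sequence lengths 3, 4, and 5.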
Results - Experiment I
Q: Is there any instruction sharing?
A: Maybe. Observe the number of times sequences of 2-5 instructions repeat (~13,000-17,000).
Q: But why do the numbers for a 5-instruction sequence pattern not differ much from those for a 2-instruction pattern?
A: Spin loops! They account for about 50% in the non-warm-up case and about 30% in the warm-up case.
[Figure: Jbb, 100 transactions. Repeat counts of 2-, 3-, 4-, and 5-instruction sequences (seq2-seq5) across transactions; counts range from 0 to 20,000.]
Experiment II
• Focus on instruction cache misses
  • Is there sharing involved here too?
  • Upper bound on the performance benefit of a shared Markov table?
• Experiment setup
  • 16K-entry, fully associative shared Markov table
  • Each entry holds two consecutive misses from the same processor
  • Atomic lookup and hit/miss counter update when a processor has two consecutive I$ misses
  • On a miss, insert a new entry at the LRU head
  • On a hit, record the distance from the LRU head and move the hit entry to the LRU head
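The setup above can be modeled with a small simulation. This is a hedged sketch, not the authors' simulator: the class name, the pair-of-miss-addresses key, and the recorded hit distances are illustrative assumptions.

```python
from collections import OrderedDict

class SharedMarkovTable:
    """Sketch of the shared Markov table: each entry is a pair of consecutive
    I-cache miss addresses, kept in LRU order with global hit/miss counters."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.entries = OrderedDict()   # first key = LRU head, last key = LRU tail
        self.hits = 0
        self.misses = 0
        self.hit_distances = []        # distance from the LRU head on each hit

    def access(self, pair):
        """Look up a (prev_miss, cur_miss) pair; update counters and LRU order."""
        if pair in self.entries:
            self.hits += 1
            self.hit_distances.append(list(self.entries).index(pair))
            self.entries.move_to_end(pair, last=False)  # move hit entry to LRU head
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=True)         # evict the LRU tail
            self.entries[pair] = None
            self.entries.move_to_end(pair, last=False)  # insert at LRU head

# Feed a stream of (processor, miss address) events; consecutive misses
# from the same processor form the lookup key, so a pair first inserted by
# one processor can later hit for another (miss sharing).
table = SharedMarkovTable(capacity=4)
last_miss = {}
for proc, addr in [("P0", "A"), ("P0", "B"), ("P1", "A"), ("P1", "B")]:
    if proc in last_miss:
        table.access((last_miss[proc], addr))
    last_miss[proc] = addr
```

The `hit_distances` list also supports the later table-sizing question: the fraction of hits within a given distance of the LRU head.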
Design Block Diagram
[Diagram: each processor (P) with its own I$, all connected to a shared Markov table and the L2 $.]
• Small, fast shared Markov table
• Prefetch when an I$ miss occurs
[Figure: Table lookup hit ratio. Apache, 100 transactions with cache warm-up; number of lookups (hits and misses) in the shared Markov table for each of the 16 processors, up to ~30,000 lookups.]
Q1: Is there a lot of miss sharing?
Q2: Does a constructive interference pattern exist to help a CMP?
Q3: Do equal opportunities exist for all the processors?
Let's Answer the Questions
A1: Yes, of course.
A2: Definitely; a constructive interference pattern exists, as the figure shows.
A3: Yes. The hit/miss ratio remains fairly stable across processors in spite of the variance in the number of I-cache misses.
[Figure: CMP vs. uP. Hit rate in the Markov table (0 to 0.8) for Zeus, Jbb, and Apache, comparing the CMP and uniprocessor configurations.]
How Big Should the Table Be?
• About 60% of hits are within 4K entries of the LRU head.
• A shared Markov table can effectively exploit I-cache miss sharing.
• What about snooping and grabbing instructions from other I-caches?
[Figure: Size vs. content. Percentage of hits within 1K, 2K, 4K, and 16K entries of the LRU head for Zeus, Jbb, Oltp, and Apache.]
Real Design Issues
• Associativity and size of the table
• Choosing the right path when multiple paths exist
• Separating the address directory from the table's data entries, with multiple address directories
• What if a sequential prefetcher exists?
Conclusions
• Instruction sharing on CMPs exists; spin loops occur frequently in current workloads.
• A Markov-based structure for storing I-cache misses may be helpful on CMPs.
Questions?
Comparison with Real Markov Prefetching

Real Markov prefetching keeps, for each miss address, a list of successor addresses with a counter:
  A → B, C (Cnt 5)
  A → E (Cnt 2)
  A → D, F (Cnt 3)
• On a miss to A, the processor prefetches along A, B, and C.

Our shared table (LRU head at top, LRU tail at bottom), with Hit Cnt 2 and Miss Cnt 3:
  {A,B}
  {A,C}
  {A,D}
  {B,D}
• On misses to A and then C, the processor looks up {A,C} in the table.
• Update the hit/miss counters and change/record the LRU position.
Lookup Example I
Before (LRU head to tail): {A,B}, {A,C}, {A,D}, {B,D}; Hit Cnt 2, Miss Cnt 3.
The processor misses to A and then C and looks up {A,C}: hit.
After: {A,C} moves to the LRU head, giving {A,C}, {A,B}, {A,D}, {B,D}; Hit Cnt 3, Miss Cnt 3.
Lookup Example II
Before (LRU head to tail): {A,B}, {A,C}, {A,D}, {B,D}; Hit Cnt 2, Miss Cnt 3.
The processor misses to C and then D and looks up {C,D}: miss.
After: {C,D} is inserted at the LRU head and the tail entry {B,D} is evicted, giving {C,D}, {A,B}, {A,C}, {A,D}; Hit Cnt 2, Miss Cnt 4.
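The two lookup examples can be replayed with a few lines, assuming a plain list with index 0 as the LRU head; the `lookup` helper and the counter dictionary are hypothetical names, and each example starts from the same initial four-entry table.

```python
def lookup(table, pair, counters):
    """Replay one table lookup: a hit moves the entry to the LRU head;
    a miss evicts the LRU tail and inserts the new entry at the head."""
    if pair in table:
        counters["hit"] += 1
        table.remove(pair)
        table.insert(0, pair)      # move hit entry to LRU head
    else:
        counters["miss"] += 1
        table.pop()                # evict LRU tail (table is full)
        table.insert(0, pair)      # insert new entry at LRU head

initial = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")]

# Example I: misses to A then C, so lookup of {A,C} hits.
t1, c1 = list(initial), {"hit": 2, "miss": 3}
lookup(t1, ("A", "C"), c1)

# Example II: misses to C then D, so lookup of {C,D} misses.
t2, c2 = list(initial), {"hit": 2, "miss": 3}
lookup(t2, ("C", "D"), c2)
```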