CACM July 2012
Transcript of CACM July 2012
1
Talk: Mark D. Hill, University of Wisconsin, at Cornell University, 10/2012
2
Executive Summary
• Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW
• As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing overheads
• Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's
Compatibility of on-chip HW coherence is here to stay
Let's spend programmer sanity on parallelism, not lost compatibility!
3
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication
2. Storage
3. Enforcing Inclusion
4. Latency
5. Energy
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
4
Academics Criticize HW Coherence
• Choi et al. [DeNovo]: "Directory…coherence…extremely complex & inefficient…. Directory…incurring significant storage and invalidation traffic overhead."
• Kelm et al. [Cohesion]: "A software-managed coherence protocol…avoids…directories and duplicate tags, & implementing & verifying…less traffic…"
5
Industry Eschews HW Coherence
• Intel 48-Core IA-32 Message-Passing Processor: "…SW protocols…to eliminate the communication & HW overhead"
• IBM Cell processor: "…the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory" BUT…
6
Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," Micro 2011.
7
Define "Coherence as Scalable"
• Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases
• Our Focus
o YES: coherence
o NO: Any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps)
• Method
o Identify each overhead & show it can grow slowly
• Expect more cores
o Moore's Law provides more transistors
o Power-efficiency improvements (w/o Dennard Scaling)
o Experts disagree on how many cores are possible
8
Caches & Coherence
• Cache: fast, hidden memory, to reduce
o Latency: average memory access time
o Bandwidth: interconnect traffic
o Energy: cache misses cost more energy
• Caches hidden (from software)
o Naturally for single-core system
o Via Coherence Protocol for multicore
• Maintain coherence invariant
o For a given (memory) block at a given time, either
o Modified (M): A single core can read & write
o Shared (S): Zero or more cores can read, but not write
[Figure: C cores, each with a private cache (state ~2 bits, tag ~64 bits, block data ~512 bits), connected by an interconnection network to a shared cache whose blocks add ~C tracking bits.]
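The single-writer/multiple-reader invariant above can be sketched as a toy directory in Python. This is an illustrative model under my own naming (the `Directory` class and its methods are assumptions, not the talk's protocol):

```python
# Toy model of the coherence invariant: per block, either one writer
# (Modified) or any number of readers (Shared). Illustrative only.

class Directory:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.state = {}    # block -> 'I', 'S', or 'M'
        self.sharers = {}  # block -> set of core ids holding a copy

    def read(self, core, block):
        # A read downgrades any current writer; all holders may then read.
        self.state[block] = 'S'
        self.sharers.setdefault(block, set()).add(core)

    def write(self, core, block):
        # A write invalidates all other copies: exactly one writer remains.
        self.sharers[block] = {core}
        self.state[block] = 'M'

    def invariant_holds(self, block):
        st = self.state.get(block, 'I')
        holders = self.sharers.get(block, set())
        if st == 'M':
            return len(holders) == 1   # a single core can read & write
        if st == 'I':
            return len(holders) == 0   # no cached copies
        return True                    # 'S': zero or more readers
```

For example, two reads of a block leave it Shared at both cores; a later write shrinks the sharer set back to a single Modified copy.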
9
Baseline Multicore Chip
• Intel Core i7-like
• C = 16 cores (not 8)
• Private L1/L2 caches
• Shared last-level cache (LLC)
• 64B blocks w/ ~8B tag
• HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle)
10
Baseline Chip Coherence
• 2B per 64+8B L2 block to track L1 copies
• Inclusive L2 (w/ recall messages on LLC evictions)
11
Coherence Example Setup
• Block A in no private caches: state Invalid (I)
• Block B in no private caches: state Invalid (I)
[Figure: 4 cores (0-3) with private caches; shared cache banks 0-3 hold A: {0000} I and B: {0000} I.]
12
Coherence Example 1/4
• Block A at Core 0 exclusive read-write: Modified(M)
[Figure: Core 0 issues "Write A"; A becomes M in Core 0's private cache; shared cache entry A: {1000} M.]
13
Coherence Example 2/4
• Block B at Cores 1+2 shared read-only: Shared (S)
[Figure: Cores 1 & 2 issue "Read B"; B becomes S in both private caches; shared cache entry B: {0110} S.]
14
Coherence Example 3/4
• Block A moved from Core 0 to 3 (still M)
[Figure: Core 3 issues "Write A"; A leaves Core 0's private cache and becomes M in Core 3's; shared cache entry A: {0001} M.]
15
Coherence Example 4/4
• Block B moved from Cores 1+2 (S) to Core 1 (M)
[Figure: Core 1 issues "Write B"; B is invalidated in Core 2's private cache and becomes M in Core 1's; shared cache entries A: {0001} M, B: {1000} M.]
16
Caches & Coherence
17
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
18
1. Communication: (a) No Sharing, Dirty
o W/o coherence: Request → Data → Data (writeback)
o W/ coherence: Request → Data → Data (writeback) → Ack
o Overhead = 8/(8+72+72) ≈ 5% (independent of #cores!)
Key: green for required, red for overhead; thin arrows are 8-byte control messages, thick are 72-byte data messages.
19
1. Communication: (b) No Sharing, Clean
o W/o coherence: Request → Data
o W/ coherence: Request → Data → (Evict) → Ack
o Overhead = 16/(8+72) = 10-20% (independent of #cores!)
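The overhead figures for cases (a) and (b) follow directly from the assumed message sizes (8-byte control, 72-byte data: 64B block + 8B header). A quick arithmetic check:

```python
# Arithmetic behind the two slides' overhead claims, assuming 8-byte
# control messages and 72-byte data messages.

CTRL, DATA = 8, 72

# (a) No sharing, dirty block: coherence adds one Ack to
#     Request -> Data -> Data (writeback).
overhead_dirty = CTRL / (CTRL + DATA + DATA)        # 8/152

# (b) No sharing, clean block: coherence adds an Ack and (optionally)
#     an explicit Evict message to Request -> Data.
overhead_clean_min = CTRL / (CTRL + DATA)           # 8/80  = 10%
overhead_clean_max = (CTRL + CTRL) / (CTRL + DATA)  # 16/80 = 20%

print(f"{overhead_dirty:.1%}")   # 5.3% -- the slide's ~5%
print(f"{overhead_clean_min:.0%} - {overhead_clean_max:.0%}")  # 10% - 20%
```

Neither ratio depends on the core count, which is the point of the argument.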
20
1. Communication: (c) Sharing, Read
o To memory: Request → Data
o To one other core: Request → Forward → Data → (Cleanup)
o Charge 1-2 control messages (independent of #cores!)
21
1. Communication: (d) Sharing, Write
o If Shared at C other cores:
o Request → {Data, C Invalidations + C Acks} → (Cleanup)
o Needed since most directory protocols send invalidations to caches that have, & sometimes do not have, copies
o Not Scalable
22
1. Communication: Extra Invalidations
o Core 1 Read: Request → Data
o Core C Write: Request → {Data, 2 Inv + 2 Acks} → (Cleanup)
o Charge Write for all necessary & unnecessary invalidations
o What if all invalidations necessary? Charge reads that get data!
[Coarse sharer vector, one bit per pair of cores {1|2, 3|4, …, C-1|C}: {0 0 .. 0}, {1 0 .. 0}, {0 0 .. 1}]
23
1. Communication: No Extra Invalidations
o Core 1 Read: Request → Data + {Inv + Ack} (in future)
o Core C Write: Request → Data → (Cleanup)
o If all invalidations necessary, coherence adds bounded overhead to each miss -- independent of #cores!
[Exact sharer vector, one bit per core {1, 2, 3, 4, …, C-1, C}: {0 0 0 0 .. 0 0}, {1 0 0 0 .. 0 0}, {0 0 0 0 .. 0 1}]
24
1. Communication Overhead
(1) Communication overhead bounded & scalable
(a) Without Sharing & Dirty
(b) Without Sharing & Clean
(c) Shared Read Miss (charge future inv + ack)
(d) Shared Write Miss (not charged for inv + acks)
• But depends on tracking exact sharers (next)
25
Total Communication: C Read Misses per Write Miss
[Charts: bytes per miss (0-700) vs. read misses per write miss (1-1024), for 2-512 cores. With exact tracking (unbounded storage), bytes per miss stay low across core counts; with inexact tracking (32b coarse vector), traffic grows with core count.]
How to get the performance of "exact" w/ reasonable storage?
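One way to see why the inexact 32-bit coarse vector sends more traffic: each bit stands for a whole group of cores, so a single sharer drags its entire group into the invalidation set. A hypothetical sketch (the function names are mine):

```python
# Exact vs. coarse sharer tracking. Illustrative sketch: a coarse vector
# with a fixed number of bits covers groups of cores, so writes may
# invalidate cores that hold no copy.

def exact_invalidations(sharers):
    """Exact bit vector: invalidate exactly the caching cores."""
    return set(sharers)

def coarse_invalidations(sharers, num_cores, vector_bits=32):
    """Coarse vector: one bit covers num_cores // vector_bits cores."""
    group = max(1, num_cores // vector_bits)
    targets = set()
    for s in sharers:
        base = (s // group) * group
        targets.update(range(base, base + group))  # whole group invalidated
    return targets

# With 512 cores and a 32-bit coarse vector, one sharer triggers
# invalidations to its 16-core group:
print(len(coarse_invalidations({7}, 512)))  # 16
print(len(exact_invalidations({7})))        # 1
```

Exact tracking needs one bit per core (unbounded as chips grow flat), which is what motivates the hierarchical storage scheme that follows.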
26
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
27
2. Storage Overhead (Small Chip)
• Track up to C = #readers (cores) per LLC block
• Small #cores: C-bit vector acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes ≈ 3%
28
2. Storage Overhead (Larger Chip)
• Use Hierarchy!
[Figure: K clusters, each with K cores (private caches) on an intra-cluster interconnection network feeding a cluster cache with per-core tracking bits; an inter-cluster network connects cluster caches to a shared last-level cache whose tracking bits record which clusters share each block.]
29
2. Storage Overhead (Larger Chip)
• Medium-large #cores: Use hierarchy!
o Cluster: K1 cores with L2 cluster cache
o Chip: K2 clusters with L3 global cache
o Enables K1*K2 cores
• E.g., 16 16-core clusters
o 256 cores (16*16)
o 3% storage overhead!!
• More generally?
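The hierarchy argument can be checked with back-of-envelope arithmetic, assuming ~1 tracking bit per core (or per cluster) for each 72-byte (576-bit) cached block; the model below is a rough sketch, not the paper's exact accounting:

```python
# Storage overhead of sharer tracking, per 72-byte block
# (64B data + ~8B tag/state = 576 bits).

BLOCK_BITS = (64 + 8) * 8  # 576 bits per tracked block

def flat_overhead(cores):
    """Flat bit vector: one tracking bit per core per block."""
    return cores / BLOCK_BITS

def hierarchical_overhead(k1, k2):
    """Two levels: cluster caches track k1 cores; the global cache tracks
    k2 clusters. Each level adds about the same per-block fraction."""
    return max(k1, k2) / BLOCK_BITS

print(f"{flat_overhead(16):.1%}")              # ~2.8% for 16 cores
print(f"{flat_overhead(256):.1%}")             # ~44% flat -- not scalable
print(f"{hierarchical_overhead(16, 16):.1%}")  # ~2.8% for 16*16 = 256 cores
```

With 16 clusters of 16 cores, no level ever tracks more than 16 sharers, so the ~3% overhead of the 16-core chip carries over to 256 cores.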
30
Storage Overhead for Scaling
(2) Hierarchy enables scalable storage
16 clusters of 16 cores each
31
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
32
3. Enforcing Inclusion (Subtle)
• Inclusion: Block in a private cache ⇒ in shared cache
+ Augment shared cache to track private cache sharers (as assumed)
- Replace in shared cache ⇒ replace in private caches
- Make impossible?
- Requires too much shared cache associativity
- E.g., 16 cores w/ 4-way caches ⇒ 64-way assoc
- Use recall messages
• Make recall messages necessary & rare
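The associativity blow-up that rules out "make recalls impossible" is a worst-case counting argument: every core can pin as many conflicting blocks as its private cache has ways, and an inclusive shared cache must hold all of them in one set. A small sketch of that arithmetic:

```python
# Worst-case shared-cache associativity needed to guarantee that no
# inclusion victim ever forces a recall: every core can simultaneously
# hold private_assoc blocks mapping to the same shared-cache set.

def required_shared_assoc(num_cores, private_assoc):
    return num_cores * private_assoc

print(required_shared_assoc(16, 4))  # 64-way, as on the slide
```

Since 64-way (and beyond, as cores scale) is impractical, the design instead tolerates recalls and works to make them necessary and rare.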
33
Inclusion Recall Example
• Shared cache miss to new block C
• Needs to replace (victimize) block B in shared cache
• Inclusion forces replacement of B in private caches
[Figure: Core 3 issues "Write C"; the shared cache victimizes B, sending recall messages that remove B from the private caches holding it.]
34
Make All Recalls Necessary
Exact state tracking (covered earlier)
+
L1/L2 replacement messages (even clean)
=
Every recall message finds a cached block
⇒
Every recall message is necessary & occurs after a cache miss (bounded overhead)
35
Make Necessary Recalls Rare
• Recalls naturally rare when Shared Cache Size / Σ Private Cache Sizes > 2
(3) Recalls made rare
Assume misses to random sets [Hill & Smith 1989]
[Chart: percentage of misses causing recalls (0-100%) vs. ratio of aggregate private cache capacity to shared cache capacity (0-8), for shared cache associativities of 1, 2, 4, and 8 ways. Recalls are rare in the expected design space (small ratio, e.g., Core i7) and grow as the ratio increases.]
36
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
37
4. Latency Overhead – Often None
1. None: private hit
2. "None": private miss + "direct" shared cache hit
3. "None": private miss + shared cache miss
4. BUT …
38
4. Latency Overhead -- Some
4. 1.5-2X: private miss + shared cache hit with indirection(s)
• How bad?
39
4. Latency Overhead -- Indirection
4. 1.5-2X: private miss + shared cache hit with indirection(s)
(interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect)
• Acceptable today
• Relative latency similar w/ more cores/hierarchy
• Vs. magically having data at shared cache
(4) Latency overhead bounded & scalable
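The 1.5-2X ratio above can be illustrated with assumed per-hop latencies; the cycle counts below are illustrative placeholders, not measurements:

```python
# Rough model of indirection latency: a base miss traverses
# interconnect -> shared cache -> interconnect; each indirection adds
# one more interconnect hop and one more cache lookup.

def miss_latency(interconnect, cache, indirections=0):
    hops = 2 + indirections      # interconnect traversals
    lookups = 1 + indirections   # cache lookups
    return hops * interconnect + lookups * cache

direct = miss_latency(interconnect=20, cache=10)                  # 50 cycles
indirect = miss_latency(interconnect=20, cache=10, indirections=1)  # 80 cycles
print(indirect / direct)  # 1.6 -- within the slide's 1.5-2X range
```

The ratio depends only on relative hop costs, not on the core count, which is why the overhead stays bounded as chips scale.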
40
5. Energy Overhead
• Dynamic -- Small
o Extra message energy: traffic increase small/bounded
o Extra state lookup: small relative to cache block lookup
o …
• Static -- Also Small
o Extra state: state increase small/bounded
o …
• Little effect on energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, …
(5) Energy overhead bounded & scalable
41
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle): apply analysis to caches used by AMD
Criticisms & Summary
42
Review Inclusive Shared Cache
• Inclusive Shared Cache:
o Block in a private cache ⇒ in shared cache
o Blocks must be cached redundantly
Shared cache block: tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits), block data (~512 bits)
43
Non-Inclusive Shared Cache
[Figure: C cores with private caches connect to two shared structures:]
1. Non-Inclusive Shared Cache: state + tag (~64 bits) + block data (~512 bits); any size or associativity; avoids redundant caching; allows victim caching
2. Inclusive Directory (probe filter): tracking bits (~1 bit per core) + state (~2 bits) + tag (~64 bits); dataless; ensures coherence but duplicates tags
44
Non-Inclusive Shared Cache
• Non-Inclusive Shared Cache: Data Block + Tag (any configuration)
• Inclusive Directory: Tag (again) + State
• Inclusive Directory == coherence state overhead
• WITH TWO LEVELS
o Directory size proportional to sum of private cache sizes
o 64b/(48b+512b) * 2 (for rare recalls) ≈ 22% * Σ L1 size
• Coherence overhead higher than w/ inclusion
L2 / ΣL1s:  1     2     4     8
Overhead:   11%   7.6%  4.6%  2.5%
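The table's numbers fall out of the ~22% figure: the dataless directory costs 2 * 64b per 560b (48b tag + 512b data) of private-cache capacity, and that fixed cost shrinks relative to total cache capacity as the L2 grows. A sketch of that accounting:

```python
# Two-level non-inclusive directory overhead: the inclusive directory
# duplicates a 64-bit tag+state entry per 560-bit private-cache block,
# doubled to keep recalls rare => ~22.9% of aggregate L1 capacity.

DIR_FRACTION = 2 * 64 / (48 + 512)  # ~0.229 of SumL1s

def overhead(l2_over_l1s):
    """Directory size relative to total (SumL1s + L2) cache capacity,
    normalizing SumL1s = 1."""
    return DIR_FRACTION / (1 + l2_over_l1s)

for r in (1, 2, 4, 8):
    print(f"L2/SumL1s={r}: {overhead(r):.1%}")
# 11.4%, 7.6%, 4.6%, 2.5% -- matching the table
```

So even without inclusion, a larger shared cache drives the coherence storage overhead down toward the inclusive design's ~3%.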
45
Non-Inclusive Shared Caches
WITH THREE LEVELS
• Cluster has L2 cache & cluster directory
o Cluster directory points to cores w/ L1 block (as before)
o (1) Size = 22% * ΣL1 sizes
• Chip has L3 cache & global directory
o Global directory points to clusters w/ block in
o (2) Cluster directory, for size 22% * ΣL1s +
o (3) Cluster L2 cache, for size 22% * ΣL2s
• Hierarchical overhead higher than w/ inclusion
• Hierarchical overhead higher than w/ inclusion
L3 / ΣL2s = L2 / ΣL1s:  1     2     4     8
Overhead (1)+(2)+(3):   23%   13%   6.5%  3.1%
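The three-level table follows the same accounting, with cache sizes growing geometrically by the ratio r = L3/ΣL2s = L2/ΣL1s; the code below is my reconstruction of how terms (1)+(2)+(3) combine:

```python
# Three-level non-inclusive overhead, normalizing SumL1s = 1 so that
# SumL2s = r and L3 = r*r. Directory terms from the slide:
#   (1) cluster directories over the L1s: 0.229 * 1
#   (2) global directory over the L1s:    0.229 * 1
#   (3) global directory over the L2s:    0.229 * r

DIR_FRACTION = 2 * 64 / (48 + 512)  # ~0.229, as in the two-level case

def overhead3(r):
    directories = DIR_FRACTION * (1 + 1 + r)
    data = 1 + r + r * r  # SumL1s + SumL2s + L3
    return directories / data

for r in (1, 2, 4, 8):
    print(f"r={r}: {overhead3(r):.1%}")
# 22.9%, 13.1%, 6.5%, 3.1% -- matching the table
```

As with two levels, growing the per-level cache ratio amortizes the duplicated tags, so the hierarchical non-inclusive design also scales.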
46
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
Some Criticisms
(1) Where are workload-driven evaluations?
o Focused on robust analysis of first-order effects
(2) What about non-coherent approaches?
o Showed that compatible coherence scales
(3) What about protocol complexity?
o We have such protocols today (& ideas for better ones)
(4) What about multi-socket systems?
o Apply non-inclusive approaches
(5) What about software scalability?
o Hard SW work need not re-implement coherence
48
Executive Summary
• Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW
• As #cores per chip scales?
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing overheads
• Develop scalable on-chip coherence proof-of-concept
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's
Compatibility of on-chip HW coherence is here to stay
Let's spend programmer sanity on parallelism, not lost compatibility!
49
Coherence NOT this Awkward