Snoop Filtering and Coarse-Grain Memory Tracking


Snoop Filtering and Coarse-Grain Memory Tracking
Andreas Moshovos, Univ. of Toronto/ECE
Short Course at the University of Zaragoza, July 2009
Some slides by J. Zebchuk or the original paper authors

JETTY: Snoop-Filtering for Reduced Power in SMP Servers
Andreas Moshovos, ECE, Univ. of Toronto
Babak Falsafi, ECE, Carnegie Mellon
Gokhan Memik, ECE, Northwestern
Alok Choudhary, ECE, Northwestern
Int'l Symposium on High-Performance Computer Architecture (HPCA), 2001

Power is Becoming Important
• Architecture is a science of tradeoffs
• Thus far: Performance vs. Cost vs. Complexity
• Today: Performance vs. Cost vs. Complexity vs. Power
• Where?
  – Mobile devices
  – Desktops/servers (our focus)

Power-Aware Servers
• Revisit the design of SMP servers
  – 2 or more CPUs per machine
  – Snoop-coherence based
• Why?
  – File, web, databases, your typical desktop workloads
  – Cost effective too
• This work is a first step: power-aware snoopy coherence

Power-Aware Snoop Coherence
• Conventional
  – All L2 caches snoop all memory traffic
  – Power is expended by all caches on any memory access
• Jetty-enhanced
  – Tiny structure on the L2 backside
  – Filters most "would-be misses"
  – Less power expended on most snoop misses
  – No changes to the protocol necessary
  – No performance loss

Roadmap
• Why Power is a Concern for Servers
• Snoopy-Coherence Basics
• An Opportunity for Reducing Power
• JETTY
• Results
• Summary

Why is Power Important?
Power could ultimately limit performance
• Power demands have been increasing
• Energy must be delivered to and on the chip
• Heat must be dissipated
• Limits:
  – Amount of resources & frequency
  – Feasibility
• Cooling is a solution, but at what cost and integration effort?
• Reducing power demands is much more convenient

What Can Be Done?
• Redesign circuits
• Clock gating and frequency scaling
  – A lot has been done thus far
  – Still active
• Rethink architectural decisions
  – Orthogonal to the others
Goal: reduce power under performance constraints

The "Silver Bullet" Solution
• It would be good if there were one
• However, until one is found:
  – Look at all structures
  – Rethink their design
  – Propose power-optimized versions
• This is what we already do for performance

Snoopy Cache Coherence
• All L2 tags see all bus accesses
• A cache intervenes when necessary
[Figure: CPU cores with L1 and L2 caches on a shared bus to main memory; a snoop hits in a remote L2]

How About Power?
• All L2 tags see all bus accesses
• Performance & complexity: we have the L2 tags, so why not use them
• Power: all L2 tags consume power on all accesses
[Figure: the same SMP; a snoop misses in every remote L2, yet every tag array is looked up]

JETTY: A Would-Be Snoop-Miss Filter
• Imprecise: may filter a would-be miss, but never filters a snoop hit
[Figure: on a would-be snoop miss, JETTY answers "Not here!" and the L2 tag lookup is skipped; on a would-be snoop hit, JETTY answers "Don't know" and the snoop proceeds to the L2 tags]
• Detect most misses using fewer resources

Potential for Savings Exists
• Most snoops miss: 91% on average
• Many L2 accesses are due to snoop misses: 55% on average
• Sizeable potential power savings: 20%-50% of total L2 power

Exclude-Jetty
• A subset of what is not cached
[Figure: the Exclude-Jetty covers a subset of the "not cached" portion of the address space]
• How? Cache recent snoop misses locally

Exclude-Jetty
• A subset of what you don't have
• Works well for producer-consumer sharing
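As a concrete illustration, here is a minimal Python sketch of the exclude-Jetty idea: a small set-associative cache of recent snoop misses, so a hit means the block is definitely absent from the local L2 and the snoop can be filtered. The sizes, set mapping, and replacement policy are illustrative assumptions, not the paper's exact design.

```python
class ExcludeJetty:
    """Sketch of an exclude-Jetty: caches block addresses that recently missed
    in the local L2, so repeat snoops to them can be filtered.
    Geometry and replacement are illustrative assumptions."""

    def __init__(self, num_sets=32, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each set holds block tags

    def _index(self, block_addr):
        return block_addr % self.num_sets

    def filter_snoop(self, block_addr):
        """True -> the block is definitely NOT in L2; skip the tag lookup."""
        return block_addr in self.sets[self._index(block_addr)]

    def record_snoop_miss(self, block_addr):
        """Called after a snoop actually missed in the L2 tags."""
        s = self.sets[self._index(block_addr)]
        if block_addr not in s:
            if len(s) >= self.ways:
                s.pop(0)        # FIFO replacement within the set
            s.append(block_addr)

    def invalidate(self, block_addr):
        """The local L2 allocates the block: it is no longer 'definitely absent'."""
        s = self.sets[self._index(block_addr)]
        if block_addr in s:
            s.remove(block_addr)
```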

Include-Jetty
• A superset of what is cached
[Figure: the Include-Jetty covers all of the "cached" portion of the address space plus some of the "not cached" portion]
• How? Well...

Include-Jetty
[Figure: the address is hashed by f( ), g( ), and h( ) to index bit vectors 0, 1, and 2]
• Not cached: any indexed bit is zero
• May be cached: all indexed bits are set
Later I was told this is a Bloom filter…
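A minimal Python sketch of the lookup just described: several hash functions each index a bit vector, and a snoop can be filtered only when at least one indexed bit is zero. The vector sizes and hash functions are illustrative assumptions.

```python
class IncludeJetty:
    """Bloom-filter-style include-Jetty: tracks a superset of the cached blocks.
    Vector sizes and hash functions are illustrative assumptions."""

    def __init__(self, vector_bits=1024, num_vectors=3):
        self.vectors = [[0] * vector_bits for _ in range(num_vectors)]
        self.bits = vector_bits

    def _indexes(self, block_addr):
        # One simple hash per vector; real designs select address bit fields.
        for i in range(len(self.vectors)):
            yield i, (block_addr * (2 * i + 3) + i) % self.bits

    def insert(self, block_addr):
        """Called when the local cache allocates a block."""
        for v, idx in self._indexes(block_addr):
            self.vectors[v][idx] = 1

    def may_be_cached(self, block_addr):
        """All indexed bits set -> may be cached (must snoop).
        Any zero bit -> definitely not cached (snoop filtered)."""
        return all(self.vectors[v][idx] for v, idx in self._indexes(block_addr))
```

Plain bit vectors cannot reflect block evictions, which is what motivates the counting variant mentioned on the next slide.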

Include-Jetty
• A superset of what you have
• This is a counting Bloom filter; see: L-CBF: A Low Power, Fast Counting Bloom Filter Implementation, Elham Safi, Andreas Moshovos and Andreas Veneris, Proc. Int'l Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006
• Partially overlapping indexes worked better
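Because cached blocks are also evicted, the bits become counters. A hedged sketch of that counting-Bloom-filter variant, again with illustrative sizes and hashes:

```python
class CountingBloomJetty:
    """Counting-Bloom-filter include-Jetty (sketch): counters let evictions be
    reflected, keeping the filter a superset of what is actually cached."""

    def __init__(self, counters=1024, num_arrays=3):
        self.arrays = [[0] * counters for _ in range(num_arrays)]
        self.size = counters

    def _indexes(self, block_addr):
        for i in range(len(self.arrays)):
            yield i, (block_addr * (2 * i + 3) + i) % self.size

    def on_allocate(self, block_addr):
        for a, idx in self._indexes(block_addr):
            self.arrays[a][idx] += 1

    def on_evict(self, block_addr):
        for a, idx in self._indexes(block_addr):
            self.arrays[a][idx] -= 1

    def may_be_cached(self, block_addr):
        # Any zero counter -> the block is definitely not cached locally.
        return all(self.arrays[a][idx] > 0 for a, idx in self._indexes(block_addr))
```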

Hybrid-Jetty
• In some cases Exclude-Jetty works well; in others Include-Jetty is better
• Combine the two:
  – Access both in parallel on a snoop
  – Allocation:
    • IJ is always updated
    • If IJ fails to filter, allocate in EJ
    • EJ coverage increases
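A sketch of the hybrid lookup and allocation policy listed above, wired to the IncludeJetty and ExcludeJetty sketches from earlier; the policy follows the slide, while the interfaces are assumptions.

```python
class HybridJetty:
    """Hybrid-Jetty (sketch): consult include- and exclude-Jetty in parallel.
    IJ always tracks allocations; EJ is allocated only when IJ fails to filter."""

    def __init__(self, include_jetty, exclude_jetty):
        self.ij = include_jetty   # e.g., the IncludeJetty sketch above
        self.ej = exclude_jetty   # e.g., the ExcludeJetty sketch above

    def filter_snoop(self, block_addr):
        """True -> the snoop can skip the L2 tag lookup."""
        if not self.ij.may_be_cached(block_addr):
            return True                           # IJ proves the block is absent
        return self.ej.filter_snoop(block_addr)   # EJ may know it is absent

    def on_snoop_result(self, block_addr, hit_in_l2):
        # Neither structure filtered and the L2 lookup missed: remember it in EJ.
        if not hit_in_l2:
            self.ej.record_snoop_miss(block_addr)

    def on_local_allocate(self, block_addr):
        self.ij.insert(block_addr)        # IJ always tracks local allocations
        self.ej.invalidate(block_addr)    # the block is now present locally
```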

Latency?
• Jetty may increase the snoop-response time
• Can only be determined on a design-by-design basis
• The largest Jetty amounts to five 32x32-bit register files

Results
• Used SPLASH-II
  – Scientific applications
  – "Large" datasets
    • e.g., 4-80 MB of main memory allocated
    • Access counts: 60M-1.7B
• 4-way SMP, MOESI
  – 1MB direct-mapped L2, 64B blocks with 32B subblocks
  – 32KB direct-mapped L1, 32B blocks
• Coverage & power (analytical model)

Coverage: Hybrid-Jetty
• Can capture 74% of all snoop misses
[Figure: per-benchmark coverage (SPLASH-II benchmarks and AVG) for three Hybrid-Jetty configurations (10x4x7 + 32x4, 9x4x7 + 32x4, 8x4x7 + 32x4); higher is better]

Power Savings
• 28% of overall L2 power
[Figure: per-benchmark L2 power savings (SPLASH-II benchmarks and AVG); higher is better]

Summary
• Power is becoming important
  – For performance, reliability, and feasibility
• Unique opportunities exist for servers
• JETTY: filter snoops that would miss
  – 74% of all snoops filtered
  – 28% of L2 power saved
  – No protocol changes
  – No performance loss

Power-Efficient Cache Coherence
C. Saldanha, M. Lipasti
Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001

Serial Snooping
• Avoids speculative transmission of snoop packets
• Check the nearest neighbor first
• Data supplied with minimum latency and power

TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors
Magnus Ekman, *Fredrik Dahlgren, and Per Stenström
Chalmers University of Technology / *Ericsson Mobile Platforms
Int'l Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002

Page Sharing Tables
• On a snoop, the requesting node gets a page-level sharing vector
• A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs
• If a PST entry is evicted, the whole page must be evicted

RegionScout: Exploiting Coarse-Grain Sharing in Snoop Coherence
Andreas Moshovos (moshovos@eecg.toronto.edu)
Int'l Symposium on Computer Architecture (ISCA), 2005

Improving Snoop Coherence
[Figure: a CMP with CPUs and their I$/D$ caches connected through an interconnect to main memory]
• Conventional considerations: complexity and correctness, not power or bandwidth
• Can we (1) reduce power and bandwidth while (2) still leveraging snoop coherence? It remains attractive: simple, and enables design re-use
• Yes: exploit program behavior to dynamically identify requests that do not need snooping

RegionScout: Avoid Some Snoops
• Frequent case: non-sharing even at a coarse level (a Region)
• RegionScout: dynamically identify non-shared Regions
  – The first request to a Region identifies it as not shared
  – Subsequent requests do not need to be broadcast
• Uses imprecise information
  – Small structures
  – Layered on top of conventional coherence
  – No additional constraints

Roadmap
• Conventional coherence: the need for power-aware designs
• Potential: program behavior
• RegionScout: what and how
• Implementation
• Evaluation
• Summary

Coherence Basics
• Given a request for memory block X (an address), detect where its current value resides
[Figure: all nodes snoop the request for X; the owning cache signals a hit]

Conventional Coherence: not Power-Aware or Bandwidth-Effective
• All L2 tags see all accesses
• Performance & complexity: we have the L2 tags, so why not use them
• Power: all L2 tags consume power on all accesses
• Bandwidth: all coherent requests are broadcast
[Figure: every request is broadcast and looked up in every remote L2, even when all remote lookups miss]

RegionScout Motivation: Sharing is Coarse
• Region: a large, contiguous, power-of-2-sized, aligned memory area
• When CPU X asks for a data block in Region R, frequently:
  1. No one else has X
  2. No one else has any block in R
• RegionScout exploits this behavior as a layered extension over snoop coherence
[Figure: typical memory-space snapshot, addresses colored by owner(s)]

Optimization Opportunities
• Power and bandwidth
  – Originating node: avoid asking the others
  – Remote nodes: avoid the tag lookup
[Figure: CPUs with I$/D$ connected through a switch to memory]

Potential: Region Miss Frequency
[Figure: global Region misses as a percentage of all requests vs. Region size (256B to 16KB) for 4- and 8-processor systems with 512KB and 1MB caches (p4.512K, p4.1M, p8.512K, p8.1M); higher is better]
• Even with a 16KB Region, ~45% of requests miss in all remote nodes

RegionScout at Work: Non-Shared Region Discovery
• The first request detects a non-shared Region
[Figure: (1) the requester broadcasts; (2) each remote node signals a Region Miss; (3) the combined Global Region Miss is recorded at the requester. Each node keeps a record of non-shared Regions and a record of locally cached Regions]

RegionScout at Work: Avoiding Snoops
• A subsequent request avoids snoops
[Figure: (1) the requester finds the Region in its non-shared-Regions record; (2) the request goes directly to memory without a broadcast]

RegionScout is Self-Correcting
• A request from another node invalidates the non-shared record
[Figure: a remote node's request to the Region is observed on the interconnect and the stale non-shared-Region entry is invalidated]

Implementation: Requirements
• The requesting node provides an address, split into a Region tag and an offset of lg(Region size) bits
• At the originating node (from the CPU): have I discovered that this Region is not shared?
• At remote nodes (from the interconnect): do I have a block in the Region?

Remembering Non-Shared Regions
• The Non-Shared Region Table (NSRT) records non-shared Regions
• It is looked up by the Region portion of the address prior to issuing a request
• It snoops other nodes' requests and invalidates matching entries
• Few entries: 16 sets x 4 ways in most experiments
[Figure: the address splits into Region tag and offset; each NSRT entry holds a valid bit and a Region tag]
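A minimal sketch of the NSRT behavior described above: lookup before issuing a request, allocation on a global Region miss, and invalidation when another node touches the Region. The 16x4 geometry follows the slide; the region size, set mapping, and replacement policy are assumptions.

```python
class NonSharedRegionTable:
    """NSRT sketch: remembers Regions this node discovered to be non-shared.
    16 sets x 4 ways as in most experiments; 1KB Regions assumed."""

    def __init__(self, region_bits=10, num_sets=16, ways=4):
        self.region_bits = region_bits             # lg(Region size)
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each entry is a Region tag

    def _region(self, addr):
        return addr >> self.region_bits

    def is_non_shared(self, addr):
        """Checked before issuing a request: can the broadcast be skipped?"""
        region = self._region(addr)
        return region in self.sets[region % self.num_sets]

    def record_global_region_miss(self, addr):
        """All remote nodes signalled a Region miss: remember the Region."""
        region = self._region(addr)
        s = self.sets[region % self.num_sets]
        if region not in s:
            if len(s) >= self.ways:
                s.pop(0)                           # FIFO replacement (assumed)
            s.append(region)

    def snoop_invalidate(self, addr):
        """Another node requested a block in this Region: it may now be shared."""
        region = self._region(addr)
        s = self.sets[region % self.num_sets]
        if region in s:
            s.remove(region)
```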

What Regions are Locally Cached?
• If we had as many counters as Regions:
  – Block allocation: counter[region]++
  – Block eviction: counter[region]--
  – A Region is cached only if counter[region] is non-zero
• Not practical: e.g., 16KB Regions and 4GB of memory require 256K counters
[Figure: the Region tag portion of the address indexes a counter]

What Regions are Locally Cached?
• Use a few counters; imprecise:
  – Records a superset of the locally cached Regions
  – False positives: a lost opportunity, but correctness is preserved
• Cached Region Hash (CRH): the Region portion of the address is hashed to a counter
  – Counter: incremented on block allocation, decremented on block eviction
  – Few entries, e.g., 256
  – P-bit: set to 1 if the counter is non-zero; used for lookups
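A sketch of the CRH as described above: a small counter array indexed by a hash of the Region, updated on block allocation and eviction, with the non-zero test playing the role of the P-bit. The entry count follows the slide's example; the hash is an assumption.

```python
class CachedRegionHash:
    """CRH sketch: a few counters track a superset of the locally cached Regions.
    A false positive only costs a lost filtering opportunity, never correctness."""

    def __init__(self, entries=256, region_bits=10):
        self.counters = [0] * entries
        self.entries = entries
        self.region_bits = region_bits

    def _index(self, addr):
        return (addr >> self.region_bits) % self.entries   # simple hash (assumed)

    def on_block_allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_evict(self, addr):
        self.counters[self._index(addr)] -= 1

    def region_may_be_cached(self, addr):
        """The P-bit: a non-zero counter means the Region may be cached here,
        so this node must not report a Region miss for the snooped request."""
        return self.counters[self._index(addr)] != 0
```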

Roadmap
• Conventional coherence
• Program behavior: Region miss frequency
• RegionScout
• Evaluation
• Summary

Evaluation Overview
• Methodology
• Filter rates
  – Practical filters can capture many Region misses
• Interconnect bandwidth reduction

Methodology
• In-house simulator based on SimpleScalar
  – Execution driven; all instructions simulated
  – MIPS-like ISA
  – System calls faked by passing them to the host OS
  – Synchronization using load-linked/store-conditional
  – Simple in-order processors
  – Memory requests complete instantaneously
  – MESI snoop coherence
  – 1- or 2-level memory hierarchy
  – WATTCH power models
• SPLASH-II benchmarks
  – Scientific workloads
  – Feasibility study

Filter Rates
[Figure: identified global Region misses vs. CRH size (256 to 2K entries) for 4- and 8-processor systems with 512KB caches and 4KB or 16KB Regions (p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K); higher is better]
• For a small CRH it is better to use large Regions
• Practical RegionScout filters capture much of the potential

Bandwidth Reduction
[Figure: reduction in messages vs. Region size (2KB to 16KB) for SMP (p4.512K, p8.512K) and CMP (p4.64K, p8.64K) configurations; higher is better]
• Moderate bandwidth savings for SMP (15%-22%)
• More so for CMP (>25%)

Related Work
• RegionScout: technical report, Dec. 2003
• Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
• PST: Ekman, Dahlgren, and Stenström, ISLPED 2002
• Coarse-Grain Coherence: Cantin, Lipasti and Smith, ISCA 2005

Summary
• Exploit program behavior and optimize a frequent case
  – Many requests result in a global Region miss
• RegionScout
  – A practical filter mechanism
  – Dynamically detects would-be Region misses
  – Avoids broadcasts
  – Saves tag-lookup power and interconnect bandwidth
  – Small structures
  – A layered extension over existing mechanisms
  – Invisible to the programmer and the OS

Coarse-Grain Coherence
J. Cantin, M. Lipasti and J. E. Smith
ISCA 2005

Coarse-Grain Coherence
• Exploits the same phenomenon as RegionScout
• The protocol is extended to keep track of region state as well
  – Enables additional optimizations
• Uses an additional region tag array to do so
• Region replacements
  – Must scan to find the region's blocks and evict them

Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors
K. Strauss, X. Shen, J. Torrellas
International Symposium on Computer Architecture (ISCA), June 2006

Predictors and algorithms (ring-specific; each predictor holds a set of addresses the node can supply):

predictor / algorithm | addresses in predictor        | action on positive prediction | action on negative prediction
Subset                | subset of suppliable lines    | snoop                         | forward then snoop
Superset / Con        | superset of suppliable lines  | snoop then forward            | forward
Superset / Agg        | superset of suppliable lines  | forward then snoop            | forward
Exact                 | exactly the suppliable lines  | snoop                         | forward

Predictor implementation
• Subset
  – Associative table: a subset of the addresses that can be supplied by the node
• Superset
  – Bloom filter: a superset of the addresses that can be supplied by the node
  – Associative table (exclude cache): addresses that recently suffered false positives
• Exact
  – Associative table: all addresses that can be supplied by the node
  – Downgrading: if an address has to be evicted from the predictor table, the corresponding line in the node has to be downgraded
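A hedged sketch of how the superset predictor above might combine its Bloom filter with the exclude cache to choose an action for an incoming ring snoop; the data structures, parameter names, and return strings are illustrative, not the paper's implementation.

```python
def superset_predictor_action(addr, bloom_bits, hashes, exclude_cache, aggressive=False):
    """Superset predictor decision (sketch).
    bloom_bits: bit list holding a superset of addresses this node can supply.
    hashes: list of hash functions used to index bloom_bits.
    exclude_cache: set of addresses that recently suffered false positives."""
    positive = all(bloom_bits[h(addr) % len(bloom_bits)] for h in hashes)
    if not positive or addr in exclude_cache:
        # Negative prediction: this node cannot supply the line; just pass it on.
        return "forward"
    # Positive prediction: the conservative variant snoops before forwarding
    # (saves messages), the aggressive one forwards first (hides latency).
    return "forward then snoop" if aggressive else "snoop then forward"
```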

Design and Implementation of the Blue Gene/P Snoop Filter

Valentina Salapura, Matthias Blumrich, Alan Gara

Int'l Symposium on High-Performance Computer Architecture (HPCA), 2008

Three Mechanisms
• Stream registers
  – Track contiguous data areas
  – Adaptive: cover arbitrarily sized contiguous regions with a single register
  – Track strided and sequential streams
• Snoop caches
  – A cache of recently executed snoop requests
  – Multiple requests to the same line do not have to cause multiple snoop lookups
  – Track locality
• Range filter
  – Identifies regions of known non-shared data
  – Configured by software

Stream Registers
• Base = where the block stream starts
• Mask = which address bits are common to all blocks seen so far
  – Example: base 0111 with mask 1101 means any address matching 01X1 may be in the cache
• Over time the mask tends toward all zeros; how to reset?
• Cache wrap
  – Each set uses round-robin replacement
  – Count replacements per set
  – A cache wrap has occurred when all counters exceed the number of ways
  – On a wrap, copy all stream registers to a history set and filter using the combination
  – On the next wrap, throw out the history
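A minimal sketch of a single stream register's base/mask update and lookup; the merge rule reproduces the slide's example, while the register width and the surrounding register-selection and cache-wrap logic are omitted or assumed.

```python
class StreamRegister:
    """One stream register (sketch).
    base: a block address seen so far; mask: bits common to all addresses added.
    Slide example: base 0b0111 with mask 0b1101 matches the pattern 01X1."""

    def __init__(self, width=32):
        self.valid = False
        self.base = 0
        self.mask = (1 << width) - 1      # initially every bit is significant

    def add(self, block_addr):
        """Update on a cache-line allocation covered by this register."""
        if not self.valid:
            self.base, self.valid = block_addr, True
        else:
            # Clear mask bits where the new address differs from the base.
            self.mask &= ~(self.base ^ block_addr)

    def may_contain(self, block_addr):
        """Snoop check: does the address match the base on all masked bits?
        False means the line is definitely not covered by this register."""
        return self.valid and (block_addr & self.mask) == (self.base & self.mask)
```

For instance, starting from base 0b0111 and adding 0b0101 clears bit 1 of the mask, yielding mask 0b1101, which is exactly the slide's 01X1 example.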

Stream Registers: An Example
• Direct-mapped cache with two blocks and two stream registers
[Figure: over time the cache holds 001; then 001 and 011; then 101 and 011; then 101 and 111. The stream registers evolve from 001/111 to 001/1X1, and from 101/111 to 101/1X1]
• At this point the filter reports that the cache may contain:
  – 001 and 011
  – 101 and 111
• The first two are no longer there
• Eventually the filter becomes saturated and cannot filter much
• How can we get rid of the stale 001/1X1 register?

Avoiding Saturation: Exploiting Cache Wraps
[Figure: on a cache wrap, the current stream registers (e.g., 001/1X1) are copied to a shadow set and the active registers are cleared; the active set is rebuilt from subsequent allocations (101/111, then 101/1X1) while the shadow continues to filter. After the next wrap the shadow can be discarded]

Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee
ASPLOS 2008

Software-Hardware Hybrid
• Software directs the hardware what to do
  – Mechanisms very similar to Jetty and RegionScout
• The paper incorrectly states that:
  – Jetty does not work for CMPs (it just does not work well for small-scale CMPs)
  – RegionScout works only for buses (it is actually interconnect agnostic)

RegionTracker: A Framework for Coarse-Grain Optimizations in the On-chip Memory Hierarchy
Jason Zebchuk, Elham Safi and Andreas Moshovos
Int'l Symposium on Microarchitecture (MICRO), 2007

Future Caches: Just Larger?
[Figure: a CMP with CPUs and their I$/D$ caches, an interconnect, a large on-chip cache of 10s-100s of MB, and main memory]
1. "Big picture" management
2. Store metadata


Conventional Block-Centric Cache
• "Small" blocks
  – Optimize bandwidth and performance
  – Especially for large L2/L3 caches
• Fine-grain view of memory: the big picture is lost
[Figure: an L2 cache holding scattered individual blocks]


"Big Picture" View
• Region: a 2^n-sized, aligned area of memory
• Patterns and behavior are exposed
  – Spatial locality
• Exploit them for performance, area, and power
[Figure: a coarse-grain view of memory in the L2 cache]


Exploiting Coarse-Grain Patterns
• Many existing coarse-grain optimizations add new structures to track coarse-grain information:
  – Stealth Prefetching
  – Run-time Adaptive Cache Hierarchy Management via Reference Analysis
  – Destination-Set Prediction
  – Spatial Memory Streaming
  – Coarse-Grain Coherence Tracking
  – RegionScout
  – Circuit-Switched Coherence
  – Virtual Tree Coherence
  – Power-Efficient DRAM Speculation
• Each dedicated structure is hard to justify for a commercial design
• Coarse-grain framework instead:
  – Embed coarse-grain information in the tag array
  – Support many different optimizations with less area overhead
  – An adaptable optimization FRAMEWORK


RegionTracker Solution
• Manage blocks, but also track and manage regions
[Figure: RegionTracker replaces the L2 tag array; the L1s send block requests, RegionTracker also answers region probes with region responses, and the data array continues to supply data blocks]


RegionTracker Summary
• Replace the conventional tag array:
  – 4-core CMP with an 8MB shared L2 cache
  – Within 1% of the original performance
  – Up to 20% less tag area
  – On average 33% less energy consumption
• Optimization framework:
  – Stealth Prefetching: same performance, 36% less area
  – RegionScout: 2x more snoops avoided, no area overhead


Road Map

• Introduction

• Goals

• Coarse-Grain Cache Designs

• RegionTracker: A Tag Array Replacement

• RegionTracker: An Optimization Framework

• Conclusion


Goals
1. Conventional tag array functionality
  – Identify data block location and state
  – Leave the data array unchanged
2. Optimization framework functionality
  – Is Region X cached?
  – Which blocks of Region X are cached, and where?
  – Evict or migrate Region X
  – Easy to assign properties to each Region


Coarse-Grain Cache Designs
• Large block size: increased bandwidth, decreased hit rates
[Figure: tag array and data array with large blocks; Region X occupies a single large block]


Sector Cache
• Decreased hit rates
[Figure: sector-cache tag array and data array; Region X maps to a single sector]


Sector Pool Cache
• Requires high associativity (2 to 4 times higher)
[Figure: sector pool cache tag array and data array holding Region X]


Decoupled Sector Cache
• Region information is not exposed
• Region replacement requires scanning multiple entries
[Figure: tag array, status table, and data array; Region X's blocks are spread across entries]


Design Requirements
• Small block size (64B)
• Miss rate does not increase
• Lookup associativity does not increase
• No additional access latency
  – i.e., no scanning, no multiple block evictions
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fits in the conventional tag array "envelope"


RegionTracker: A Tag Array Replacement
• Three SRAM arrays, combined smaller than a conventional tag array:
  – Region Vector Array (RVA)
  – Block Status Table (BST)
  – Evicted Region Buffer (ERB)
[Figure: the three arrays sit between the L1s and the L2 data array]


Common Case: Hit
[Figure: the address is split into Region Tag (bits 49..21), RVA Index (bits 20..10), Region Offset (bits 9..6), and Block Offset (bits 5..0). The matching Region Vector Array (RVA) entry holds the Region tag plus, for each of the 16 blocks in the region, a valid bit and a way pointer; together with the offsets these form the data array and Block Status Table (BST) index, and the BST supplies the block status]
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
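A sketch of the common-case lookup for the slide's example configuration (8MB, 16-way, 64-byte blocks, 1KB regions, so 16 blocks per region); the field widths follow the figure, while the RVA organization is modeled loosely as a dictionary for illustration.

```python
BLOCK_OFFSET_BITS  = 6      # 64-byte blocks
REGION_OFFSET_BITS = 4      # 16 blocks per 1KB region
RVA_INDEX_BITS     = 11     # per the slide's bit boundaries (bits 20..10)

def rva_lookup(addr, rva):
    """Common-case RVA hit (sketch). `rva` is modeled as a dict mapping
    (rva_index, region_tag) -> list of 16 (valid, way) pairs, one per block."""
    region_offset = (addr >> BLOCK_OFFSET_BITS) & ((1 << REGION_OFFSET_BITS) - 1)
    region_addr   = addr >> (BLOCK_OFFSET_BITS + REGION_OFFSET_BITS)
    rva_index     = region_addr & ((1 << RVA_INDEX_BITS) - 1)
    region_tag    = region_addr >> RVA_INDEX_BITS

    entry = rva.get((rva_index, region_tag))
    if entry is None:
        return None                     # region miss: the rare worst case
    valid, way = entry[region_offset]
    if not valid:
        return None                     # region present, block not cached
    # (rva_index, region_offset, way) selects the data-array frame and BST entry.
    return rva_index, region_offset, way
```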


Worst Case (Rare): Region Miss
[Figure: the address finds no matching Region tag in the RVA; a victim region is moved to the Evicted Region Buffer (ERB), whose pointer and the BST entries locate the region's blocks so they can be evicted while the new region is installed]

Methodology
• Flexus simulator from the CMU SimFlex group
  – Based on the Simics full-system simulator
• 4-core CMP modeled after Piranha
  – Private 32KB, 4-way set-associative L1 caches
  – Shared 8MB, 16-way set-associative L2 cache
  – 64-byte blocks
• Miss rates: functional simulation of 2 billion instructions per core
• Performance and energy: timing simulation using the SMARTS sampling methodology
• Area and power: full-custom implementation in a 130nm commercial technology
• 9 commercial workloads:
  – WEB: SpecWEB on Apache and Zeus
  – OLTP: TPC-C on DB2 and Oracle
  – DSS: 5 TPC-H queries on DB2
[Figure: the modeled CMP; four cores with private I$/D$ share the L2 over the interconnect]

Miss-Rates vs. Area
• Sector Cache: 512KB sectors; Sector Pool Cache (SPC) and RegionTracker (RT): 1KB regions
• Trade-offs comparable to a conventional cache
[Figure: relative miss rate vs. relative tag array area for the Sector Pool Cache, RegionTracker, and conventional tags at various associativities (14-way, 15-way, 48-way, 52-way); the Sector Cache lies off the chart at (0.25, 1.26); lower and further left is better]

Performance & Energy
• 12-way set-associative RegionTracker: 20% less area
• Error bars: 95% confidence intervals
• Performance within 1%, with a 33% tag energy reduction
[Figure: normalized execution time (left) and reduction in tag energy (right) for the WEB, OLTP, and DSS workloads]

Road Map

• Introduction

• Goals

• Coarse-Grain Cache Designs

• RegionTracker: A Tag Array Replacement

• RegionTracker: An Optimization Framework

• Conclusion


RegionTracker: An Optimization Framework
[Figure: the RVA, BST, and ERB sit between the L1s and the data array and serve as the optimization framework]
• Stealth Prefetching: average 20% performance improvement; dropping in RegionTracker gives 36% less area overhead
• RegionScout: in-depth analysis follows

Snoop Coherence: Common Case
[Figure: a CPU reads x, x+1, x+2, ..., x+n; each read is broadcast and misses in all remote caches]
• Many snoops are to non-shared regions

RegionScout
• Eliminate broadcasts for non-shared regions
[Figure: a read of x misses in every remote node; the individual Region Misses combine into a Global Region Miss that is recorded in the requester's non-shared-Regions record, while each node tracks its locally cached Regions]

RegionTracker Implementation
• Minimal overhead to support the RegionScout optimization
• Still uses less area than a conventional tag array
• Non-shared Regions: add 1 bit to each RVA entry
• Locally cached Regions: already provided by the RVA
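A sketch of how RegionScout state could piggyback on an RVA entry as described above: one added non-shared bit, while "is the Region cached here?" falls out of the existing per-block valid bits. Field names and the entry layout are illustrative.

```python
class RVAEntry:
    """One Region Vector Array entry extended for RegionScout (sketch).
    Besides the region tag and 16 (valid, way) block pointers, a single extra
    bit records that the region was discovered to be non-shared."""

    def __init__(self, region_tag, blocks_per_region=16):
        self.region_tag = region_tag
        self.blocks = [(False, 0)] * blocks_per_region  # (valid, way) per block
        self.non_shared = False          # the one added RegionScout bit

    def region_cached_here(self):
        """The 'locally cached regions' answer RegionScout needs: already
        implied by the RVA's per-block valid bits."""
        return any(valid for valid, _ in self.blocks)

    def record_global_region_miss(self):
        self.non_shared = True           # all remote nodes reported a region miss

    def on_remote_request(self):
        self.non_shared = False          # another node touched the region
```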


RegionTracker + RegionScout
[Figure: harmonic mean reduction in snoop broadcasts for stand-alone RegionScout filters of 7KB, 12KB, and 22KB vs. RegionScout on RegionTracker (RSRT); higher is better. 4 processors, 512KB L2 caches, 1KB regions]
• Avoid 41% of snoop broadcasts with no area overhead compared to a conventional tag array

Result Summary
• Replace the conventional tag array:
  – 20% less tag area
  – 33% less tag energy
  – Within 1% of the original performance
• Coarse-grain optimization framework:
  – 36% reduction in area overhead for Stealth Prefetching
  – Filter 41% of snoop broadcasts with no area overhead compared to a conventional cache