Transcript of "Interactions Between Compression and Prefetching in Chip Multiprocessors" (39 slides)

Page 1

Interactions Between Compression and Prefetching in Chip Multiprocessors

Alaa R. Alameldeen* and David A. Wood

Intel Corporation / University of Wisconsin-Madison

*Work largely done while at UW-Madison

Page 2

Overview: Addressing the Memory Wall in CMPs

Compression
+ Cache Compression: increases effective cache size
– Increases latency, uses additional cache tags
+ Link Compression: increases effective pin bandwidth

Hardware Prefetching
+ Decreases average cache latency
– Increases bandwidth demand → contention
– Increases working set size → cache pollution

Contribution #1: Adaptive Prefetching
– Uses additional tags (like cache compression)
– Avoids useless and harmful prefetches

Contribution #2: Prefetching and compression interact positively
– e.g., link compression absorbs prefetching's added bandwidth demand

Page 3

Outline

Overview

Compressed CMP Design

– Compressed Cache Hierarchy
– Link Compression

Adaptive Prefetching

Evaluation

Conclusions

Page 4

Compressed Cache Hierarchy

[Figure: Processors 1..n, each with an uncompressed L1 cache, share a compressed L2 cache. Compression/decompression logic sits between the L1s and the L2, with a bypass path for uncompressed lines. A memory controller, which could also compress/decompress, connects to/from other chips and memory.]

Page 5

Cache Compression: Decoupled Variable-Segment Cache (ISCA 2004)

Example cache set

Tag Area | Data Area
Addr B | compressed, size 2
Addr A | uncompressed, size 3
Addr C | compressed, size 6
Addr D | compressed, size 4

Each tag records the compression status and the compressed size; a tag can be present while its line isn't.
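In software terms, the decoupling means a set keeps more address tags than its data area can always back: variable-size compressed lines consume data segments until the per-set budget runs out, so a lookup can hit in the tag array while the data is absent. A minimal sketch in Python (the class names and the 16-segment budget are illustrative assumptions, not the paper's exact parameters):

```python
# Sketch of one set in a decoupled variable-segment cache: more tags
# than data segments, with lines occupying a variable number of segments.
SEGMENTS_PER_SET = 16  # assumed data-area budget per set

class TagEntry:
    def __init__(self, addr, compressed, size):
        self.addr = addr              # address tag
        self.compressed = compressed  # compression status
        self.size = size              # size in data segments
        self.data_present = False     # a tag can exist without its data

class CacheSet:
    def __init__(self, tags):
        self.tags = tags
        # Allocate data segments in tag order until the budget runs out.
        used = 0
        for t in self.tags:
            if used + t.size <= SEGMENTS_PER_SET:
                t.data_present = True
                used += t.size

    def lookup(self, addr):
        for t in self.tags:
            if t.addr == addr:
                if t.data_present:
                    return "hit"
                return "tag-hit, data miss"  # "tag is present but line isn't"
        return "miss"

# The set from the slide: B(2), A(3), C(6), D(4) = 15 segments; all fit.
example_set = CacheSet([TagEntry("B", True, 2), TagEntry("A", False, 3),
                        TagEntry("C", True, 6), TagEntry("D", True, 4)])
```

A fifth line that overflowed the segment budget would keep its tag resident with `data_present` false, which is the "tag is present but line isn't" case on the slide; adaptive prefetching, later in the talk, reuses exactly these extra tags.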

Page 6

Link Compression

The on-chip memory controller transfers compressed messages.

Data messages:
– 1-8 sub-messages (flits), 8 bytes each

The off-chip memory controller combines flits and stores them to memory.

[Figure: CMP containing the processors/L1 caches, the L2 cache, and an on-chip memory controller; compressed flits travel between it and the off-chip memory controller, to/from memory.]
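The flit arithmetic can be sketched directly; the helper names below are illustrative, and zero-padding of the final flit is an assumption:

```python
import math

FLIT_BYTES = 8  # each sub-message (flit) carries 8 bytes

def flits_needed(payload_bytes):
    """Flits for a data message: 1-8 for payloads up to 64 bytes."""
    assert 1 <= payload_bytes <= 64
    return math.ceil(payload_bytes / FLIT_BYTES)

def split_into_flits(payload):
    """Split a compressed payload into 8-byte flits (last one padded)."""
    padded = payload + b"\x00" * (-len(payload) % FLIT_BYTES)
    return [padded[i:i + FLIT_BYTES] for i in range(0, len(padded), FLIT_BYTES)]
```

A 64-byte line compressed to 20 bytes crosses the link in 3 flits instead of 8, which is where the effective pin-bandwidth gain comes from.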

Page 7

Outline

Overview

Compressed CMP Design

Adaptive Prefetching

Evaluation

Conclusions

Page 8

Hardware Stride-Based Prefetching

Evaluation based on IBM Power 4 implementation

Stride-based prefetching

+ L1 prefetching hides L2 latency
+ L2 prefetching hides memory latency
– Increases working set size → cache pollution
– Increases on-chip & off-chip bandwidth demand → contention

Prefetching taxonomy [Srinivasan et al. 2004]:
– Useless prefetches increase traffic but not misses
– Harmful prefetches replace useful cache lines
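The stride mechanism can be illustrated with a small per-PC table; this is a generic stride-detection sketch, not the Power 4 implementation the talk evaluates (table management, the confirmation policy, and the prefetch degree are all assumptions):

```python
class StridePrefetcher:
    """Minimal stride-detection sketch: track the last address and stride
    per PC; after two consistent strides, issue prefetches ahead of the
    demand stream."""
    def __init__(self, degree=2):
        self.table = {}       # pc -> (last_addr, stride, confirmed)
        self.degree = degree  # prefetches issued per trigger (assumed)

    def access(self, pc, addr):
        prefetches = []
        if pc in self.table:
            last, stride, confirmed = self.table[pc]
            new_stride = addr - last
            if new_stride == stride and stride != 0:
                # Stride confirmed: prefetch `degree` lines ahead.
                prefetches = [addr + stride * (i + 1)
                              for i in range(self.degree)]
                self.table[pc] = (addr, stride, True)
            else:
                # Train on the newly observed stride.
                self.table[pc] = (addr, new_stride, False)
        else:
            self.table[pc] = (addr, 0, False)
        return prefetches
```

The first two accesses from a PC only train the table; the third access with a matching stride triggers prefetches.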

Page 9

Prefetching in CMPs

[Figure: Speedup (%) of Zeus, 0-200, for 1p, 2p, 4p, 8p, and 16p systems, comparing No PF, Stride-Based PF, and Adaptive PF.]

Page 10

Prefetching in CMPs

[Figure: same Zeus speedup chart, annotated: prefetching hurts performance.]

Can we avoid useless & harmful prefetches?

Page 11

Prefetching in CMPs

[Figure: same Zeus speedup chart.]

Adaptive prefetching avoids hurting performance.

Page 12

Adaptive Prefetching

Tag Area | Data Area
Addr B | compressed, size 2
Addr A | uncompressed, size 3
Addr C | compressed, size 6
Addr D | compressed, size 4

Each tag records the compression status and the compressed size; a tag can be present while its line isn't.
PF bit: set on prefetched-line allocation, reset on access.

How do we identify useful, useless, and harmful prefetches?

Page 13

Useful Prefetch

[Cache set: Addr B (compressed, 2), Addr A (uncompressed, 3), Addr C (compressed, 6), Addr D (compressed, 4)]

Read/Write Address B: PF bit set → the prefetch was used before replacement. The access resets the PF bit.

Page 14

Useless Prefetch

[Cache set: Addr B (compressed, 2), Addr A (uncompressed, 3), Addr C (compressed, 6), Addr D (compressed, 4); Addr E (compressed, 6) arrives]

Replace Address C with Address E: PF bit set → the prefetch went unused before replacement. The prefetch did not change the number of misses; it only increased traffic.

Page 15

Harmful Prefetch

[Cache set: Addr B (compressed, 2), Addr A (uncompressed, 3), Addr E (compressed, 6), Addr D (compressed, 4)]

Read/Write Address D: the tag is found, so this could have been a hit. Heuristic: if the PF bit of another line is set → avoidable miss; the prefetch(es) incurred an extra miss.

Page 16

Adaptive Prefetching: Summary

Uses extra compression tags & a PF bit per tag
– PF bit set when a prefetched line is allocated, reset on access
– Extra tags still maintain the address tag and status

Uses a counter for the number of startup prefetches per cache: 0 – MAX
– Increment for useful prefetches
– Decrement for useless & harmful prefetches

Classification of prefetches:
– Useful prefetch: PF bit set on a cache hit
– Useless prefetch: replacing a line whose PF bit is set
– Harmful prefetch: a cache miss matches an invalid line's tag, and a valid line's tag has its PF bit set
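The saturating-counter mechanism in the summary can be sketched as follows; MAX and the starting value are illustrative assumptions, not the talk's parameters:

```python
MAX_PREFETCHES = 8  # assumed counter ceiling; the actual MAX may differ

class AdaptiveController:
    """Per-cache saturating counter for startup prefetches, adjusted by
    the classification of completed prefetches."""
    def __init__(self):
        self.counter = MAX_PREFETCHES // 2  # assumed starting point

    def on_useful(self):
        # PF bit was set on a cache hit.
        self.counter = min(MAX_PREFETCHES, self.counter + 1)

    def on_useless(self):
        # Replaced a line whose PF bit was still set.
        self.counter = max(0, self.counter - 1)

    def on_harmful(self):
        # Miss matched an invalid line's tag while a valid line had PF set.
        self.counter = max(0, self.counter - 1)

    def prefetches_to_issue(self):
        return self.counter
```

With mostly useful prefetches the counter saturates at MAX and prefetching runs at full degree; with mostly useless or harmful prefetches it decays to zero and prefetching is effectively turned off.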

Page 17

Outline

Overview

Compressed CMP Design

Adaptive Prefetching

Evaluation

– Workloads and System Configuration
– Performance
– Compression and Prefetching Interactions

Conclusions

Page 18

Simulation Setup

Workloads:

– Commercial workloads [Computer'03, CAECW'02]:
  • OLTP: IBM DB2 running a TPC-C-like workload
  • SPECjbb
  • Static web serving: Apache and Zeus
– SPEComp benchmarks: apsi, art, fma3d, mgrid

Simulator:
– Simics full-system simulator, augmented with the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) [Martin et al., 2005, http://www.cs.wisc.edu/gems/]

Page 19

System Configuration: 8-core CMP

Cores: single-threaded, out-of-order superscalar; 11-stage pipeline, 64-entry IW, 128-entry ROB, 5 GHz clock frequency

L1 Caches: 64 KB instruction, 64 KB data, 4-way set-associative, 320 GB/s total on-chip bandwidth (to/from L1), 3-cycle latency

Shared L2 Cache: 4 MB, 8-way set-associative (uncompressed), 15-cycle uncompressed latency

Memory: 400-cycle access latency, 20 GB/s memory bandwidth

Prefetching:
– 8 unit/negative/non-unit stride streams for L1 and L2 per processor
– Issue up to 6 L1 prefetches on an L1 miss
– Issue up to 25 L2 prefetches on an L2 miss

Page 20

Performance

[Figure: Speedup (%), 0-160, comparing No PF or Compr, Always PF, Compr, and Both.]

Page 21

Performance

[Figure: same speedup chart as the previous slide.]

Page 22

Performance

[Figure: same speedup chart as the previous slide.]

Page 23

Performance

[Figure: same speedup chart as the previous slide.]

Significant improvements when combining prefetching and compression.

Page 24

Performance: Adaptive Prefetching

[Figure: Speedup (%), 0-160, comparing No PF, Always PF, and Adaptive PF.]

Adaptive PF avoids significant performance losses.

Page 25

Compression and Prefetching Interactions

Positive interactions:
– L1 prefetching hides part of the decompression overhead
– Link compression reduces the extra bandwidth demand caused by prefetching
– Cache compression increases effective L2 size, while L2 prefetching increases the working set size

Negative interactions:
– L2 prefetching and L2 compression can eliminate the same misses

Is the overall interaction positive or negative?

Page 26

Interactions Terminology

Assume a base system S with two architectural enhancements A and B; all systems run program P.

Speedup(A) = Runtime(P, S) / Runtime(P, A)
Speedup(B) = Runtime(P, S) / Runtime(P, B)
Speedup(A, B) = Speedup(A) × Speedup(B) × (1 + Interaction(A, B))
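Rearranging the last equation gives Interaction(A, B) = Speedup(A, B) / (Speedup(A) × Speedup(B)) - 1, which is easy to compute from runtimes:

```python
def speedup(runtime_base, runtime_enhanced):
    return runtime_base / runtime_enhanced

def interaction(rt_S, rt_A, rt_B, rt_AB):
    """Interaction(A, B) solved from
    Speedup(A, B) = Speedup(A) * Speedup(B) * (1 + Interaction(A, B))."""
    sp_A = speedup(rt_S, rt_A)
    sp_B = speedup(rt_S, rt_B)
    sp_AB = speedup(rt_S, rt_AB)
    return sp_AB / (sp_A * sp_B) - 1

# Hypothetical runtimes: 100s base, 80s with A alone, 80s with B alone,
# 62.5s with both. Each alone gives 1.25x, together 1.6x:
# 1.6 / (1.25 * 1.25) - 1 = 0.024, i.e. a positive interaction of 2.4%.
```

A positive value means the two enhancements together beat the product of their individual speedups; a negative value means they partly duplicate each other's benefit.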

Page 27

Interaction Between Prefetching and Compression

[Figure: Interaction (%), roughly -5 to 25, for apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid.]

Interaction is positive for all benchmarks except apsi.

Page 28

Conclusions

Compression can help CMPs
– Cache compression improves CMP performance
– Link compression reduces pin bandwidth demand

Hardware prefetching can hurt performance
– Increases cache pollution
– Increases bandwidth demand

Contribution #1: Adaptive prefetching reduces useless & harmful prefetches
– Gets most of the benefit of prefetching
– Avoids large performance losses

Contribution #2: Compression & prefetching interact positively

Page 29

Google “Wisconsin GEMS”

Works with Simics (free to academics) and commercial OS/apps

SPARC out-of-order, x86 in-order, CMPs, SMPs, & LogTM

GPL release of GEMS used in four HPCA 2007 papers

[Figure: GEMS structure: the Opal detailed processor model on top of Simics, driven by microbenchmarks, a random tester, deterministic contended locks, and trace files.]

Page 30

Backup Slides

Page 31

Frequent Pattern Compression (FPC)

A significance-based compression algorithm
– Compresses each 32-bit word separately
– Suitable for short (32-256 byte) cache lines
– Compressible patterns: zeros; sign-extended 4, 8, and 16 bits; zero-padded half-word; two sign-extended half-words; repeated byte
– A 64-byte line is decompressed in a five-stage pipeline

More details in Alaa's thesis.
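The pattern set can be sketched as a classifier over a 32-bit word. The tie-breaking order and the orientation of the zero-padded half-word are assumptions here, and the prefix-bit encoding FPC emits is omitted:

```python
def fpc_pattern(word):
    """Classify a 32-bit word into one of the slide's compressible
    patterns, or None if incompressible."""
    w = word & 0xFFFFFFFF

    def sign_extends(bits):
        # True if w equals the 32-bit sign-extension of its low `bits` bits.
        lo = w & ((1 << bits) - 1)
        sext = lo - (1 << bits) if lo >> (bits - 1) else lo
        return (sext & 0xFFFFFFFF) == w

    def half_sign_extends(h):
        # True if a 16-bit half equals the sign-extension of its low byte.
        b = h & 0xFF
        s = b - 0x100 if b & 0x80 else b
        return (s & 0xFFFF) == h

    if w == 0:
        return "zeros"
    if sign_extends(4):
        return "sign-extended 4 bits"
    if sign_extends(8):
        return "sign-extended 8 bits"
    if sign_extends(16):
        return "sign-extended 16 bits"
    if w & 0xFFFF == 0:
        return "zero-padded half-word"  # assumed: low half is the padding
    if half_sign_extends(w >> 16) and half_sign_extends(w & 0xFFFF):
        return "two sign-extended half-words"
    if all(((w >> (8 * i)) & 0xFF) == (w & 0xFF) for i in range(4)):
        return "repeated byte"
    return None  # incompressible: stored as-is
```

Each matched pattern lets the word be stored in fewer bits than 32, which is where the compressed sizes in the variable-segment cache come from.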

Page 32

Positive and Negative Interactions

[Figure: Speedup, 0-8, for Base, Base+A, Base+B, and Base+A+B under no interaction, positive interaction, and negative interaction.]

Page 33

Interaction Between Adaptive Prefetching and Compression

[Figure: Interaction (%), roughly -5 to 20, for apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid.]

Less bandwidth impact; the interaction is slightly negative (except for art and mgrid).

Page 34

Positive Interaction: Pin Bandwidth

Compression saves the bandwidth consumed by prefetching.

Page 35

Negative Interaction: Avoided Misses

A small fraction of misses is avoided by both.

Page 36

Sensitivity to Processor Count

Page 37

Compression Speedup

Page 38

Compression Bandwidth Demand

Page 39

Commercial Workloads & Adaptive Prefetching