Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Somayeh Sardashti and Professor David A. Wood, University of Wisconsin-Madison

Maximize LLC effective capacity to reduce system energy!
Compressed caching: compressing and compacting cache blocks.

Limits of previous work:
- Limited number of tags
- Internal fragmentation
- Energy-expensive compactions

Potentials:
+ Higher effective cache size
+ Low area overhead
+ Higher system performance
+ Lower system energy

A limited number of tags and internal fragmentation leave most of this potential unrealized: the LLC is potentially 3.9X larger under compression, yet VSC-2X achieves only 1.6X effective LLC capacity, and its energy-expensive compactions drive up LLC dynamic energy.

Most workloads exhibit spatial locality.

• DCC exploits spatial locality to improve compression effectiveness:
  • It uses decoupled super-blocking to track more blocks with low area overhead.
  • It compresses a block and allocates it into non-contiguous data sub-blocks (see the sketch below).
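A minimal behavioral sketch of these two ideas, in Python with hypothetical names and sizes (a 64B block split into four 16B sub-blocks; the poster does not give this code): one set keeps decoupled super-block tags and hands out whichever data sub-blocks happen to be free.

```python
import math

SUB_BLOCK_BYTES = 16            # assumed: 64B blocks split into four 16B sub-blocks
DATA_SUB_BLOCKS_PER_SET = 64    # hypothetical data capacity of one set

class SuperBlockTag:
    """One decoupled tag: a single address tag shared by the blocks of a
    super-block, plus per-block (valid, compressed) state."""
    def __init__(self, sb_tag, blocks_per_super_block=4):
        self.sb_tag = sb_tag
        self.block_state = [None] * blocks_per_super_block

class DccSet:
    def __init__(self, num_tags=16):
        self.tags = [None] * num_tags                          # super-block tag array
        self.free_subs = list(range(DATA_SUB_BLOCKS_PER_SET))  # unused data sub-blocks
        self.back_ptr = {}   # data sub-block index -> (tag slot, block #)

    def allocate(self, sb_tag, blk_num, compressed_bytes):
        """Insert one compressed block; its sub-blocks need not be contiguous."""
        needed = math.ceil(compressed_bytes / SUB_BLOCK_BYTES)
        if len(self.free_subs) < needed:
            raise RuntimeError("set full: a real cache would evict victims here")
        # Reuse the super-block tag if a sibling block already installed it.
        slot = next((i for i, t in enumerate(self.tags)
                     if t is not None and t.sb_tag == sb_tag), None)
        if slot is None:
            slot = self.tags.index(None)   # assumes a free tag; otherwise evict
            self.tags[slot] = SuperBlockTag(sb_tag)
        # Take any free sub-blocks: no compaction, no contiguity requirement.
        subs = [self.free_subs.pop() for _ in range(needed)]
        for s in subs:
            self.back_ptr[s] = (slot, blk_num)
        self.tags[slot].block_state[blk_num] = {"valid": True, "compressed": True,
                                                "num_subs": needed}
        return subs
```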

• Co-DCC (Co-compacted DCC):
  • Co-compacts the blocks of a super-block to reduce internal fragmentation (illustrated below).
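A small worked example of the fragmentation effect Co-DCC targets (the block sizes are made up, and this shows only the rounding arithmetic, not the actual Co-DCC hardware): rounding each block up to whole sub-blocks wastes space that packing the super-block's blocks together and rounding once would not.

```python
import math

SUB = 16  # assumed 16B sub-block granularity

def subs_per_block(sizes):
    """DCC-style: each compressed block rounds up to whole sub-blocks on its own."""
    return sum(math.ceil(s / SUB) for s in sizes)

def subs_co_compacted(sizes):
    """Co-DCC-style idea: the blocks of one super-block are packed together and
    rounded once, so internal fragmentation is paid once per super-block."""
    return math.ceil(sum(sizes) / SUB)

compressed = [20, 20, 20, 20]          # four 20B compressed blocks of one super-block
print(subs_per_block(compressed))      # 8 sub-blocks (128B) -> 48B of fragmentation
print(subs_co_compacted(compressed))   # 5 sub-blocks (80B)  -> no wasted bytes
```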

• We integrate (Co-)DCC with the AMD Bulldozer LLC design; no alignment network is needed.
• We implement the tag-match and sub-block-selection logic in Verilog.

[Figure: DCC set organization. The tag array holds decoupled super-block tags; the example entry Tag P covers the pair of blocks A and C, with per-block state (V,C)/(I,I). The sub-blocked data array stores compressed data in non-contiguous sub-blocks (A0, A1, A2, C), and a sub-blocked back pointer array records, for each data sub-block, the owning tag and block number used for tag match and sub-block selection.]
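The lookup path in the figure can also be sketched behaviorally (hypothetical Python over the DccSet sketch above, not the authors' Verilog): a hit matches the shared super-block tag and checks the per-block state, then selects the block's data by finding the back pointers that name that tag slot and block number.

```python
def lookup(dcc_set, sb_tag, blk_num):
    """Return the sorted data sub-block indices of a block, or None on a miss."""
    # 1) Tag match: one super-block tag covers all blocks of the super-block.
    for slot, tag in enumerate(dcc_set.tags):
        if tag is not None and tag.sb_tag == sb_tag:
            state = tag.block_state[blk_num]
            if state is not None and state["valid"]:
                # 2) Sub-block selection: back pointers identify this block's data.
                subs = [s for s, (t, b) in dcc_set.back_ptr.items()
                        if t == slot and b == blk_num]
                return sorted(subs)
    return None

# Example use with the DccSet sketch above.
s = DccSet()
s.allocate(sb_tag=0x1A2B, blk_num=2, compressed_bytes=20)
print(lookup(s, 0x1A2B, 2))   # -> [62, 63]: the two sub-blocks holding block 2's data
```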

• We model a multicore system with GEMS.
• We use commercial workloads plus applications from SPEC-OMP, PARSEC, and SPEC CPU2006.
• We use CACTI to measure (Co-)DCC power and area.

(Co-)DCC:
• Performs better than a conventional LLC of twice the capacity.
• Boosts system performance by 14% on average (up to 38%).
• Saves system energy by 12% on average (up to 39%).

[Results charts: normalized runtime and normalized system energy (y-axis 0.40-1.00) for FixedC, VSC-2X, 2X Baseline, DCC, and Co-DCC across apache, jbb, oltp, zeus, ammp, applu, equake, mgrid, wupwise, black, canneal, freqmine, mixes m1-m8, and the geometric mean.]


[Figure: normalized LLC area (1-2) vs. normalized effective LLC capacity (1-3) for Baseline, FixedC, VSC-2X, DCC, Co-DCC, and 2X Baseline. Other figure labels: 16B, 64B, 3 bits.]

System configuration:
Cores: Eight OOO cores, 3.2 GHz
L1I$/L1D$: Private, 32 KB, 8-way
L2$: Private, 256 KB, 8-way
L3$: Shared, 8 MB, 16-way, 8 banks
Main Memory: 4 GB, 16 banks, 800 MHz bus frequency, DDR3
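For reference, the same configuration captured as a plain data structure (a sketch only; the field names are mine, not GEMS parameters):

```python
# The poster's simulated system as a dictionary (names are illustrative).
system_config = {
    "cores":       {"count": 8, "type": "OOO", "freq_ghz": 3.2},
    "l1i":         {"scope": "private", "size_kb": 32,  "assoc": 8},
    "l1d":         {"scope": "private", "size_kb": 32,  "assoc": 8},
    "l2":          {"scope": "private", "size_kb": 256, "assoc": 8},
    "l3":          {"scope": "shared",  "size_mb": 8,   "assoc": 16, "banks": 8},
    "main_memory": {"size_gb": 4, "banks": 16, "bus_mhz": 800, "type": "DDR3"},
}
```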

An access to main memory vs. the LLC costs roughly 6X longer latency and 60X higher energy. Why not simply double the LLC? The LLC already occupies 15%-30% of on-chip area (the poster shows an Intel Nehalem die as an example), so doubling its capacity means roughly 2X the LLC area.
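A back-of-the-envelope calculation using the poster's ratios (an LLC access normalized to cost 1; the hit rates below are made up) shows what each extra LLC hit is worth:

```python
# From the poster: main memory costs ~6X the latency and ~60X the energy of an
# LLC access. Everything else below is illustrative.
LLC_LATENCY, MEM_LATENCY = 1.0, 6.0
LLC_ENERGY,  MEM_ENERGY  = 1.0, 60.0

def avg_access_cost(llc_hit_rate):
    """Average (latency, energy) per LLC access; misses also pay the memory cost."""
    miss = 1.0 - llc_hit_rate
    return (llc_hit_rate * LLC_LATENCY + miss * (LLC_LATENCY + MEM_LATENCY),
            llc_hit_rate * LLC_ENERGY  + miss * (LLC_ENERGY  + MEM_ENERGY))

# Raising the effective hit rate (e.g. via larger effective capacity) from 70% to 85%:
print(avg_access_cost(0.70))   # ~(2.8, 19.0)
print(avg_access_cost(0.85))   # ~(1.9, 10.0) -> energy per access nearly halved
```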