Compute Cache - eecs.umich.edu - Application-Specific Archs.pdf · Compute Cache Shaizeen Aga,...

Compute Cache

Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das

Presented by Gefei Zuo and Jiacheng Ma

Agenda

● Motivation
● Goal
● Design
● Evaluation
● Discussion

Motivation

● Data-centric applications
● Applications have a high degree of data parallelism
● Time and energy spent moving data >> time and energy spent on actual computation

Goal

In-place computation in caches

● Massive parallelism: new vector instructions (SIMD)
● Less data movement: new cache organization
● Low area overhead: reuse existing circuits

Design Overview

[Figure: cache hierarchy, cache geometry, and in-place compute]

1. ISA
2. bit-line computing
3. operand locality
4. execution management
5. cache coherence

Design: ISA

Design: bit-line computing

Background: 6-transistor (6T) bit-cell with two stable states

Read:
1. Precharge BL=1, BLB=1
2. Assert word line
3. Sense amplifier detects voltage difference

Write:
1. Drive BL to the desired value and BLB to its complement (BLB = ~BL)
2. Assert word line

Design: bit-line computing cont.

Logical operation:
1. Precharge BL=Vref, BLB=Vref
2. Activate two rows (WLi, WLj)
3. Sensing BL => AND, sensing BLB => NOR
    ○ BL is sensed as ‘1’ only if both activated bits are ‘1’; BLB is sensed as ‘1’ only if both activated bits are ‘0’

XOR = wired-NOR(NOR, AND)
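The sensing behaviour above can be modelled bit by bit. The following is a minimal Python sketch of the logic only, not of the circuit: rows are modelled as integers, and the function names and the 8-column width are illustrative assumptions, not taken from the slides.

```python
WIDTH = 8                 # assumed number of bit-lines (columns) in this toy sub-array
MASK = (1 << WIDTH) - 1   # keeps results within WIDTH bits


def bitline_and(row_i: int, row_j: int) -> int:
    """BL stays high only in columns where both activated cells store 1."""
    return row_i & row_j & MASK


def bitline_nor(row_i: int, row_j: int) -> int:
    """BLB stays high only in columns where both activated cells store 0."""
    return ~(row_i | row_j) & MASK


def bitline_xor(row_i: int, row_j: int) -> int:
    """XOR obtained by NOR-ing the AND and NOR results, as on the slide."""
    and_out = bitline_and(row_i, row_j)
    nor_out = bitline_nor(row_i, row_j)
    return ~(and_out | nor_out) & MASK


if __name__ == "__main__":
    a, b = 0b11001010, 0b10100110
    print(f"AND {bitline_and(a, b):08b}")
    print(f"NOR {bitline_nor(a, b):08b}")
    print(f"XOR {bitline_xor(a, b):08b}")
```

A column is neither "both 1" nor "both 0" exactly when the two bits differ, which is why NOR-ing the two sensed results yields XOR.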

Design: operand locality

Operand locality requirement: operands must physically share the same set of bit-lines

Design choices:

● All ways in a set mapped to the same block partition.

● Use part of the set-index bits to select bank and block position.

At most 12 address bits are needed to guarantee operand locality.
2^12 = 4 KB = page size, so page alignment guarantees operand locality.
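The 12-bit argument can be illustrated with a small check. This is a sketch under a stated assumption: it treats "the low 12 address bits match" as a sufficient condition for operand locality, which is how the slide's page-alignment claim is read here. The bits a real cache uses to pick a bank and block partition depend on its geometry, and the function names are hypothetical.

```python
PAGE_OFFSET_BITS = 12              # from the slide: 2**12 = 4 KB = page size
PAGE_SIZE = 1 << PAGE_OFFSET_BITS


def page_offset(addr: int) -> int:
    """Return the low 12 bits of an address (its offset within a 4 KB page)."""
    return addr & (PAGE_SIZE - 1)


def has_operand_locality(*addrs: int) -> bool:
    """Assumed sufficient condition: every operand has the same page offset."""
    return len({page_offset(a) for a in addrs}) <= 1


if __name__ == "__main__":
    # Three page-aligned operands: all offsets are 0, so the check passes.
    print(has_operand_locality(0x10000, 0x24000, 0x7F000))
    # Operands one cache block apart within the same page: offsets differ.
    print(has_operand_locality(0x10000, 0x10040))
```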

Design: operand locality cont.

When perfect locality cannot be achieved: one additional vector logic unit per cache controller

● High area overhead
    ○ 16MB L3 with 512 sub-arrays = 128 * 64B logic units
● High latency (14 cycles vs 22 cycles)
● High energy consumption (60%~80% of energy spent on H-Tree wire transfer)

Design: execution management

Enhanced cache controller:

● Break each instruction into operations
    ○ each operation's operands span at most a single cache block
● Replicate the key across the block
    ○ for cc_search
● Split an instruction into multiple instructions
    ○ if the operands are too long
    ○ by raising a pipeline exception

[Figure: an instruction is broken into block-sized operations (OP OP OP OP); for cc_search, the key is replicated across the block (key key key key)]
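The controller's two jobs on this slide, breaking an instruction into per-block operations and tiling the search key across a block, can be sketched as follows. This is an illustrative model only: the 64-byte block size, the tuple layout of an operation, and both function names are assumptions, not the paper's interface.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed)


def split_into_block_ops(opcode: str, src_a: int, src_b: int, dest: int, length: int):
    """Break one vector instruction into operations of one cache block each.

    Returns a list of (opcode, src_a, src_b, dest) tuples, one per block.
    """
    if length <= 0 or length % BLOCK_SIZE:
        raise ValueError("length must be a positive multiple of BLOCK_SIZE")
    return [
        (opcode, src_a + off, src_b + off, dest + off)
        for off in range(0, length, BLOCK_SIZE)
    ]


def replicate_key(key: bytes) -> bytes:
    """Tile the search key so that it fills exactly one cache block."""
    if not key or BLOCK_SIZE % len(key):
        raise ValueError("key length must evenly divide BLOCK_SIZE")
    return key * (BLOCK_SIZE // len(key))


if __name__ == "__main__":
    for op in split_into_block_ops("cc_and", 0x1000, 0x2000, 0x3000, 256):
        print(op)
    print(len(replicate_key(b"\x01\x02\x03\x04\x05\x06\x07\x08")))
```

A 256-byte instruction becomes four block-sized operations, matching the "OP OP OP OP" picture, and an 8-byte key is repeated eight times to fill a block.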

Design: operation cache level

Operate at the cache level that:

● is the highest (closest to the core)
● contains all operands

Example:

● cc_and
● operands: A, B, C
    ○ A in L2, B in L2, C in memory
● perform operation in L3
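The level-selection rule can be written as a short function. This is a sketch under two assumptions that the slides do not state explicitly: the hierarchy is inclusive (a block present in L1 is also present in L2 and L3), and an operand that is only in memory is fetched into L3 before the operation runs. The function name and the dictionary input are hypothetical.

```python
LEVELS = ["L1", "L2", "L3"]  # ordered from closest to the core to farthest


def operating_level(locations: dict) -> str:
    """Pick the cache level closest to the core that holds every operand.

    `locations` maps an operand name to the closest level where it currently
    resides: "L1", "L2", "L3", or "memory". Assumes an inclusive hierarchy;
    operands in memory are assumed to be fetched into L3.
    """
    if not locations:
        raise ValueError("at least one operand is required")
    farthest = 0
    for where in locations.values():
        idx = len(LEVELS) - 1 if where == "memory" else LEVELS.index(where)
        farthest = max(farthest, idx)
    return LEVELS[farthest]


if __name__ == "__main__":
    # The slide's example: A and B are in L2, C is still in memory.
    print(operating_level({"A": "L2", "B": "L2", "C": "memory"}))
```

With the slide's example, the operand that is farthest from the core (C, in memory) decides the level, so the operation runs in L3.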

Design: cache coherence & memory model

● No impact on cache coherence
    ○ Cache coherence requests are serviced first
    ○ When a request arrives:
        i. The cache line is unlocked
        ii. The cache line is marked invalid (if required)
        iii. The cache line is re-fetched
● No impact on the memory model
    ○ fence instructions still work with Compute Cache (CC) instructions

Evaluation

● Environment
    ○ SniperSim (simulator) & McPAT (power)
    ○ 8-core CMP with 32 KB L1-D, 256 KB L2, and 16 MB L3

● Benchmarks
    ○ WordCount: cc_search
    ○ StringMatch: cc_cmp
    ○ DB-BitMap: cc_or, cc_and
    ○ Bit Matrix Multiplication: cc_clmul
    ○ Checkpointing: cc_copy

Evaluation

● Delay
    ○ negligible impact on baseline reads/writes
    ○ and/or/xor: 3x the latency of a simple access
    ○ the rest: 2x

● Energy
    ○ cmp/search/clmul: 1.5x
    ○ copy/buz/not: 2x
    ○ rest: 2.5x

● Area
    ○ 8% overhead

Evaluation: Benefits

[Figures: benefit charts; only the annotations below survive]

● 53x
● data parallelism, latency reduction
● 90%, 89%, 71%, 92%
● runtime is shortened
● data movement is reduced
● key distribution
● BMM: 3.2x
● Wordcount: 2x
● StringMatch: 1.5x
● DB-BitMap: 1.6x

Evaluation: different configuration

in-place vs. near-place
● in-place: compute inside cache
● near-place: compute inside cache controller

L1 vs. L2 vs. L3

CC_L3 still accesses L1 and L2

Summary & Weakness

● Compute in cache!
    ○ Compute inside cache with bit-line hacking
    ○ A bunch of new instructions
    ○ High throughput
    ○ Great power efficiency
    ○ Low area overhead

● Weakness:
    ○ What about the cost of moving data to L3? This may cost more energy...

Discussion

● In-Cache Computing vs. In-Memory Computing
    ○ Which do you think is better?

● Any ideas for performing more complex operations?
    ○ For example, add? See Prof. Das' follow-up work.

● How about making cache computing more programmable?
    ○ Something like an FPGA?
