Compute Cache - eecs.umich.edu - Application-Specific Archs.pdf · Compute Cache Shaizeen Aga,...
Compute Cache
Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das
Presented by Gefei Zuo and Jiacheng Ma
Agenda
● Motivation
● Goal
● Design
● Evaluation
● Discussion
Motivation
● Data-centric applications
● Applications exhibit a high degree of data parallelism
● Time and energy spent moving data >> time and energy spent on actual computation
Goal
In-place computation in caches
● Massive parallelism: new vector instructions (SIMD)
● Less data movement: new cache organization
● Low area overhead: reuse existing circuits
Design Overview
[Figure: cache hierarchy, cache geometry, and in-place compute]
1. ISA
2. Bit-line computing
3. Operand locality
4. Execution management
5. Cache coherence
Design: ISA
Design: bit-line computing
Background: 6-transistor (6T) bit-cell with two stable states
Read:
1. Precharge BL=1, BLB=1
2. Assert word line
3. Sense amplifier detects the voltage difference
Write:
1. Drive BL and BLB to the desired values (BLB = ~BL)
2. Assert word line
Design: bit-line computing cont.
Logical operation:
1. Precharge BL=Vref, BLB=Vref
2. Activate two rows (WLi, WLj)
3. Sensing BL => AND, sensing BLB => NOR
  ○ BL is sensed as ‘1’ only if both activated bits are ‘1’; BLB carries the complements, so it is sensed as ‘1’ only if both bits are ‘0’
XOR = wired-NOR(NOR, AND)
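The logical-operation steps above can be illustrated with a small behavioral model. This is only a sketch in Python, not the circuit itself: the function name `bitline_compute` and the list-of-bits row representation are assumptions made for illustration, and analog details such as Vref and sense-amplifier thresholds are abstracted away.

```python
def bitline_compute(row_i, row_j):
    """Behavioral model of activating two word lines that share bit-lines.

    row_i, row_j: equal-length lists of 0/1 stored bits (hypothetical layout).
    Returns (and_bits, nor_bits, xor_bits), one entry per bit-line pair.
    """
    if len(row_i) != len(row_j):
        raise ValueError("both rows must share the same set of bit-lines")
    and_bits, nor_bits, xor_bits = [], [], []
    for a, b in zip(row_i, row_j):
        # BL stays high only if neither cell pulls it low: both cells store 1.
        bl = 1 if (a == 1 and b == 1) else 0
        # BLB sees the complemented values: it stays high only if both store 0.
        blb = 1 if (a == 0 and b == 0) else 0
        and_bits.append(bl)
        nor_bits.append(blb)
        # XOR is the NOR of the two sensed outputs (AND, NOR).
        xor_bits.append(1 - (bl | blb))
    return and_bits, nor_bits, xor_bits


if __name__ == "__main__":
    # All four input combinations, one per bit-line pair.
    print(bitline_compute([0, 0, 1, 1], [0, 1, 0, 1]))
```

Each list position stands for one bit-line pair, so a whole row pair is processed in a single activation, which is where the vector parallelism comes from.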
Design: operand locality
Operand locality requirement: operands must physically share the same set of bit-lines
Design choices:
● All ways in a set mapped to the same block partition.
● Use part of the set-index bits to select bank and block position.
At most 12 address bits need to match to guarantee operand locality.
2^12 = 4 KB = page size, so page alignment guarantees operand locality.
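The 12-bit bound can be checked with simple address arithmetic. The sketch below is a hedged illustration, not the paper's mechanism: it assumes that "12 bits" means the low-order 12 address bits of the operands must agree, and the helper names `same_low_bits` and `is_page_aligned` are hypothetical.

```python
PAGE_SIZE = 4096      # 2^12 bytes, the page size quoted on the slide
LOCALITY_BITS = 12    # upper bound on matching bits quoted on the slide


def same_low_bits(addr_a, addr_b, bits=LOCALITY_BITS):
    """Return True if two addresses agree in their low-order `bits` bits.

    Assumption: agreeing in these bits is what places both operands on the
    same bit-lines; the real mapping depends on the cache geometry.
    """
    mask = (1 << bits) - 1
    return (addr_a & mask) == (addr_b & mask)


def is_page_aligned(addr):
    """Return True if addr starts on a page boundary."""
    return addr % PAGE_SIZE == 0


if __name__ == "__main__":
    a, b = 0x10000, 0x24000            # two page-aligned operand bases
    print(is_page_aligned(a), is_page_aligned(b))
    # Corresponding blocks of page-aligned operands share their low 12 bits.
    print(same_low_bits(a + 0x40, b + 0x40))
```

Under this assumption, any two page-aligned operands agree in the low 12 bits at every corresponding offset, which is why page alignment is sufficient.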
Design: operand locality cont.
When perfect operand locality cannot be achieved: use one additional vector logic unit per cache controller
● High area overhead
  ○ 16 MB L3 with 512 sub-arrays = 128 × 64 B logic units
● High latency (14 cycles vs. 22 cycles)
● High energy consumption (60%~80% of energy spent on H-tree wire transfer)
Design: execution management
Enhanced cache controller:
● Break each instruction into operations
  ○ each operation's operands span at most one cache block
● Deploy the key to the blocks
  ○ for cc_search
● Split an instruction into multiple instructions
  ○ if the operands are too long
  ○ by raising a pipeline exception
[Figure: an instruction broken into operations (OP OP OP OP); for cc_search, the key is deployed to each block (key key key key)]
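The first step above, breaking an instruction into block-sized operations, can be sketched as follows. This is a minimal illustration rather than the controller's actual logic: the 64-byte block size, the function name `split_into_operations`, and the tuple layout are all assumptions, and operands are assumed to start on block boundaries.

```python
BLOCK_SIZE = 64  # bytes; assumed cache-block size, not stated on this slide


def split_into_operations(src_a, src_b, dest, length, block_size=BLOCK_SIZE):
    """Split a two-source vector instruction into per-block operations.

    Returns a list of (src_a_addr, src_b_addr, dest_addr, n_bytes) tuples,
    each covering at most one cache block per operand.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    ops = []
    for offset in range(0, length, block_size):
        # The final operation may cover less than a full block.
        n_bytes = min(block_size, length - offset)
        ops.append((src_a + offset, src_b + offset, dest + offset, n_bytes))
    return ops


if __name__ == "__main__":
    for op in split_into_operations(0x1000, 0x2000, 0x3000, 256):
        print([hex(x) for x in op[:3]], op[3])
```

A 256-byte instruction yields four operations under these assumptions, matching the "OP OP OP OP" picture on the slide.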
Design: operation cache level
Operate at the highest cache level that contains all operands.
Example:
● cc_and
● operands: A, B, C
  ○ A in L2, B in L2, C in memory
● C is not cached, so the operation is performed in L3
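The level-selection rule and the example above can be expressed as a short function. This is a sketch under stated assumptions: the hierarchy is treated as inclusive (an operand present at one level is also present at every lower level), an uncached operand forces the operation to L3 as in the slide's example, and the name `choose_compute_level` is hypothetical.

```python
LEVELS = ["L1", "L2", "L3"]  # ordered from highest (closest to core) to lowest


def choose_compute_level(operand_locations):
    """Pick the cache level at which to run a Compute Cache operation.

    operand_locations maps an operand name to the highest level holding it:
    "L1", "L2", "L3", or "memory" if it is not cached at all.
    """
    deepest = 0
    for location in operand_locations.values():
        if location == "memory":
            # Assumption taken from the slide's example: fall back to L3.
            return LEVELS[-1]
        # The chosen level must be at or below every operand's highest level.
        deepest = max(deepest, LEVELS.index(location))
    return LEVELS[deepest]


if __name__ == "__main__":
    print(choose_compute_level({"A": "L2", "B": "L2", "C": "memory"}))
    print(choose_compute_level({"A": "L1", "B": "L2"}))
```

With the slide's operands (A and B in L2, C in memory) the function selects L3; with all operands in L1 it selects L1.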
Design: cache coherence & memory model
● No impact on cache coherence
  ○ Cache coherence requests are serviced first
  ○ When a request arrives:
     i. The cache line is unlocked
     ii. The cache line is marked invalid (maybe)
     iii. The cache line is re-fetched
● No impact on the memory model
  ○ Fence instructions are still usable with Compute Cache (CC)
Evaluation
● Environment
  ○ SniperSim (simulator) & McPAT (power)
  ○ 8-core CMP with 32 KB L1-D, 256 KB L2, and 16 MB L3
● Benchmarks
  ○ WordCount: cc_search
  ○ StringMatch: cc_cmp
  ○ DB-BitMap: cc_or, cc_and
  ○ Bit Matrix Multiplication: cc_clmul
  ○ Checkpointing: cc_copy
Evaluation
● Delay
  ○ Negligible impact on baseline reads/writes
  ○ and/or/xor: 3x the latency of a simple access
  ○ The rest: 2x
● Energy
  ○ cmp/search/clmul: 1.5x
  ○ copy/buz/not: 2x
  ○ The rest: 2.5x
● Area
  ○ 8% overhead
Evaluation: Benefits
[Figure: benefit charts; surviving annotations: "53x", "data parallelism", "latency reduction", "key distribution", "90% 89% 71% 92%"]
● Runtime is shortened
● Data movement is reduced
● Per-benchmark figures: BMM: 3.2x, WordCount: 2x, StringMatch: 1.5x, DB-BitMap: 1.6x
Evaluation: different configuration
In-place vs. near-place
● In-place: compute inside the cache
● Near-place: compute inside the cache controller
L1 vs. L2 vs. L3
CC_L3 still accesses L1 and L2
Summary & Weakness
● Compute in cache!
  ○ Compute inside the cache with bit-line hacking
  ○ A set of new instructions
  ○ High throughput
  ○ Great power efficiency
  ○ Low area overhead
● Weakness:
  ○ What about the cost of moving data to L3? This may cost more energy...
Discussion
● In-Cache Computing vs. In-Memory Computing
  ○ Which do you think is better?
● Any ideas for performing more complex operations?
  ○ For example, addition? See Prof. Das's follow-up work.
● How about making cache computing more programmable?
  ○ Something like an FPGA?