DASX : Hardware Accelerator for Software Data Structures
Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman (Simon Fraser University),
Vijayalakshmi Srinivasan (IBM Research)
Executive Summary
void simple() {
  for (int i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
  }
}
For each array element, the core executes:
  mov $400, %r1
  mov $4, %r2
  mul %r3, %r2
  add %r1, %r2
  ld (%r2), %r4
[Figure: CORE with a Reorder Buffer holding mov, mul, add, ld for each array element]
Extra work encumbers the core!
DASX : Accelerate access to, and computation on, software data structures
[Figure: blocks labeled H1 to H5]
High-level info lost!
Outline
– Challenges of data-centric applications
– Existing mechanisms to address challenges
– DASX : Data Structure Accelerator
– Benchmarks and Evaluation
Challenge 1/3 : Instruction Overhead
Instructions / Element
• 1D Vector : 2
• 2D Vector : 3
• 6D Vector : 12
• OLAP Cube [Gray et al. DMKD ‘96] : up to 15D!
• Unordered Set : avg. 12 instructions
• BTree : 100s of instructions
COMPUTE : 9%   DATA : 66%
void simple() {
  for (int i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
  }
}
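To illustrate why the per-element cost grows with dimensionality, here is a minimal sketch of row-major address generation. The name `linear_offset` and the dimension sizes are illustrative assumptions, not taken from the slides, and the instruction counts listed above are the authors' measurements rather than something this code reproduces. It only shows that each extra dimension adds a multiply and an add to every access.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Row-major linearization in Horner form: one multiply and one add
// per dimension, repeated for every element access.
template <std::size_t N>
std::size_t linear_offset(const std::array<std::size_t, N>& dims,
                          const std::array<std::size_t, N>& idx) {
    std::size_t off = 0;
    for (std::size_t d = 0; d < N; ++d) {
        off = off * dims[d] + idx[d];
    }
    return off;
}

// Hypothetical usage: element (1, 2) of a 3 x 4 grid sits at 1 * 4 + 2 = 6.
inline void linear_offset_example() {
    const std::array<std::size_t, 2> dims{3, 4};
    const std::array<std::size_t, 2> idx{1, 2};
    assert(linear_offset(dims, idx) == 6);
}
```

None of this arithmetic contributes to the actual computation (`b[i] + c[i]`); it exists only to locate the data.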
Challenge 2/3 : Memory Level Parallelism
Each element independent
For each array element, the core executes:
  mov $400, %r1
  mov $4, %r2
  mul %r3, %r2
  add %r1, %r2
  ld (%r2), %r4
[Figure: CORE with a Reorder Buffer holding mov, mul, add, ld for each array element]
Can't discover more MLP!
Accessing multiple data structures makes this worse!
Challenge 3/3 : Managing Cache Space
[Figure: memory hierarchy: CPU, L1, L2, MEM]
Not enough space in cache
Existing Mechanisms – Prefetching
+ Increases Memory Level Parallelism
– Increases instructions (SW PF)
– Best effort (HW PF)
– Can cause cache thrashing
void simple() {
  for (int i = 0; i < size; ++i) {
    // k : prefetch distance, in elements ahead of i
    prefetch(a + i + k);
    prefetch(b + i + k);
    prefetch(c + i + k);
    a[i] += b[i] + c[i];
  }
}
[Figure: Reorder Buffer holding add, pref, mov, load]
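As a concrete version of the software-prefetch loop, the sketch below uses `__builtin_prefetch`, a GCC/Clang builtin that acts only as a hint. The function name `add_with_prefetch` and the `dist` parameter are assumptions for illustration; a useful prefetch distance depends on the machine and is not given in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// a[i] += b[i] + c[i], prefetching `dist` elements ahead of the current index.
// The bounds check and the three prefetches are the extra instructions
// that software prefetching adds to every iteration.
void add_with_prefetch(std::vector<int>& a, const std::vector<int>& b,
                       const std::vector<int>& c, std::size_t dist) {
    const std::size_t n = a.size();
    assert(b.size() >= n && c.size() >= n);
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n) {
            __builtin_prefetch(&b[i + dist], 0);  // 0 : prefetch for read
            __builtin_prefetch(&c[i + dist], 0);
            __builtin_prefetch(&a[i + dist], 1);  // 1 : prefetch for write
        }
        a[i] += b[i] + c[i];
    }
}
```

The result is identical to the plain loop; only the instruction count and the memory access timing change.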
Existing Mechanisms – SIMD
+ Reduce Instructions
– Algorithm change
– Increase power
void simple() {
  for (int i = 0; i < size; i += k) {
    SIMD_LOAD(a[i]:a[i+k]);
    SIMD_LOAD(b[i]:b[i+k]);
    SIMD_LOAD(c[i]:c[i+k]);
    SIMD_ADD(a[…], b[…], c[…]);
  }
}
[Figure: Reorder Buffer holding add, load, add]
Our Approach – DASX
[Figure: OOO CORE, CACHE, SHARED LAST LEVEL CACHE, and DASX (Collector + Processing Elements (PEs))]
• Collector : Data structure specific fetch engine
• Processing Elements (PEs) : Lightweight pipelines, all instructions fixed latency
DASX – Sample Programmer’s API
void simple() {
  for (int i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
  }
}
BEGIN SIMPLE
  // Initialize Collectors
  coll_a = new coll(ST, &a, INT, size, 0, VEC);
  coll_b = new coll(LD, &b, INT, size, 0, VEC);
  coll_c = new coll(LD, &c, INT, size, 0, VEC);
  // Run in lock-step
  group::add(coll_a, coll_b, coll_c);
  auto kfn = [](auto i, auto j) { return i + j; };
  // Start processing
  start(kfn, size);
END SIMPLE
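To make the programming model concrete, the sketch below emulates it in plain software. `Mode`, `Coll` and `run_group` are hypothetical stand-ins invented for this example; they are not the DASX API, and the remaining `coll` constructor arguments from the slide (`INT`, `0`, `VEC`) are not modeled because their exact meaning is not stated. What it shows is that the kernel receives element values only and never computes an address.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software stand-in for a collector: an access mode plus a
// view of a contiguous int vector.
enum class Mode { LD, ST };

struct Coll {
    Mode mode;
    int* base;
    std::size_t size;
};

// Emulates grouping one store collector with two load collectors and then
// starting the kernel: all three advance in lock-step, one element per step.
template <typename Kernel>
void run_group(Coll out, Coll in0, Coll in1, Kernel kfn, std::size_t n) {
    assert(out.mode == Mode::ST && in0.mode == Mode::LD && in1.mode == Mode::LD);
    assert(n <= out.size && n <= in0.size && n <= in1.size);
    for (std::size_t i = 0; i < n; ++i) {
        out.base[i] = kfn(in0.base[i], in1.base[i]);
    }
}
```

A caller would build one `Coll` per vector and pass the same lambda as on the slide, for example `run_group(ca, cb, cc, [](auto i, auto j) { return i + j; }, size)`.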
DASX – Data Structure Accelerator
1 Translate key, fetch elements
2 Allocate
3 Lock iteration data
4 Fill local storage
5 Compute (SPMD)
6 Write back dirty data
7 Unlock iteration data
[Figure: Collector and PEs connected to CACHE and MEM, with STOP / GO signals]
DASX – Data Structure Accelerator
DECOUPLED ACCESS (1 – 3)
  1 Translate key, fetch elements
  2 Allocate
  3 Lock iteration data
4 Fill local storage
EXECUTE (5 – 7)
  5 Compute (SPMD)
  6 Write back dirty data
  7 Unlock iteration data
[Figure: Collector and PEs connected to CACHE and MEM, with a STOP signal]
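The fill, compute and write-back steps can be sketched in software as a batched loop. This is only a sequential model under stated assumptions: `run_batched` and `kLanes` are names invented here, the batch width of 8 is assumed, and the locking steps and the asynchronous overlap between access and execute are not modeled.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8;  // assumed number of elements per batch

// One batch at a time: fill local storage (step 4), compute on the local
// copies only (step 5), then write the results back (step 6).
template <typename Kernel>
void run_batched(std::vector<int>& a, const std::vector<int>& b,
                 const std::vector<int>& c, Kernel kfn) {
    const std::size_t n = a.size();
    assert(b.size() >= n && c.size() >= n);
    std::array<int, kLanes> lb{}, lc{}, la{};
    for (std::size_t base = 0; base < n; base += kLanes) {
        const std::size_t len = std::min(kLanes, n - base);
        std::copy_n(b.begin() + base, len, lb.begin());  // fill
        std::copy_n(c.begin() + base, len, lc.begin());
        for (std::size_t l = 0; l < len; ++l) {          // compute
            la[l] = kfn(lb[l], lc[l]);
        }
        std::copy_n(la.begin(), len, a.begin() + base);  // write back
    }
}
```

The compute loop touches only `lb`, `lc` and `la`, which mirrors the separation between the access steps and the execute steps.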
Challenges Recap
Challenge 1 : Reduce Instruction Overhead
Challenge 2 : Increase Memory Level Parallelism
Challenge 3 : Better Cache Management
DASX – Processing Elements
[Figure: Instruction Memory (1KB) feeding LANE 1 … LANE 8, each with REG (32)]
Features
• 3 stage pipeline
• Single Program Multiple Data
• Each PE executes 1 iteration
• No address generation
• Reference data using “keys”
“Reduce Instruction Overhead” by using SPMD Model and removing address generation.
DASX – Key Interface
• Vector Keys : LD Key == LD Iter * Size + Offset
• Hash Table Keys : LD KEY
• BTree Keys
[Figure: keys 0, 1, 2, 3 mapped to Key / Data entries]
Remove address generation overhead
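The vector-key rule can be read literally as a small function. In the sketch below, treating `size` as bytes per element, `offset` as a byte offset, and adding a base address on the collector side are all assumptions made for illustration; `vector_key` and `key_to_address` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Literal reading of the rule: key = iter * size + offset.
std::uint64_t vector_key(std::uint64_t iter, std::uint64_t size,
                         std::uint64_t offset) {
    return iter * size + offset;
}

// Collector-side translation: the PE names data by key, and only the
// collector adds the (hypothetical) base address of the data structure.
std::uint64_t key_to_address(std::uint64_t base, std::uint64_t key) {
    return base + key;
}

// Hypothetical usage: iteration 3 of a vector of 4-byte ints at base 0x1000.
inline void vector_key_example() {
    assert(key_to_address(0x1000, vector_key(3, 4, 0)) == 0x100C);
}
```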
DASX – Collector
Data structure fetch engine
• Specialize traversal
• User defined elements
Data Structure   Collector                        HW OP
Vector           Address / Stride Calc.           ADD, CMP
Hash Table       Index Calc. + Bucket Traversal   INT ALU
BTree            Traversal                        CMOV, ADD, CMP
Tasks – 1) Prefetch 2) Manage Cache Space
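As an example of the kind of traversal a collector specializes, here is a minimal software sketch of the hash table case: an index calculation followed by a bucket walk. The chained layout, the modulo hash and the names `Entry` and `ChainedTable` are assumptions for illustration; the slides do not specify the table organization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Entry {
    int key;
    int value;
};

// Chained hash table: index calculation selects a bucket, then the bucket
// is traversed linearly using integer compares.
struct ChainedTable {
    std::vector<std::vector<Entry>> buckets;

    explicit ChainedTable(std::size_t n) : buckets(n) { assert(n > 0); }

    // Placeholder hash: key modulo bucket count.
    std::size_t index(int key) const {
        return static_cast<std::size_t>(static_cast<unsigned>(key)) % buckets.size();
    }

    void insert(int key, int value) {
        buckets[index(key)].push_back({key, value});
    }

    // Returns true and sets `out` if the key is present.
    bool find(int key, int& out) const {
        for (const Entry& e : buckets[index(key)]) {
            if (e.key == key) {
                out = e.value;
                return true;
            }
        }
        return false;
    }
};
```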
Collector Task 1 : Prefetch
1 Translate keys, fetch elements
2 Allocate
• Run asynchronously with compute
• Reduce address generation cost
• Granularity of access : Data structure element
• Enhanced memory level parallelism
[Figure: Collector fetching elements from MEM into CACHE]
Collector Task 2 : Manage Cache Space
3 Lock iteration data
4 Fill local storage
6 Write back dirty data
7 Unlock iteration data
• Manage cache fill and replacement
• Bulk fill OBJ-Store before iteration
• Per element refill from cache to OBJ-Store
[Figure: Collector moving data between CACHE and the OBJ-Store next to the PEs]
Benchmarks
• Recommender
• Text Search
• Hash Table
• OLAP Cubing
• BTree
• Black-Scholes
[Figure: blocks labeled H1 to H5]
Evaluation – Setup
Configurations compared : DASX vs MT vs OOO
• DASX : figure labels 8 and 1KB
• MT (8 threads) : IO COREs, 32 KB L1 each
• OOO : OOO CORE, 64 KB L1
LLC – 4MB, 16 WAY, NUCA
DRAM – DDR2-400, 16GB, 4 Chn.
Evaluation – Performance Breakdown
[Figure: execution time normalized to OOO Core (lower is better), y-axis 0.00 to 1.25, for D. Cube (Memory Bound) and Black. (Compute Bound). Bar labels: MT; 1 In-Order Core at LLC; + Collector (data structure engine); – Address Gen. + Local Store; X 8]
Evaluation – Performance
[Figure: speedup normalized to OOO (higher is better), y-axis 0 to 20, for D.Cube, Reco., BTree, Hash., Black., Text. Series: DASX, 2C-4T, MT (8). Off-scale bars labeled 23.2 and 158]
Evaluation – Energy vs Performance
[Figure: Data-Cubing. Energy (y-axis, 0.00 to 1.00) vs Execution Cycles (x-axis, 0.04 to 0.22). Points: OOO, MT-8, MT-16, MT-32, DASX-4, DASX-8. A "Best" marker indicates the preferred region]
Summary
Highlighted the challenges of data-centric workloads
Demonstrated the effectiveness of using data structure specific information
Data structure aware hardware accelerator achieves 4.4X performance improvement
Q & A
Backup
1. Percentage of data structure instructions
2. Why collector groups?
3. Energy breakdown
4. OBJ-Store details
5. Address translation for keys
Percentage of data structure instructions
Why collector groups
Evaluation – Energy Reduction
[Figure: Network and Cache energy reduction (higher is better) for D.Cube, Reco., BTree, Hash., Black., Text. Series: NW-DASX, NW-2C-4T, Cache-DASX, Cache-2C-4T. Axes ticked 0 to 5 and 0 to 75. Off-scale bars labeled 32.7, 6.5 and 12.2. Annotations: Streaming, Cache Thrashing]
DASX – OBJ-Store
Reduce energy – filter access to LLC
Organization : Decoupled sector cache (1KB)
• Minimize tag overhead for vectors
• Adapt to spatial locality (e.g. struct fields)
[Figure: OBJ-Store with a Tag array (KEY, V/I, LLC*) and a Data array, accessed by PE LD / ST and by write backs]
DASX – Address Translation for Keys
• Reduce energy overhead
• Keys are coalesced by the collector into cache lines
• Only one translation per line vs. per access
• No reverse translation, due to the back pointer (refer to OBJ-Store)
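The saving from coalescing can be illustrated by counting translations with and without it. The sketch below is a simple software model under stated assumptions: a 64-byte line size (`kLineBytes`) is assumed, and `translations_per_line` is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::uint64_t kLineBytes = 64;  // assumed cache line size

// With coalescing, one translation is needed per distinct cache line,
// rather than one per access (which would be addrs.size()).
std::size_t translations_per_line(const std::vector<std::uint64_t>& addrs) {
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : addrs) {
        lines.insert(a / kLineBytes);
    }
    return lines.size();
}

// Hypothetical usage: two accesses to the same line need one translation.
inline void translation_example() {
    assert(translations_per_line({0, 4}) == 1);
}
```

For sixteen consecutive 4-byte elements, per-access translation would perform sixteen lookups, while the coalesced version performs one.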