FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems

Jie Zhang and Myoungsoo Jung, Computer Architecture and Memory Systems Lab

Page 1: FlashAbacus: A Self-Governing Flash-Based Accelerator for ...camelab.org/uploads/Main/flashabacus.pdf · Intel 750 SSD DRAM CPU Xeon Phi 22W 7W 91W 300W Abacus 6W 10W NAND Flash low‐power

FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems

Jie Zhang and Myoungsoo Jung
Computer Architecture and Memory Systems Lab

Page 2:

Executive Summary

Traditional heterogeneous compute system:
• Long data path between accelerator and storage.
• Accelerators consume a lot of power.

[Figure: power labels for the traditional system, in slide order: Intel 750 SSD, DRAM, CPU, Xeon Phi; 22W, 7W, 91W, 300W. FlashAbacus: Abacus and NAND flash (labels 10W and 6W), low-power, no data movement.]

Major Results:
• Performance: 127% better than the traditional heterogeneous system.
• Energy: 78% less energy than the traditional approach.
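As a reading aid, the two headline percentages can be turned into ratios. The sketch below assumes that "127% better" means 2.27 times the baseline performance and that reducing 78% of the energy leaves 22% of the baseline; the slides do not state the exact definitions, so both readings are assumptions.

```python
# Convert the headline percentages into ratios (interpretation is assumed).
perf_gain_pct = 127    # "127% better" from the slide
energy_cut_pct = 78    # "reduce 78% of energy" from the slide

speedup = 1 + perf_gain_pct / 100         # performance relative to the baseline
energy_ratio = 1 - energy_cut_pct / 100   # fraction of baseline energy still used

print(f"speedup: {speedup:.2f}x, energy: {energy_ratio:.2f}x of baseline")
```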

Page 3:

Example: Top-500 HPC trends

Systems using coprocessors/accelerators: 18%

Accelerators are a promising solution, but they also face several challenges.

Page 4:

Challenge 1: power consumption

The power consumption of accelerators makes them difficult to adopt in low-power systems.

[Figure: power consumption; labels 300W, 180W, 20W]

Page 5:

Challenge 2: data movement overhead

[Figure: breakdown: 32% storage, 23% movement, 45% computation]

Page 6:

Challenge 2: data movement overhead

[Figure: breakdown: 17% storage, 64% movement, 19% computation]

Page 7:

Challenge 2: data movement

Discrete hardware:
i) Storage to device memory
ii) Device to host-side DRAM
iii) Host-side DRAM to user process
iv) User process to accelerator DRAM

[Figure: the four copies traced over a host (main CPU, cache, northbridge, memory/DRAM, IO controller) with an attached SSD (storage media) and an accelerator (EMPs, DRAM).]

Page 8:

Challenge 2: data movement

Discrete software stack:
• User space: data-intensive application, I/O runtime, Acc. runtime
• Kernel space: file system, HBA driver, Acc. driver
• Device space: firmware, SSD, accelerator

[Figure: the storage S/W stack (SSD, firmware, HBA driver, file system, I/O runtime) and the Acc. S/W stack (Acc. runtime, Acc. driver, accelerator) are both traversed by the data-intensive application.]

Page 9:

Challenge 3: accelerator utilization

Low-power compute systems are sensitive to serial program code.

Page 10:

Challenge 3: accelerator utilization

[Figure labels: 79%, 76%]

Page 11:

FlashAbacus

Our solution, FlashAbacus:
i. Reduce power consumption.
ii. Eliminate redundant data copies and the long data path.
iii. Improve core utilization.

[Figure: power consumption; labels 300W, 180W, and 20W (FlashAbacus)]

Page 12:

A glance at the hardware

[Figure: two platforms side by side. Heterogeneous platform: a many-core host (cores, NorthBridge, memory, IO controller) with a discrete SSD for storage and a discrete accelerator (EMPs, cache, memory). Our platform: a single accelerator that combines processor cores with flash.]

Page 13:

Inside Accelerator

[Figure: LWP0, LWP1, ..., LWPn (GPDSP cores) run Flashvisor, Storengine, and kernel execution. Tier-1 and Tier-2 networks link the LWPs with the peripheral components (PCIe controller, north bridge, scratchpad, shared memory (DDR3L), PSC) and with the flash backbone, where FPGA controllers each drive several flash packages (flash-based storage).]

Page 14:

Programming model

Traditional programming model (calls split between the I/O runtime and the Acc. runtime):
• Prologue: fopen(), malloc(), Acc-Malloc()
• Body (loop): fread(), Acc-Memcpy(), Acc-kernel(), Acc-Memcpy(), fwrite()
• Epilogue: free(), Acc-Free(), fclose()

FlashAbacus programming model:
• Host: kernelGen, kernelOffload
• Accelerator: loop (optional), kernelExe, dataSave (optional)

[Figure: timelines of both models; labels: serial, parallel]
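The traditional flow on this slide can be traced with a small sketch. The functions below are hypothetical logging stand-ins named after the calls on the slide (they perform no real I/O or device work), and the grouping into prologue, body, and epilogue is assumed from the slide labels. The sketch only counts how many data transfers the loop incurs.

```python
# Hypothetical stand-ins: each call is only logged with the runtime it belongs to.
trace = []

def io_call(name):      # I/O runtime: fopen, fread, fwrite, fclose
    trace.append(("io", name))

def acc_call(name):     # accelerator runtime: Acc-Malloc, Acc-Memcpy, ...
    trace.append(("acc", name))

def host_call(name):    # plain host library call: malloc, free
    trace.append(("host", name))

def traditional_program(iterations):
    # Prologue
    io_call("fopen")
    host_call("malloc")
    acc_call("Acc-Malloc")
    # Body: each iteration moves data storage -> host -> accelerator and back
    for _ in range(iterations):
        io_call("fread")
        acc_call("Acc-Memcpy")   # host to accelerator
        acc_call("Acc-kernel")
        acc_call("Acc-Memcpy")   # accelerator to host
        io_call("fwrite")
    # Epilogue
    host_call("free")
    acc_call("Acc-Free")
    io_call("fclose")

traditional_program(iterations=2)
transfers = {"fread", "fwrite", "Acc-Memcpy"}
copies = sum(1 for _, name in trace if name in transfers)
print(copies)  # 4 transfers per iteration, 8 for two iterations
```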

Page 15:

Software Development

• Fuse flash in a multi-core system
• Parallel kernel execution

Page 16:

Fuse flash in a multi-core system

Data access model

[Figure: an LWP with its L2 cache, DRAM, and flash; access paths labeled a, b, c]

Open questions:
• Storage access without an OS?
• Storage management?

Page 17:

Flash Virtualization

Flashvisor:

No OS/FS
• Directly expose the flash address space to LWPs.
• Map the flash address space to internal DRAM.

Manage storage access
• Maintain a simplified page mapping table.
• Translate from LBA to PPN.

Protection & access control
• Maintain a range lock for parallel data access.

Storengine: manages flash background tasks such as garbage collection and log dumping.
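A minimal sketch of the LBA-to-PPN translation described above, assuming a flat table with one entry per group of pages. The class name, the group size, and the log-style allocation are illustrative assumptions, not the actual Flashvisor data structures.

```python
PAGES_PER_GROUP = 4   # hypothetical: pages covered by one table entry

class PageTable:
    def __init__(self, num_groups):
        self.table = [None] * num_groups   # logical group -> physical group
        self.next_free = 0                 # next unused physical group

    def map_write(self, lba):
        """Point the logical group containing `lba` at a fresh physical group."""
        group = lba // PAGES_PER_GROUP
        self.table[group] = self.next_free
        self.next_free += 1
        return self.table[group]

    def translate(self, lba):
        """Return the physical page number (PPN) for a logical block address."""
        group, offset = divmod(lba, PAGES_PER_GROUP)
        phys_group = self.table[group]
        if phys_group is None:
            raise KeyError(f"LBA {lba} is unmapped")
        return phys_group * PAGES_PER_GROUP + offset

pt = PageTable(num_groups=8)
pt.map_write(9)          # LBA 9 is in logical group 2, mapped to physical group 0
pt.map_write(1)          # logical group 0 is mapped to physical group 1
print(pt.translate(9))   # physical group 0, offset 1 -> PPN 1
print(pt.translate(2))   # physical group 1, offset 2 -> PPN 6
```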

Page 18:

Flash Virtualization: Read

1. Message (kernel to Flashvisor)
2. Lock inquiry (range lock)
3. Page table lookup (scratchpad)
4. I/O (FPGA, flash)
5. DMA
6. Read (LPDDR3)

[Figure: address translation: the logical address indexes the page table, which yields the physical address (Ch#, pkg#, page group#). The range lock is an RB tree of start pages that is searched on each lock inquiry.]
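The lock inquiry can be illustrated with a small range lock keyed by start page. The slide uses an RB tree; the sketch below substitutes a sorted list with `bisect` as a simple stand-in, treats every lock as exclusive, and assumes ranges have a length of at least one page. All names are hypothetical.

```python
import bisect

class RangeLock:
    """Non-overlapping page ranges kept sorted by start page."""

    def __init__(self):
        self.starts = []   # sorted start pages
        self.ends = []     # ends[i]: exclusive end of the range at starts[i]

    def try_lock(self, start, length):
        end = start + length
        i = bisect.bisect_right(self.starts, start)
        # Overlap with the range that begins at or before `start`?
        if i > 0 and self.ends[i - 1] > start:
            return False
        # Overlap with the next range that begins after `start`?
        if i < len(self.starts) and self.starts[i] < end:
            return False
        self.starts.insert(i, start)
        self.ends.insert(i, end)
        return True

    def unlock(self, start):
        i = bisect.bisect_left(self.starts, start)
        if i == len(self.starts) or self.starts[i] != start:
            raise KeyError(start)
        del self.starts[i]
        del self.ends[i]

lock = RangeLock()
print(lock.try_lock(0, 8))   # True: pages 0..7 are free
print(lock.try_lock(4, 4))   # False: overlaps pages 0..7
print(lock.try_lock(8, 4))   # True: pages 8..11 are free
```

Because stored ranges never overlap, only the two neighbours of the requested start page need to be checked, which is the same property a balanced tree lookup would exploit.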

Page 19:

Flash Virtualization: Write

1. Write (kernel to LPDDR3)
2. Message (kernel to Flashvisor)
3. Lock inquiry (range lock)
4. Reclaim block (Storengine: garbage collection, page table snapshot)
5. I/O and DMA (FPGA, flash)
6. Page table update (scratchpad)
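A toy model of the write steps above, assuming out-of-place writes at page granularity. Real flash is reclaimed in erase blocks and the Storengine logic is not shown on the slide, so the class, its fields, and the reclaim policy are simplifying assumptions.

```python
class FlashLog:
    """Toy out-of-place writer: every write takes a fresh physical page."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # free physical pages
        self.page_table = {}                 # logical page -> physical page
        self.invalid = set()                 # stale pages awaiting reclaim

    def reclaim(self):
        # Stand-in for Storengine garbage collection: recycle stale pages.
        self.free.extend(sorted(self.invalid))
        self.invalid.clear()

    def write(self, lpn):
        if not self.free:
            self.reclaim()                   # step 4: reclaim block
        if not self.free:
            raise RuntimeError("flash is full")
        ppn = self.free.pop(0)               # step 5: I/O and DMA target this page
        old = self.page_table.get(lpn)
        if old is not None:
            self.invalid.add(old)            # the previous copy becomes stale
        self.page_table[lpn] = ppn           # step 6: page table update
        return ppn

log = FlashLog(num_pages=2)
print([log.write(0) for _ in range(3)])      # [0, 1, 0]: third write reuses page 0
```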

Page 20:

Software Development

• Fuse flash in a multi-core system
• Parallel kernel execution

Page 21:

Parallel Kernel Execution

Parallel execution model:
• Conventional (master thread): requires OS thread management and host-accelerator communication.
• FlashAbacus: no host intervention.

[Figure: host/user applications App1(), App2(), and App3() each offload Kernel 0, Kernel 1, and Kernel 2 to the accelerator, where Flashvisor performs address management and the kernels execute in parallel over the storage (LPDDR3, FPGAs).]

Page 22:

Coarse-granule Scheduling

Inter-kernel static scheduling (InterSt):
• Bind a user application to a specific LWP.

Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels to idle LWPs.

[Figure: timelines T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0. Under InterSt, k0 and k1 share one LWP and k2 and k3 share another. Under InterDy, the kernels are spread over the LWPs, and the shortened kernel latencies are marked SAVED.]
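The difference between the two policies can be sketched with a small simulation. It assumes all kernels arrive at T0, are independent, and have known durations; the durations and function names are made up for illustration.

```python
import heapq

def inter_static(apps, num_lwps):
    """InterSt: all kernels of app i run back to back on LWP (i mod num_lwps)."""
    busy_until = [0] * num_lwps
    finish = {}
    for i, kernels in enumerate(apps):
        lwp = i % num_lwps
        for name, duration in kernels:
            busy_until[lwp] += duration
            finish[name] = busy_until[lwp]
    return finish

def inter_dynamic(apps, num_lwps):
    """InterDy: each kernel goes to whichever LWP becomes idle first."""
    idle_at = [(0, lwp) for lwp in range(num_lwps)]   # (time LWP is free, LWP id)
    heapq.heapify(idle_at)
    finish = {}
    for kernels in apps:
        for name, duration in kernels:
            free_time, lwp = heapq.heappop(idle_at)
            finish[name] = free_time + duration
            heapq.heappush(idle_at, (finish[name], lwp))
    return finish

# Two applications with two kernels each (durations are illustrative).
apps = [[("k0", 2), ("k1", 3)], [("k2", 4), ("k3", 2)]]
static = inter_static(apps, num_lwps=4)
dynamic = inter_dynamic(apps, num_lwps=4)
print(max(static.values()), max(dynamic.values()))   # 6 4
```

With four LWPs and only two applications, the static policy leaves two LWPs idle, which is why the dynamic policy finishes earlier in this example.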

Page 23:

Fine-granule Scheduling

Partition kernel into microblocks:

An example of FDTD-2D

FOR j = 0..3
  ey[0][j] = _fict_[0]
ENDFOR
FOR i = 0..3
  FOR j = 0..3
    ey[i][j] = ...
  ENDFOR
  FOR j = 1..3
    ...
  ENDFOR
ENDFOR
FOR i = 0..3
  FOR j = 0..3
    ...
  ENDFOR
ENDFOR

[Figure: the kernel is divided into Microblock 0, Microblock 1, and Microblock 2; a "screen" label marks part of a microblock.]
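The partitioning can be made concrete with a runnable version of the FDTD-2D example. The update formulas follow the PolyBench fdtd-2d kernel as an assumption, since the right-hand sides are not legible on the slide. It is also assumed that each loop nest forms one microblock and that a "screen" is a slice of a microblock's rows that can run independently.

```python
N = 4  # grid size, matching the 0..3 loop bounds on the slide

def microblock0(ey, fict, t):
    # Boundary row of the electric field.
    for j in range(N):
        ey[0][j] = fict[t]

def microblock1(ex, ey, hz, rows):
    # Electric-field update; `rows` is the slice of i handled by one screen.
    for i in rows:
        if i > 0:
            for j in range(N):
                ey[i][j] -= 0.5 * (hz[i][j] - hz[i - 1][j])
        for j in range(1, N):
            ex[i][j] -= 0.5 * (hz[i][j] - hz[i][j - 1])

def microblock2(ex, ey, hz, rows):
    # Magnetic-field update; reads ex and ey written by microblock 1.
    for i in rows:
        if i < N - 1:
            for j in range(N - 1):
                hz[i][j] -= 0.7 * (ex[i][j + 1] - ex[i][j]
                                   + ey[i + 1][j] - ey[i][j])

def run_step(ex, ey, hz, fict, t, screens):
    microblock0(ey, fict, t)
    for rows in screens:      # screens of one microblock are independent
        microblock1(ex, ey, hz, rows)
    for rows in screens:      # starts only after all microblock-1 screens finish
        microblock2(ex, ey, hz, rows)

def fresh_grids():
    ex = [[0.0] * N for _ in range(N)]
    ey = [[0.0] * N for _ in range(N)]
    hz = [[float(i * N + j) for j in range(N)] for i in range(N)]
    return ex, ey, hz

whole = fresh_grids()
split = fresh_grids()
run_step(*whole, fict=[1.0], t=0, screens=[range(N)])
run_step(*split, fict=[1.0], t=0, screens=[range(2, N), range(0, 2)])
print(whole == split)   # True: screen order does not change the result
```

Microblock 2 reads `ey[i + 1]`, which another screen of microblock 1 may have written, so the microblocks themselves must run in order even though their screens need not.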

Page 24:

Fine-granule Scheduling

Intra-kernel out-of-order scheduling (IntraO3):
• Schedule microblocks from all kernels across LWPs.
• Pros: maximizes core utilization.
• Cons: the scheduler must ensure that concurrently running microblocks have no dependencies.

[Figure: timeline T0 to T7 on LWP0 to LWP3 for kernels of App0 and App2, all arriving at T0. Each kernel consists of Microblock 0 (parts 1, 2) and Microblock 1 (parts a, b). Parts from different kernels are interleaved across the LWPs, and the shortened kernel latencies are marked SAVED.]
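A minimal simulation of out-of-order microblock scheduling, under these assumptions: every microblock is a number of unit-time parts (screens) that can run in parallel, microblocks within a kernel must run in order, and each kernel has at least one microblock. The function name and the workload are illustrative.

```python
def intra_o3(kernels, num_lwps):
    """kernels: {name: [parts in microblock 0, parts in microblock 1, ...]}.

    Returns the time step at which each kernel finishes.
    """
    current = {k: 0 for k in kernels}                       # microblock in flight
    remaining = {k: mbs[0] for k, mbs in kernels.items()}   # parts left in it
    finish = {}
    t = 0
    while len(finish) < len(kernels):
        t += 1
        idle = num_lwps
        done_this_step = []
        for k in kernels:                 # fill idle LWPs from any kernel
            if k in finish or idle == 0:
                continue
            run = min(idle, remaining[k])
            idle -= run
            remaining[k] -= run
            if remaining[k] == 0:
                done_this_step.append(k)
        for k in done_this_step:          # the next microblock is ready next step
            current[k] += 1
            if current[k] == len(kernels[k]):
                finish[k] = t
            else:
                remaining[k] = kernels[k][current[k]]
    return finish

print(intra_o3({"k0": [2, 2], "k1": [1, 1]}, num_lwps=4))   # {'k0': 2, 'k1': 2}
```

Only parts of a kernel's current microblock are eligible in a given step, which is how the sketch enforces the "no dependency" condition from the slide.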

Page 25:

Experiment Setup

System configuration:
• Host: Xeon 2620-v3
• LWPs: 8 @ 1GHz
• SSD access latency: read 25us, write 800us
• Workloads: Polybench benchmark suite

Accelerator configuration:
• SIMD: uses OpenMP, with discrete storage and accelerator.
• InterSt: FlashAbacus with static inter-kernel scheduling.
• InterDy: FlashAbacus with dynamic inter-kernel scheduling.
• IntraO3: FlashAbacus with out-of-order intra-kernel scheduling.

Page 26:

Evaluation: Time series analysis

IntraO3 has a shorter storage access time than SIMD because it eliminates the data movement overhead.

IntraO3 has a shorter compute time because dynamic scheduling improves core utilization.

[Figure: time series with Storage Access and Compute phases]

Page 27:

Evaluation: Energy

FlashAbacus drastically reduces the energy spent on data movement.

Page 28:

Thank you

Page 29:

Backup

Page 30:

Performance Evolution in Computing

• Single-Core Era. Constrained by: power, complexity.
• Multi-Core Era. Constrained by: power, scalability.
• Heterogeneous System Era. Enabled by: data parallelism, high-performance accelerators (Intel Xeon Phi, GPGPU).

Page 31:

Challenge 2: data movement

Storage access accounts for a large share of the total execution time.

Page 32:

Parallel Kernel Execution

Manage kernel scheduling to maximize the execution throughput of all LWPs.

[Figure: parallel execution model for a single application (App() with Kernel 0, Kernel 1, Kernel 2, ..., Kernel n) and for multiple applications (App1(), App2(), App3(), each with Kernel 0, Kernel 1, Kernel 2). In both cases Flashvisor performs address management and the kernels execute in parallel over the storage (LPDDR3, FPGAs).]

Page 33:

Inside Accelerator

[Figure: LWP0, LWP1, ..., LWPn are linked by Tier-1 and Tier-2 networks with the PCIe controller, north bridge, scratchpad, shared memory (DDR3L), and PSC, and with the flash backbone, where FPGA controllers each drive several flash packages. LWP0 is highlighted.]

Page 34:

Programming model

[Figure: numbered steps across three phases: kernel offload, kernel schedule, kernel execution. Recoverable labels: host, INT, PCIe, Flashvisor, download, DRAM, sleep, PSC, invoke, LWP, load.]

Page 35:

Coarse-granule Scheduling

Inter-kernel static scheduling (InterSt):
• Bind a user application to a specific LWP.
• Pros: equivalent, no starvation
• Cons: low core utilization

Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels to idle LWPs.
• Pros: good performance when there are enough kernels
• Cons: poor performance when there are few kernels

Page 36:

Kernel Scheduling Strategies

Inter-kernel scheduling (static):

[Figure: timeline T0 to T7 on LWP0 to LWP3 for App0 (k0, k1) and App2 (k2, k3), all arriving at T0; k0 and k1 share one LWP, k2 and k3 share another; latencies of k0 to k3 are marked.]

Inter-kernel scheduling (dynamic):

[Figure: the same workload with the kernels spread over LWP0 to LWP3; the shortened kernel latencies are marked SAVED.]

Page 37:

Kernel Scheduling Strategies

Solution: partition kernel into microblocks:

An example of FDTD-2D

FOR j = 0..3
  ey[0][j] = _fict_[0]
ENDFOR
FOR i = 0..3
  FOR j = 0..3
    ey[i][j] = ...
  ENDFOR
  FOR j = 1..3
    ...
  ENDFOR
ENDFOR
FOR i = 0..3
  FOR j = 0..3
    ...
  ENDFOR
ENDFOR

[Figure: the kernel is divided into Microblock 0, Microblock 1, and Microblock 2; a "screen" label marks part of a microblock.]

Page 38:

Kernel Scheduling Strategies

Intra-kernel scheduling (in-order):

[Figure: timeline T0 to T7 on LWP0 to LWP3 for kernels of App0 and App2, all arriving at T0. Each kernel consists of Microblock 0 (parts 1, 2) and Microblock 1 (parts a, b). Kernels run one after another with their parts spread across the LWPs; the shortened kernel latencies are marked SAVED.]

Page 39:

Kernel Scheduling Strategies

Intra-kernel scheduling (out-of-order):

[Figure: timeline T0 to T7 on LWP0 to LWP3 for kernels of App0 and App2, all arriving at T0. Each kernel consists of Microblock 0 (parts 1, 2) and Microblock 1 (parts a, b). Parts from different kernels are interleaved across the LWPs; the shortened kernel latencies are marked SAVED.]

Page 40:

Fine-granule Scheduling

Intra-kernel in-order scheduling (IntraIo):
• Execute kernels serially and schedule the microblocks of each kernel across all LWPs.
• Pros: reduces the complexity of microblock scheduling.
• Cons: cannot maximize core utilization.

[Figure: timeline T0 to T7 on LWP0 to LWP3 for kernels of App0 and App2, all arriving at T0. Each kernel consists of Microblock 0 (parts 1, 2) and Microblock 1 (parts a, b). Kernels run one after another with their parts spread across the LWPs; the shortened kernel latencies are marked SAVED.]
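A minimal sketch of in-order scheduling under simplifying assumptions: each microblock is a number of unit-time parts, a microblock's parts are spread over all LWPs, and nothing from the next microblock or the next kernel starts until the current microblock is done. The function name and workload are illustrative.

```python
import math

def intra_io(kernels, num_lwps):
    """kernels: list of (name, [parts per microblock]) in arrival order.

    Returns the time step at which each kernel finishes.
    """
    t = 0
    finish = {}
    for name, microblocks in kernels:            # kernels run strictly in series
        for parts in microblocks:                # microblocks of a kernel run in order
            t += math.ceil(parts / num_lwps)     # parts spread over all LWPs
        finish[name] = t
    return finish

workload = [("k0", [2, 2]), ("k1", [1, 1])]
result = intra_io(workload, num_lwps=4)
print(result)   # {'k0': 2, 'k1': 4}

total_parts = sum(sum(mbs) for _, mbs in workload)
utilization = total_parts / (4 * max(result.values()))
print(utilization)   # 0.375: most LWP slots stay idle
```

In this example no microblock has enough parts to fill four LWPs, which illustrates the "cannot maximize core utilization" drawback listed above.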

Page 41:

Evaluation: Throughput

InterSt/IntraIo perform better than SIMD because the accelerator and NAND flash are integrated.

InterDy/IntraO3 perform better than InterSt/IntraIo because dynamic scheduling improves core utilization.

Page 42:

Evaluation: Energy

InterDy/IntraO3 show an SSD access energy breakdown similar to SIMD, as they access the same amount of data.

InterDy/IntraO3 consume even less computation energy than SIMD, as dynamic scheduling ensures that kernels can be executed in parallel.

FlashAbacus drastically reduces the energy spent on data movement.