FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems
Jie Zhang and Myoungsoo Jung, Computer Architecture and Memory Systems Lab
Executive Summary
Traditional heterogeneous compute system:
• Long data path between accelerator and storage
• Accelerators draw high power (Intel 750 SSD: 22W, DRAM: 7W, Xeon CPU: 91W, Xeon Phi: 300W)
FlashAbacus: low power (10W processor, 6W NAND flash), with no data movement between accelerator and storage.
Major results:
• Performance: 127% better than the traditional heterogeneous system.
• Energy: 78% lower than the traditional approach.
FlashAbacus
Example: Top-500 HPC trends
Systems using coprocessors/accelerators: 18%
Accelerators are a promising solution, but they also face several challenges, beginning with power consumption.
Challenge 1: power consumption
High power consumption keeps accelerators from being adopted in low-power systems.
[Figure: power levels of 300W, 180W, and 20W]
Challenge 2: data movement overhead
Execution time breakdowns of two example workloads:
• 32% storage, 23% movement, 45% computation
• 17% storage, 64% movement, 19% computation
Challenge 2: data movement (discrete hardware)
The data path crosses the discrete hardware four times:
i) Storage media (SSD) to device memory
ii) Device to host-side DRAM
iii) Host-side DRAM to user process
iv) User process to accelerator DRAM
[Figure: each hop traverses the IO controller, northbridge, cache, and the main CPU's DRAM]
Challenge 2: data movement (discrete software stack)
Storage S/W stack: data-intensive application → I/O runtime (user space) → file system and HBA driver (kernel space) → SSD firmware (device space)
Accelerator S/W stack: data-intensive application → accelerator runtime (user space) → accelerator driver (kernel space) → accelerator firmware (device space)
Every request walks one of these two separate stacks end to end.
Challenge 3: accelerator utilization
Low-power compute systems are sensitive to serial program code.
[Figure: utilization bars of 79% and 76%]
FlashAbacus
Our solution, FlashAbacus:
i. Reduces power consumption (20W FlashAbacus vs. 300W and 180W accelerators);
ii. Eliminates redundant data copies and the long data path;
iii. Improves core utilization.
A glance at the hardware
Heterogeneous platform: a many-core host (cores, cache, northbridge, memory, IO controller) attached to a discrete SSD and a discrete accelerator with its own processor, cache, and memory.
Our platform: the accelerator and storage are fused into a single device, with processor cores placed directly beside the flash.
Inside the accelerator:
• GPDSP cores: LWP0..LWPn (lightweight processors) run kernel execution; Flashvisor and Storengine run alongside them as firmware.
• Peripheral components: PCIe controller, northbridge, scratchpad, shared memory (DDR3L), PSC.
• Flash backbone: FPGA-based flash controllers, each attached to multiple flash packages, forming the flash-based storage.
Tier-1 and tier-2 networks connect the cores, the peripherals, and the flash backbone.
Programming model
Traditional programming model, interleaving the I/O runtime and the accelerator runtime on the host:
• Prologue: fopen(), malloc(), Acc-Malloc()
• Body (loop): fread(), Acc-Memcpy(), Acc-kernel(), Acc-Memcpy(), fwrite()
• Epilogue: free(), Acc-Free(), fclose()
The prologue and epilogue are serial, and the body is serialized around data movement.
FlashAbacus programming model: the host performs kernelGen and kernelOffload; the accelerator runs kernelExe (optionally in a loop) followed by an optional dataSave. The serial host phases shrink, and the body executes in parallel on the device. A C sketch of the contrast follows.
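To make the contrast concrete, here is a minimal C sketch of the two models. The acc_* and kernel_* functions are hypothetical stand-ins for the slide's Acc-Malloc/Acc-Memcpy/Acc-kernel and kernelGen/kernelOffload/kernelExe calls, stubbed so the sketch compiles; this is an illustration under those assumptions, not the paper's actual API.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accelerator runtime, stubbed so the sketch compiles. */
static void *acc_malloc(size_t n)                         { return malloc(n); }
static void  acc_free(void *p)                            { free(p); }
static void  acc_memcpy(void *d, const void *s, size_t n) { memcpy(d, s, n); }
static void  acc_kernel(void *buf, size_t n)              { (void)buf; (void)n; }

/* Traditional model: each iteration moves data SSD -> host -> accelerator. */
void traditional(const char *in, const char *out, size_t chunk)
{
    FILE *fi = fopen(in, "rb"), *fo = fopen(out, "wb");   /* prologue */
    char *host = malloc(chunk);
    void *dev  = acc_malloc(chunk);
    size_t n;
    if (!fi || !fo) return;
    while ((n = fread(host, 1, chunk, fi)) > 0) {         /* body (loop) */
        acc_memcpy(dev, host, n);      /* host DRAM -> accelerator DRAM */
        acc_kernel(dev, n);            /* compute                       */
        acc_memcpy(host, dev, n);      /* accelerator DRAM -> host DRAM */
        fwrite(host, 1, n, fo);
    }
    acc_free(dev); free(host); fclose(fi); fclose(fo);    /* epilogue */
}

/* FlashAbacus model: data never leaves the device; the host only
 * generates and offloads a kernel. */
typedef int kernel_t;
static kernel_t kernel_gen(const char *name) { (void)name; return 0; }
static void     kernel_offload(kernel_t k)   { (void)k; } /* over PCIe */
static void     kernel_exe(kernel_t k)       { (void)k; } /* on LWPs   */

void flashabacus(void)
{
    kernel_t k = kernel_gen("fdtd2d");  /* kernelGen                        */
    kernel_offload(k);                  /* kernelOffload                    */
    kernel_exe(k);                      /* kernelExe (loop optional)        */
    /* dataSave optional: results already reside in the flash backbone.    */
}
```

Note how every iteration of the traditional body pays two acc_memcpy transfers, which is exactly the data movement FlashAbacus removes.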
Software Development
• Fuse flash in a multi-core system
• Parallel kernel execution

Fuse flash in a multi-core system
Data access model
An LWP reaches data through its L2 cache, internal DRAM, and flash (paths a-c in the figure).
Open questions: how can LWPs access storage without an OS, and how is the storage managed?
Flash Virtualization
Flashvisor (no OS/FS needed):
• Directly exposes the flash address space to LWPs.
• Maps the flash address space to internal DRAM.
Storage access management:
• Maintains a simplified page mapping table.
• Translates LBAs to physical page numbers (PPNs).
Protection and access control:
• Maintains a range lock for parallel data access (see the sketch below).
Storengine: manages flash background tasks such as garbage collection and log dumping.
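A minimal sketch of what the mapping table and range lock could look like, assuming a flat LBA-to-PPN array and a small fixed set of lockable ranges; all names, sizes, and the slot-based design are illustrative assumptions, not the paper's data structures.

```c
#include <stdint.h>
#include <pthread.h>

#define NUM_LBAS   (1u << 20)          /* assumed capacity: 1M logical pages */
#define MAX_RANGES 8                   /* assumed max in-flight locked ranges */

static uint32_t page_table[NUM_LBAS];  /* simplified LBA -> PPN map */

/* Held ranges; an entry with len == 0 is free. */
static struct { uint32_t start, len; } held[MAX_RANGES];
static pthread_mutex_t lock_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Grab [start, start+len) if it overlaps no currently held range. */
int range_lock_try(uint32_t start, uint32_t len)
{
    int slot = -1, ok = 1;
    pthread_mutex_lock(&lock_mtx);
    for (int i = 0; i < MAX_RANGES; i++) {
        if (held[i].len == 0) { if (slot < 0) slot = i; continue; }
        if (start < held[i].start + held[i].len &&
            held[i].start < start + len) {     /* ranges overlap */
            ok = 0;
            break;
        }
    }
    if (ok && slot >= 0) { held[slot].start = start; held[slot].len = len; }
    else ok = 0;
    pthread_mutex_unlock(&lock_mtx);
    return ok;
}

void range_unlock(uint32_t start)
{
    pthread_mutex_lock(&lock_mtx);
    for (int i = 0; i < MAX_RANGES; i++)
        if (held[i].len && held[i].start == start)
            held[i].len = 0;
    pthread_mutex_unlock(&lock_mtx);
}

/* With the range held, translation is a direct table lookup. */
uint32_t lba_to_ppn(uint32_t lba) { return page_table[lba]; }
```

Non-overlapping ranges can be held by different LWPs simultaneously, which is what enables the parallel data access the slide mentions.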
Flash Virtualization: read path
1. A kernel sends a read message to Flashvisor.
2. Flashvisor performs a lock inquiry against the range lock.
3. Flashvisor looks up the page table in the scratchpad.
4. The I/O request is issued to the FPGA flash controllers.
5. The data is DMA'd into the LPDDR3 shared memory.
6. The kernel reads the result.
Page table layout: entries are indexed by channel # and package #, and each page group records its start page. Address translation searches the start pages through an RB tree to turn a logical address into a physical address, as sketched below.
Flash Virtualization: write path
1. The kernel writes its data into the LPDDR3 shared memory.
2. The kernel sends a write message to Flashvisor.
3. Flashvisor performs a lock inquiry against the range lock.
4. If space is short, Storengine reclaims a block (garbage collection).
5. The I/O is issued and the data is DMA'd into flash.
6. Flashvisor updates the page table in the scratchpad; Storengine snapshots the page table for recovery.
A sketch of steps 4-6 follows.
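A hedged sketch of the Flashvisor/Storengine handoff on the write path: allocation falls back to garbage collection when no free pages remain, then the mapping is updated. The toy cursor allocator and one-shot GC are assumptions standing in for Storengine's real policies.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_PAGES 1024u

static uint32_t next_free = 0;             /* toy free-page cursor        */
static uint32_t page_table[NUM_PAGES];     /* LPN -> PPN                  */

static bool free_pages_available(void) { return next_free < NUM_PAGES; }

/* Stand-in for Storengine: reclaim a victim block.                      */
static void storengine_gc(void)
{
    next_free = 0;   /* toy reclaim; real GC migrates valid pages first  */
}

/* Stand-in for the FPGA controller path: program + DMA from LPDDR3.     */
static void flash_program(uint32_t ppn, const void *buf)
{
    (void)ppn; (void)buf;
}

/* Steps 4-6 of the write path, after the range lock is granted.         */
void handle_write(uint32_t lpn, const void *buf)
{
    if (!free_pages_available())
        storengine_gc();                   /* step 4: reclaim a block     */
    uint32_t ppn = next_free++;            /* out-of-place allocation     */
    flash_program(ppn, buf);               /* step 5: I/O + DMA           */
    page_table[lpn] = ppn;                 /* step 6: page table update   */
    /* Storengine periodically snapshots the page table (log dumping).   */
}
```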
Software Development
• Fuse flash in a multi-core system
• Parallel kernel execution
Parallel Kernel Execution
FlashAbacus runs multiple applications concurrently: the host offloads App1(), App2(), and App3(); inside the accelerator, Flashvisor handles address management while each application's kernels (kernel 0, 1, 2) execute in parallel across the LWPs, on top of the FPGA flash storage and LPDDR3.
Parallel execution model: a conventional accelerator needs a master thread, OS thread management, and host-accelerator communication; in FlashAbacus, Flashvisor schedules everything with no host intervention.
Coarse-granule Scheduling
Inter-kernel static scheduling (InterSt):
• Binds a user application to a specific LWP.
Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels onto idle LWPs.
[Figure: App0 submits k0/k1 and App2 submits k2/k3 at T0. InterSt runs each application's kernels back-to-back on its bound LWP; InterDy spreads k0-k3 across the idle LWPs, shortening k1's and k3's latency ("SAVED"). A dispatcher sketch follows.]
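A hedged sketch of InterDy-style dispatch: a FIFO of pending kernels drained onto whichever LWP is idle (InterSt would instead pin each application's queue to one fixed LWP). The queue size and LWP model are illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_LWPS 4
#define QLEN     64

typedef void (*kernel_fn)(void);

static kernel_fn pending[QLEN];            /* FIFO of offloaded kernels */
static size_t    head = 0, tail = 0;
static bool      lwp_idle[NUM_LWPS] = { true, true, true, true };

/* A host offload lands here (via Flashvisor). */
void submit_kernel(kernel_fn k)
{
    pending[tail % QLEN] = k;
    tail++;
}

/* InterDy: whenever an LWP is idle, hand it the next pending kernel. */
void dispatch(void)
{
    for (int i = 0; i < NUM_LWPS && head != tail; i++) {
        if (!lwp_idle[i])
            continue;
        kernel_fn k = pending[head % QLEN];
        head++;
        lwp_idle[i] = false;
        k();                  /* simulated: run to completion on LWP i */
        lwp_idle[i] = true;
    }
}
```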
Fine-granule Scheduling
Partition each kernel into microblocks. An example of FDTD-2D (loop bodies elided as on the slide; a C rendering follows):

Microblock 0:
  FOR j = 0..3
    ey[0][j] = _fict_[0]
  ENDFOR

Microblock 1:
  FOR i = 0..3
    FOR j = 1..3
      ey[i][j] = ...
    ENDFOR
  ENDFOR

Microblock 2:
  FOR i = 0..3
    FOR j = 0..3
      ...
    ENDFOR
  ENDFOR
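The same partition rendered as C, one function per microblock so each can be handed to a different LWP. N = 4 mirrors the slide's 0..3 ranges, and the update bodies the slide elides stay elided here.

```c
#define N 4

static double _fict_[N];
static double ey[N][N];

/* Microblock 0: boundary-row initialization. */
void microblock0(void)
{
    for (int j = 0; j < N; j++)
        ey[0][j] = _fict_[0];
}

/* Microblock 1: interior update (body elided on the slide). */
void microblock1(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 1; j < N; j++)
            ;   /* ey[i][j] = ... */
}

/* Microblock 2: second loop nest (body elided on the slide). */
void microblock2(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            ;   /* ... */
}
```

Note that microblock 1 touches the ey rows that microblock 0 initializes; this is exactly the kind of dependency an out-of-order scheduler must respect.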
Fine-granule Scheduling
Intra-kernel out-of-order scheduling (IntraO3):
• Schedules microblocks from all kernels across the LWPs.
• Pros: maximizes core utilization.
• Cons: must ensure that concurrently running microblocks have no dependency.
[Figure: microblocks (1, 2, a, b) of k0-k3 arrive at T0 and fill all four LWPs out of order, saving latency for three of the four kernels relative to in-order scheduling. A dependency-checked dispatch sketch follows.]
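A hedged sketch of IntraO3-style dispatch: microblocks from all kernels sit in one pool, and one is issued only if it conflicts with nothing in flight. The bitmask encoding of touched data regions is an illustrative assumption, not the paper's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    void   (*fn)(void);       /* the microblock body                    */
    uint32_t reads, writes;   /* bitmask of data regions touched        */
    bool     done, running;
} mblock_t;

/* True if mb can start: no RAW/WAR/WAW overlap with running blocks. */
static bool independent(const mblock_t *pool, size_t n, const mblock_t *mb)
{
    for (size_t i = 0; i < n; i++) {
        const mblock_t *o = &pool[i];
        if (!o->running)
            continue;
        if ((mb->writes & (o->reads | o->writes)) ||  /* WAR / WAW */
            (mb->reads  &  o->writes))                /* RAW       */
            return false;
    }
    return true;
}

/* Pick any ready, independent microblock for an idle LWP (out of order). */
mblock_t *pick_next(mblock_t *pool, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!pool[i].done && !pool[i].running &&
            independent(pool, n, &pool[i]))
            return &pool[i];
    return NULL;   /* nothing safe to issue right now */
}
```

An in-order variant (IntraIo, described in the backup slides) would instead only consider microblocks of the oldest unfinished kernel, which simplifies the check at the cost of utilization.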
Experiment Setup
System configuration:
• Host: Xeon 2620-v3
• LWPs: 8 @ 1GHz
• SSD access latency: read 25us, write 800us
• Workloads: Polybench benchmark suite
Accelerator configurations:
• SIMD: OpenMP-based, with discrete storage and accelerator;
• InterSt: FlashAbacus with static inter-kernel scheduling;
• InterDy: FlashAbacus with dynamic inter-kernel scheduling;
• IntraO3: FlashAbacus with out-of-order intra-kernel scheduling.
Evaluation: time series analysis
Storage access: IntraO3 has shorter storage access time than SIMD, as it eliminates the data movement overhead.
Compute: IntraO3 also has shorter compute time, because dynamic scheduling improves core utilization.
Evaluation: energy
FlashAbacus drastically reduces the energy of data movement.
Thank you
Backup
Performance Evolution in Computing
• Single-core era: constrained by power and design complexity.
• Multi-core era: constrained by power and scalability.
• Heterogeneous-system era: enabled by data parallelism and high-performance accelerators (Intel Xeon Phi, GPGPU).
Challenge 2: data movement
Storage access accounts for a large fraction of total execution time.
Parallel Kernel Execution: manage kernel scheduling to maximize the execution throughput of all LWPs.
[Figure: single application — the host offloads App(); Flashvisor handles address management and spreads kernel 0..n across the LWPs, on top of the FPGA flash storage and LPDDR3.]
Single application vs. multiple applications:
[Figure: with multiple applications, App1()-App3() each offload kernels 0-2, which Flashvisor executes in parallel under the same address management.]
[Figure: accelerator internals, repeated — LWP0..LWPn, PCIe controller, northbridge, scratchpad, shared memory (DDR3L), and PSC, joined by tier-1/tier-2 networks to the FPGA-based flash backbone.]
Programming model: kernel offload flow
1. The host raises an interrupt to the accelerator over PCIe.
2. The kernel image is downloaded into accelerator DRAM.
3. Flashvisor, sleeping on the PSC, wakes up.
4-5. Flashvisor invokes the target LWP.
6. The LWP loads and executes the kernel.
The flow spans kernel offload, kernel scheduling, and kernel execution; a host-side sketch follows.
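A hedged host-side sketch of this flow: copy the kernel image through a memory-mapped PCIe BAR into accelerator DRAM, then ring a doorbell register to raise the interrupt Flashvisor sleeps on. The register offsets and BAR layout are hypothetical, not the device's real map.

```c
#include <stdint.h>
#include <string.h>

#define DOORBELL_OFF  0x0000u   /* hypothetical: write 1 -> host INT   */
#define IMAGE_OFF     0x1000u   /* hypothetical: kernel image window   */

void offload_kernel(volatile uint8_t *bar,   /* mmap'd PCIe BAR        */
                    const void *image, size_t len)
{
    /* Step 2: download the kernel image into accelerator DRAM. */
    memcpy((void *)(bar + IMAGE_OFF), image, len);

    /* Step 1 (doorbell): the interrupt wakes Flashvisor via the PSC;  */
    /* steps 3-6 then happen on-device: schedule, invoke, LWP load.    */
    *(volatile uint32_t *)(bar + DOORBELL_OFF) = 1;
}
```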
Coarse-granule Scheduling
Inter-kernel static scheduling (InterSt):
• Binds a user application to a specific LWP.
• Pros: fair assignment, no starvation.
• Cons: low core utilization.
Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels onto idle LWPs.
• Pros: good performance when kernels are plentiful.
• Cons: poor performance when kernels are few.
Kernel Scheduling Strategies
[Figure: inter-kernel scheduling timelines — static: k0/k1 and k2/k3 run back-to-back on their bound LWPs; dynamic: k0-k3 spread across LWP0-LWP3, saving latency for k1 and k3.]
Kernel Scheduling Strategies
Solution: partition each kernel into microblocks (see the FDTD-2D example above, split into microblocks 0-2).
Kernel Scheduling Strategies: intra-kernel scheduling (in-order)
[Figure: microblocks of k0-k3 issue in kernel order across LWP0-LWP3, saving some latency relative to coarse-granule scheduling.]
Kernel Scheduling Strategies: intra-kernel scheduling (out-of-order)
[Figure: the same microblocks issue out of order across LWP0-LWP3, saving latency for three of the four kernels.]
Fine-granule Scheduling
Intra-kernel in-order scheduling (IntraIo):
• Executes kernels serially and schedules each kernel's microblocks across all LWPs.
• Pros: reduces the complexity of microblock scheduling.
• Cons: cannot maximize core utilization.
[Figure: IntraIo timeline — microblocks spread across LWP0-LWP3, but kernels issue strictly in order.]
Evaluation: throughput
InterSt/IntraIo beat SIMD, thanks to the integration of the accelerator and NAND flash.
InterDy/IntraO3 perform better than InterSt/IntraIo, because dynamic scheduling improves core utilization.
Evaluation: energy
InterDy/IntraO3 show an SSD access energy breakdown similar to SIMD, as they access the same amount of data.
InterDy/IntraO3 consume even less computation energy than SIMD, as dynamic scheduling ensures kernels execute in parallel.
FlashAbacus drastically reduces the energy of data movement.