FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems
Jie Zhang and Myoungsoo Jung, Computer Architecture and Memory Systems Lab
Executive Summary
Traditional heterogeneous compute system:
• Long data path between accelerator and storage
• Accelerators draw high power (Intel 750 SSD: 22W, DRAM: 7W, Xeon CPU: 91W, Xeon Phi: 300W)
FlashAbacus: low power (10W processor, 6W NAND flash), with no data movement between accelerator and storage.
Major results:
• Performance: 127% better than the traditional heterogeneous system.
• Energy: 78% lower than the traditional approach.
FlashAbacus
Example: Top-500 HPC trends
Systems using coprocessors/accelerators: 18%
Accelerators are a promising solution, but they also face several challenges, beginning with power consumption.
Challenge 1: power consumption
High power consumption keeps accelerators from being adopted in low-power systems.
[Figure: power levels of 300W, 180W, and 20W]
Challenge 2: data movement overhead
Execution time breakdowns of two example workloads:
• 32% storage, 23% movement, 45% computation
• 17% storage, 64% movement, 19% computation
Challenge 2: data movement (discrete hardware)
The data path crosses the discrete hardware four times:
i) Storage media (SSD) to device memory
ii) Device to host-side DRAM
iii) Host-side DRAM to user process
iv) User process to accelerator DRAM
[Figure: each hop traverses the IO controller, northbridge, cache, and the main CPU's DRAM]
Challenge 2: data movement (discrete software stack)
Storage S/W stack: data-intensive application → I/O runtime (user space) → file system and HBA driver (kernel space) → SSD firmware (device space)
Accelerator S/W stack: data-intensive application → accelerator runtime (user space) → accelerator driver (kernel space) → accelerator firmware (device space)
Every request walks one of these two separate stacks end to end.
Challenge 3: accelerator utilization
Low-power compute systems are sensitive to serial program code.
[Figure: utilization bars of 79% and 76%]
FlashAbacus
Our solution, FlashAbacus:
i. Reduces power consumption (20W FlashAbacus vs. 300W and 180W accelerators);
ii. Eliminates redundant data copies and the long data path;
iii. Improves core utilization.
A glance at the hardware
Heterogeneous platform: a many-core host (cores, cache, northbridge, memory, IO controller) attached to a discrete SSD and a discrete accelerator with its own processor, cache, and memory.
Our platform: the accelerator and storage are fused into a single device, with processor cores placed directly beside the flash.
Inside the accelerator:
• GPDSP cores: LWP0..LWPn (lightweight processors) run kernel execution; Flashvisor and Storengine run alongside them as firmware.
• Peripheral components: PCIe controller, northbridge, scratchpad, shared memory (DDR3L), PSC.
• Flash backbone: FPGA-based flash controllers, each attached to multiple flash packages, forming the flash-based storage.
Tier-1 and tier-2 networks connect the cores, the peripherals, and the flash backbone.
Programming model
Traditional programming model, interleaving the I/O runtime and the accelerator runtime on the host:
• Prologue: fopen(), malloc(), Acc-Malloc()
• Body (loop): fread(), Acc-Memcpy(), Acc-kernel(), Acc-Memcpy(), fwrite()
• Epilogue: free(), Acc-Free(), fclose()
The prologue and epilogue are serial, and the body is serialized around data movement.
FlashAbacus programming model: the host performs kernelGen and kernelOffload; the accelerator runs kernelExe (optionally in a loop) followed by an optional dataSave. The serial host phases shrink, and the body executes in parallel on the device. A C sketch of the contrast follows.
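To make the contrast concrete, here is a minimal C sketch of the two models. The acc_* and kernel_* functions are hypothetical stand-ins for the slide's Acc-Malloc/Acc-Memcpy/Acc-kernel and kernelGen/kernelOffload/kernelExe calls, stubbed so the sketch compiles; this is an illustration under those assumptions, not the paper's actual API.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accelerator runtime, stubbed so the sketch compiles. */
static void *acc_malloc(size_t n)                         { return malloc(n); }
static void  acc_free(void *p)                            { free(p); }
static void  acc_memcpy(void *d, const void *s, size_t n) { memcpy(d, s, n); }
static void  acc_kernel(void *buf, size_t n)              { (void)buf; (void)n; }

/* Traditional model: each iteration moves data SSD -> host -> accelerator. */
void traditional(const char *in, const char *out, size_t chunk)
{
    FILE *fi = fopen(in, "rb"), *fo = fopen(out, "wb");   /* prologue */
    char *host = malloc(chunk);
    void *dev  = acc_malloc(chunk);
    size_t n;
    if (!fi || !fo) return;
    while ((n = fread(host, 1, chunk, fi)) > 0) {         /* body (loop) */
        acc_memcpy(dev, host, n);      /* host DRAM -> accelerator DRAM */
        acc_kernel(dev, n);            /* compute                       */
        acc_memcpy(host, dev, n);      /* accelerator DRAM -> host DRAM */
        fwrite(host, 1, n, fo);
    }
    acc_free(dev); free(host); fclose(fi); fclose(fo);    /* epilogue */
}

/* FlashAbacus model: data never leaves the device; the host only
 * generates and offloads a kernel. */
typedef int kernel_t;
static kernel_t kernel_gen(const char *name) { (void)name; return 0; }
static void     kernel_offload(kernel_t k)   { (void)k; } /* over PCIe */
static void     kernel_exe(kernel_t k)       { (void)k; } /* on LWPs   */

void flashabacus(void)
{
    kernel_t k = kernel_gen("fdtd2d");  /* kernelGen                        */
    kernel_offload(k);                  /* kernelOffload                    */
    kernel_exe(k);                      /* kernelExe (loop optional)        */
    /* dataSave optional: results already reside in the flash backbone.    */
}
```

Note how every iteration of the traditional body pays two acc_memcpy transfers, which is exactly the data movement FlashAbacus removes.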
Software Development
• Fuse flash in a multi-core system
• Parallel kernel execution

Fuse flash in a multi-core system
Data access model
An LWP reaches data through its L2 cache, internal DRAM, and flash (paths a-c in the figure).
Open questions: how can LWPs access storage without an OS, and how is the storage managed?
Flash Virtualization
Flashvisor (no OS/FS needed):
• Directly exposes the flash address space to LWPs.
• Maps the flash address space to internal DRAM.
Storage access management:
• Maintains a simplified page mapping table.
• Translates LBAs to physical page numbers (PPNs).
Protection and access control:
• Maintains a range lock for parallel data access (see the sketch below).
Storengine: manages flash background tasks such as garbage collection and log dumping.
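A minimal sketch of what the mapping table and range lock could look like, assuming a flat LBA-to-PPN array and a small fixed set of lockable ranges; all names, sizes, and the slot-based design are illustrative assumptions, not the paper's data structures.

```c
#include <stdint.h>
#include <pthread.h>

#define NUM_LBAS   (1u << 20)          /* assumed capacity: 1M logical pages */
#define MAX_RANGES 8                   /* assumed max in-flight locked ranges */

static uint32_t page_table[NUM_LBAS];  /* simplified LBA -> PPN map */

/* Held ranges; an entry with len == 0 is free. */
static struct { uint32_t start, len; } held[MAX_RANGES];
static pthread_mutex_t lock_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Grab [start, start+len) if it overlaps no currently held range. */
int range_lock_try(uint32_t start, uint32_t len)
{
    int slot = -1, ok = 1;
    pthread_mutex_lock(&lock_mtx);
    for (int i = 0; i < MAX_RANGES; i++) {
        if (held[i].len == 0) { if (slot < 0) slot = i; continue; }
        if (start < held[i].start + held[i].len &&
            held[i].start < start + len) {     /* ranges overlap */
            ok = 0;
            break;
        }
    }
    if (ok && slot >= 0) { held[slot].start = start; held[slot].len = len; }
    else ok = 0;
    pthread_mutex_unlock(&lock_mtx);
    return ok;
}

void range_unlock(uint32_t start)
{
    pthread_mutex_lock(&lock_mtx);
    for (int i = 0; i < MAX_RANGES; i++)
        if (held[i].len && held[i].start == start)
            held[i].len = 0;
    pthread_mutex_unlock(&lock_mtx);
}

/* With the range held, translation is a direct table lookup. */
uint32_t lba_to_ppn(uint32_t lba) { return page_table[lba]; }
```

Non-overlapping ranges can be held by different LWPs simultaneously, which is what enables the parallel data access the slide mentions.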
Flash Virtualization: read path
1. A kernel sends a read message to Flashvisor.
2. Flashvisor performs a lock inquiry against the range lock.
3. Flashvisor looks up the page table in the scratchpad.
4. The I/O request is issued to the FPGA flash controllers.
5. The data is DMA'd into the LPDDR3 shared memory.
6. The kernel reads the result.
Page table layout: entries are indexed by channel # and package #, and each page group records its start page. Address translation searches the start pages through an RB tree to turn a logical address into a physical address, as sketched below.
Flash Virtualization: write path
1. The kernel writes its data into the LPDDR3 shared memory.
2. The kernel sends a write message to Flashvisor.
3. Flashvisor performs a lock inquiry against the range lock.
4. If space is short, Storengine reclaims a block (garbage collection).
5. The I/O is issued and the data is DMA'd into flash.
6. Flashvisor updates the page table in the scratchpad; Storengine snapshots the page table for recovery.
A sketch of steps 4-6 follows.
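A hedged sketch of the Flashvisor/Storengine handoff on the write path: allocation falls back to garbage collection when no free pages remain, then the mapping is updated. The toy cursor allocator and one-shot GC are assumptions standing in for Storengine's real policies.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_PAGES 1024u

static uint32_t next_free = 0;             /* toy free-page cursor        */
static uint32_t page_table[NUM_PAGES];     /* LPN -> PPN                  */

static bool free_pages_available(void) { return next_free < NUM_PAGES; }

/* Stand-in for Storengine: reclaim a victim block.                      */
static void storengine_gc(void)
{
    next_free = 0;   /* toy reclaim; real GC migrates valid pages first  */
}

/* Stand-in for the FPGA controller path: program + DMA from LPDDR3.     */
static void flash_program(uint32_t ppn, const void *buf)
{
    (void)ppn; (void)buf;
}

/* Steps 4-6 of the write path, after the range lock is granted.         */
void handle_write(uint32_t lpn, const void *buf)
{
    if (!free_pages_available())
        storengine_gc();                   /* step 4: reclaim a block     */
    uint32_t ppn = next_free++;            /* out-of-place allocation     */
    flash_program(ppn, buf);               /* step 5: I/O + DMA           */
    page_table[lpn] = ppn;                 /* step 6: page table update   */
    /* Storengine periodically snapshots the page table (log dumping).   */
}
```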
Software Development
• Fuse flash in a multi-core system
• Parallel kernel execution
Parallel Kernel Execution
FlashAbacus runs multiple applications concurrently: the host offloads App1(), App2(), and App3(); inside the accelerator, Flashvisor handles address management while each application's kernels (kernel 0, 1, 2) execute in parallel across the LWPs, on top of the FPGA flash storage and LPDDR3.
Parallel execution model: a conventional accelerator needs a master thread, OS thread management, and host-accelerator communication; in FlashAbacus, Flashvisor schedules everything with no host intervention.
Coarse-granule Scheduling
Inter-kernel static scheduling (InterSt):
• Binds a user application to a specific LWP.
Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels onto idle LWPs.
[Figure: App0 submits k0/k1 and App2 submits k2/k3 at T0. InterSt runs each application's kernels back-to-back on its bound LWP; InterDy spreads k0-k3 across the idle LWPs, shortening k1's and k3's latency ("SAVED"). A dispatcher sketch follows.]
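A hedged sketch of InterDy-style dispatch: a FIFO of pending kernels drained onto whichever LWP is idle (InterSt would instead pin each application's queue to one fixed LWP). The queue size and LWP model are illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_LWPS 4
#define QLEN     64

typedef void (*kernel_fn)(void);

static kernel_fn pending[QLEN];            /* FIFO of offloaded kernels */
static size_t    head = 0, tail = 0;
static bool      lwp_idle[NUM_LWPS] = { true, true, true, true };

/* A host offload lands here (via Flashvisor). */
void submit_kernel(kernel_fn k)
{
    pending[tail % QLEN] = k;
    tail++;
}

/* InterDy: whenever an LWP is idle, hand it the next pending kernel. */
void dispatch(void)
{
    for (int i = 0; i < NUM_LWPS && head != tail; i++) {
        if (!lwp_idle[i])
            continue;
        kernel_fn k = pending[head % QLEN];
        head++;
        lwp_idle[i] = false;
        k();                  /* simulated: run to completion on LWP i */
        lwp_idle[i] = true;
    }
}
```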
Fine-granule Scheduling
Partition each kernel into microblocks. An example of FDTD-2D (loop bodies elided as on the slide; a C rendering follows):

Microblock 0:
  FOR j = 0..3
    ey[0][j] = _fict_[0]
  ENDFOR

Microblock 1:
  FOR i = 0..3
    FOR j = 1..3
      ey[i][j] = ...
    ENDFOR
  ENDFOR

Microblock 2:
  FOR i = 0..3
    FOR j = 0..3
      ...
    ENDFOR
  ENDFOR
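The same partition rendered as C, one function per microblock so each can be handed to a different LWP. N = 4 mirrors the slide's 0..3 ranges, and the update bodies the slide elides stay elided here.

```c
#define N 4

static double _fict_[N];
static double ey[N][N];

/* Microblock 0: boundary-row initialization. */
void microblock0(void)
{
    for (int j = 0; j < N; j++)
        ey[0][j] = _fict_[0];
}

/* Microblock 1: interior update (body elided on the slide). */
void microblock1(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 1; j < N; j++)
            ;   /* ey[i][j] = ... */
}

/* Microblock 2: second loop nest (body elided on the slide). */
void microblock2(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            ;   /* ... */
}
```

Note that microblock 1 touches the ey rows that microblock 0 initializes; this is exactly the kind of dependency an out-of-order scheduler must respect.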
Fine-granule Scheduling
Intra-kernel out-of-order scheduling (IntraO3):
• Schedules microblocks from all kernels across the LWPs.
• Pros: maximizes core utilization.
• Cons: must ensure that concurrently running microblocks have no dependency.
[Figure: microblocks (1, 2, a, b) of k0-k3 arrive at T0 and fill all four LWPs out of order, saving latency for three of the four kernels relative to in-order scheduling. A dependency-checked dispatch sketch follows.]
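A hedged sketch of IntraO3-style dispatch: microblocks from all kernels sit in one pool, and one is issued only if it conflicts with nothing in flight. The bitmask encoding of touched data regions is an illustrative assumption, not the paper's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    void   (*fn)(void);       /* the microblock body                    */
    uint32_t reads, writes;   /* bitmask of data regions touched        */
    bool     done, running;
} mblock_t;

/* True if mb can start: no RAW/WAR/WAW overlap with running blocks. */
static bool independent(const mblock_t *pool, size_t n, const mblock_t *mb)
{
    for (size_t i = 0; i < n; i++) {
        const mblock_t *o = &pool[i];
        if (!o->running)
            continue;
        if ((mb->writes & (o->reads | o->writes)) ||  /* WAR / WAW */
            (mb->reads  &  o->writes))                /* RAW       */
            return false;
    }
    return true;
}

/* Pick any ready, independent microblock for an idle LWP (out of order). */
mblock_t *pick_next(mblock_t *pool, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!pool[i].done && !pool[i].running &&
            independent(pool, n, &pool[i]))
            return &pool[i];
    return NULL;   /* nothing safe to issue right now */
}
```

An in-order variant (IntraIo, described in the backup slides) would instead only consider microblocks of the oldest unfinished kernel, which simplifies the check at the cost of utilization.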
Experiment Setup
System configuration:
• Host: Xeon 2620-v3
• LWPs: 8 @ 1GHz
• SSD access latency: read 25us, write 800us
• Workloads: Polybench benchmark suite
Accelerator configurations:
• SIMD: OpenMP-based, with discrete storage and accelerator;
• InterSt: FlashAbacus with static inter-kernel scheduling;
• InterDy: FlashAbacus with dynamic inter-kernel scheduling;
• IntraO3: FlashAbacus with out-of-order intra-kernel scheduling.
Evaluation: time series analysis
Storage access: IntraO3 has shorter storage access time than SIMD, as it eliminates the data movement overhead.
Compute: IntraO3 also has shorter compute time, because dynamic scheduling improves core utilization.
Evaluation: energy
FlashAbacus drastically reduces the energy of data movement.
Thank you
Backup
Performance Evolution in Computing
• Single-core era: constrained by power and design complexity.
• Multi-core era: constrained by power and scalability.
• Heterogeneous-system era: enabled by data parallelism and high-performance accelerators (Intel Xeon Phi, GPGPU).
Challenge 2: data movement
Storage access accounts for a large fraction of total execution time.
Parallel Kernel Execution: manage kernel scheduling to maximize the execution throughput of all LWPs.
[Figure: single application — the host offloads App(); Flashvisor handles address management and spreads kernel 0..n across the LWPs, on top of the FPGA flash storage and LPDDR3.]
Single application vs. multiple applications:
[Figure: with multiple applications, App1()-App3() each offload kernels 0-2, which Flashvisor executes in parallel under the same address management.]
[Figure: accelerator internals, repeated — LWP0..LWPn, PCIe controller, northbridge, scratchpad, shared memory (DDR3L), and PSC, joined by tier-1/tier-2 networks to the FPGA-based flash backbone.]
Programming model: kernel offload flow
1. The host raises an interrupt to the accelerator over PCIe.
2. The kernel image is downloaded into accelerator DRAM.
3. Flashvisor, sleeping on the PSC, wakes up.
4-5. Flashvisor invokes the target LWP.
6. The LWP loads and executes the kernel.
The flow spans kernel offload, kernel scheduling, and kernel execution; a host-side sketch follows.
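A hedged host-side sketch of this flow: copy the kernel image through a memory-mapped PCIe BAR into accelerator DRAM, then ring a doorbell register to raise the interrupt Flashvisor sleeps on. The register offsets and BAR layout are hypothetical, not the device's real map.

```c
#include <stdint.h>
#include <string.h>

#define DOORBELL_OFF  0x0000u   /* hypothetical: write 1 -> host INT   */
#define IMAGE_OFF     0x1000u   /* hypothetical: kernel image window   */

void offload_kernel(volatile uint8_t *bar,   /* mmap'd PCIe BAR        */
                    const void *image, size_t len)
{
    /* Step 2: download the kernel image into accelerator DRAM. */
    memcpy((void *)(bar + IMAGE_OFF), image, len);

    /* Step 1 (doorbell): the interrupt wakes Flashvisor via the PSC;  */
    /* steps 3-6 then happen on-device: schedule, invoke, LWP load.    */
    *(volatile uint32_t *)(bar + DOORBELL_OFF) = 1;
}
```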
Coarse-granule Scheduling
Inter-kernel static scheduling (InterSt):
• Binds a user application to a specific LWP.
• Pros: fair assignment, no starvation.
• Cons: low core utilization.
Inter-kernel dynamic scheduling (InterDy):
• Flashvisor schedules kernels onto idle LWPs.
• Pros: good performance when kernels are plentiful.
• Cons: poor performance when kernels are few.
Kernel Scheduling Strategies
[Figure: inter-kernel scheduling timelines — static: k0/k1 and k2/k3 run back-to-back on their bound LWPs; dynamic: k0-k3 spread across LWP0-LWP3, saving latency for k1 and k3.]
Kernel Scheduling Strategies
Solution: partition each kernel into microblocks (see the FDTD-2D example above, split into microblocks 0-2).
Kernel Scheduling Strategies: intra-kernel scheduling (in-order)
[Figure: microblocks of k0-k3 issue in kernel order across LWP0-LWP3, saving some latency relative to coarse-granule scheduling.]
Kernel Scheduling Strategies: intra-kernel scheduling (out-of-order)
[Figure: the same microblocks issue out of order across LWP0-LWP3, saving latency for three of the four kernels.]
Fine-granule Scheduling
Intra-kernel in-order scheduling (IntraIo):
• Executes kernels serially and schedules each kernel's microblocks across all LWPs.
• Pros: reduces the complexity of microblock scheduling.
• Cons: cannot maximize core utilization.
[Figure: IntraIo timeline — microblocks spread across LWP0-LWP3, but kernels issue strictly in order.]
Evaluation: throughput
InterSt/IntraIo beat SIMD, thanks to the integration of the accelerator and NAND flash.
InterDy/IntraO3 perform better than InterSt/IntraIo, because dynamic scheduling improves core utilization.
Evaluation: energy
InterDy/IntraO3 show an SSD access energy breakdown similar to SIMD, as they access the same amount of data.
InterDy/IntraO3 consume even less computation energy than SIMD, as dynamic scheduling ensures kernels execute in parallel.
FlashAbacus drastically reduces the energy of data movement.