Scaling Datacenter Accelerators With Compute-Reuse Architectures
Adi Fuchs and David Wentzlaff
ISCA 2018, Session 5A — June 5, 2018, Los Angeles, CA
Sources: "Cramming more components onto integrated circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016
Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; "Cloud TPU", Google, https://cloud.google.com/tpu/; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; "NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/
Transistor scaling stops. Chip specialization runs out of steam.
What’s Next?
Observation I: The Density of Emerging Memories is Projected to Increase
ITRS Logic Roadmap
Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Temporal locality introduces redundancy in video encoders (recurrent blocks in white)
t=0 sec: 0% recurrence | t=2 sec: 38% recurrence | t=4 sec: 61% recurrence
▪ Search term commonality retrieves similar content
Example queries: "intercontinental downtown los angeles" / "hotel in downtown los angeles near intercontinental"
Source: Google
▪ Power laws suggest highly recurrent processing of popular content
Source: Twitter
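Assuming a Zipf-style popularity distribution (all parameters here are hypothetical, chosen only for illustration), the power-law claim can be sanity-checked in simulation: a bounded reuse table covering a small fraction of distinct inputs captures a large majority of the request stream.

```python
import random
from collections import Counter

def zipf_hit_rate(n_items=100_000, n_requests=200_000,
                  table_size=10_000, alpha=1.1, seed=0):
    """Simulate a Zipf-distributed request stream and report the
    fraction of requests that would hit a reuse table holding only
    the table_size most popular items (an idealized upper bound)."""
    rng = random.Random(seed)
    # Zipf weights: the item of rank r gets weight 1 / r**alpha.
    weights = [1.0 / (r ** alpha) for r in range(1, n_items + 1)]
    requests = rng.choices(range(n_items), weights=weights, k=n_requests)
    # Idealized table: the observed top table_size items.
    top = {item for item, _ in Counter(requests).most_common(table_size)}
    return sum(1 for r in requests if r in top) / n_requests

rate = zipf_hit_rate()
# A table covering 10% of distinct items captures well over
# half of this heavily skewed request stream.
assert rate > 0.5
```

The exact hit rate depends on the skew exponent and table size, but the qualitative point matches the slide: popular content dominates the stream, so recurring computations dominate the work.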
Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.
COREx: Compute-Reuse Architecture For Accelerators
[Diagram: host processors connect through a shared LLC / NoC to the acceleration fabric (DMA engine, scratchpad memory, accelerator core). An input lookup checks the compute-reuse storage; on a hit, the fetched result is used in place of the core result, and on a miss the accelerator core computes the output from the input.]
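The hit/miss flow above can be sketched in software terms. This is a minimal illustration of the memoization idea, not the hardware design; the class and names are hypothetical.

```python
import hashlib

class ComputeReuseStore:
    """Toy model of the compute-reuse path: hash the input block,
    check the reuse table, and either return the stored output (hit)
    or run the accelerated kernel and record the result (miss)."""
    def __init__(self, kernel):
        self.kernel = kernel          # the accelerated function
        self.table = {}               # stands in for the reuse storage
        self.hits = 0
        self.misses = 0

    def run(self, input_block: bytes) -> bytes:
        key = hashlib.sha256(input_block).digest()  # hash the input
        if key in self.table:                       # lookup
            self.hits += 1
            return self.table[key]                  # fetched result
        self.misses += 1
        output = self.kernel(input_block)           # accelerator core
        self.table[key] = output                    # record for reuse
        return output

# Example kernel: reverse the bytes of a block.
store = ComputeReuseStore(lambda b: bytes(reversed(b)))
store.run(b"frame-0")   # miss: computed and stored
store.run(b"frame-0")   # hit: served from the table
```

A hash-keyed table is used here for brevity; hashing also previews the cost argument made later in the deck, since comparing fixed-size digests is much cheaper than comparing full input blocks.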
Architectural Guidelines
▪ Accelerator Memoization is Natural
o Little or no additional programming effort
o Built-in input-compute-output flow
▪ But Not Straightforward!
o High lookup costs
o Unnecessary accesses
o High access costs
▪ COREx Key Ideas:
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)
[Diagram: accelerator core with specialized compute lanes, scratchpad, and DMA engine alongside a general-purpose CMP and shared LLC; the input-compute-output loop is highlighted.]
Goal: Extend Specialization with Workload-Specific Memoization
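As a software analogy for the "lookup filtering" idea (this is an illustrative sketch, not the paper's hardware mechanism): a small, cheap membership filter can rule out inputs that were never memoized, so the large, slow reuse table is only accessed when a hit is at least possible.

```python
import hashlib

class BloomFilter:
    """Cheap probabilistic filter: may_contain() never returns a false
    negative, so a False answer lets us skip the expensive table access."""
    def __init__(self, size_bits=1 << 16, n_hashes=3):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive n_hashes independent bit positions from the key.
        for i in range(self.n_hashes):
            h = hashlib.blake2b(bytes([i]) + key, digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

f = BloomFilter()
f.add(b"seen-input")
assert f.may_contain(b"seen-input")   # definite: no false negatives
# For inputs never added, may_contain() is almost always False,
# letting the lookup skip the large table entirely.
```

Whether COREx uses exactly this structure is not stated on the slide; the point is only that a compact front-end check avoids most unnecessary accesses to the big reuse storage.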
Top Level Architecture
▪ New Modules:
o Input Hashing Unit (IHU)
o Input Lookup Unit (ILU)
o Computation History Table (CHT)
[Diagram: the accelerator core (specialized compute lanes, scratchpad, DMA engine) sits beside a general-purpose CMP and shared LLC on the SoC interconnect. On the COREx interconnect, the IHU produces input hashes; the ILU (an associative cache with its own cache controller) matches the input; on a match, the CHT (a RAM-array table with its own controller, built from memory chips with functional blocks, datapath, and control) fetches the stored output for use instead of recomputation.]
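In software terms, the ILU/CHT split resembles keeping a small index of input hashes that points into a large output store, so the big structure is only touched after a hash match. This is a rough analogy with hypothetical names, not the hardware design.

```python
import hashlib

class TwoLevelMemoStore:
    """Software analogy of the ILU/CHT split: a small dict of input
    hashes (the 'ILU') maps to slots in a large output list (the
    'CHT'); the large store is only read after a hash match."""
    def __init__(self):
        self.ilu = {}   # input-hash -> CHT slot (small, fast)
        self.cht = []   # stored outputs (large, dense)

    def lookup(self, input_block: bytes):
        key = hashlib.sha256(input_block).digest()   # IHU: hash input
        slot = self.ilu.get(key)                     # ILU: match
        if slot is None:
            return None                              # miss: core computes
        return self.cht[slot]                        # CHT: fetch output

    def insert(self, input_block: bytes, output):
        key = hashlib.sha256(input_block).digest()
        self.ilu[key] = len(self.cht)
        self.cht.append(output)

s = TwoLevelMemoStore()
s.insert(b"block-A", b"encoded-A")
assert s.lookup(b"block-A") == b"encoded-A"   # hit: reuse stored output
assert s.lookup(b"block-B") is None           # miss
```

The split matters because the slides size the two structures very differently (KB-scale ILU, MB-to-GB-scale CHT): the small, frequently probed index can live in fast memory while the bulky outputs sit in dense emerging memory.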
IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.
Building COREx
Case Study: Acceleration of Video Motion Estimation
▪ Optimization Goals:
o Runtime, Energy, and Energy-Delay Product (EDP)
▪ Baseline: highly tuned accelerators
o Sweep the space of design alternatives (Aladdin)
o Find the optimal accelerator design for each goal
Runtime OPT: 5.8 µs | Energy OPT: 6.2 µJ | EDP OPT: 148.7 pJ·s
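Energy-delay product is simply the product of the two metrics, which is why it rewards balanced designs. A tiny sketch with hypothetical design points (these are not the optima above, which each come from a different accelerator configuration):

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better. It penalizes designs
    that save energy only by running much slower, and vice versa."""
    return energy_joules * delay_seconds

# Hypothetical design points, for illustration only.
fast = edp(10e-6, 5e-6)    # 10 uJ in 5 us  -> 5.0e-11 J*s = 50 pJ*s
frugal = edp(6e-6, 12e-6)  # 6 uJ in 12 us  -> 7.2e-11 J*s = 72 pJ*s
assert fast < frugal       # here the faster design wins on EDP
```

This is why the deck sweeps three separate objectives: the design that minimizes runtime, the one that minimizes energy, and the one that minimizes EDP are generally three different points in the design space.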
Building COREx
▪ Memoization-Layer Specialization
o Extract input traces; examine hit and miss rates for different ILU/CHT sizes.
o Integrate accelerators with an emerging-memory-based ILU+CHT, and sweep the gains space.
▪ Example: Resistive-RAM-based COREx
o Energy Optimization: 56.6% energy saved (64KB ILU, 8MB CHT)
o EDP Optimization: 63.5% EDP saved (512KB ILU, 2GB CHT)
o Runtime Optimization: 2.7x speedup (512KB ILU, 32GB CHT)
Experimental Setup
Workloads:
Kernel | Domain | Use-Case | App Source | Input Source and Description
DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench Snappy-C | Wikipedia Abstracts. 13 million search queries.
SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets. 10 million Zipfian transactions.
BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing. 10 million Zipfian transactions.
RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize. 10 million Zipfian transactions.
Redundancy types captured: Temporal Redundancy; Search Commonality; Content Popularity (75%, 90%, 95% Recurrence).
Methodology
o Evaluate the ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack memory (Destiny)
o Integrate with highly tuned accelerators (Aladdin)
Results
▪ Runtime-OPT: Avg. 6.0-6.4x Speedup
o Negligible differences between memories
▪ EDP-OPT: Avg. 50%-68% Savings
o PCM/Racetrack: high write energy
o Gains are lower for low-bias apps (frequent updates)
▪ Energy-OPT: Avg. 22%-50% Savings
o PCM is unbeneficial for 75%-bias SSSP/RBM
▪ General Trends:
o Large CHTs (MBs-TBs) for speedup; smaller (KBs-GBs) for EDP; smallest (KBs-MBs) for energy
Conclusions
▪ Memoization is Fit for Accelerators
o Memoization-ready programming environment and interface
▪ Memoization is Fit for Datacenters
o Temporal redundancy, search commonality, content popularity
▪ COREx Extends Hardware Specialization
o Memoization-layer specialization tailored to the workload
▪ COREx Opens New Opportunities for Future Architectures
o Shift compute from non-scaling CMOS to still-scaling memories
Adi Fuchs ([email protected])  David Wentzlaff ([email protected])