Scaling Datacenter Accelerators With Compute-Reuse Architectures
Adi Fuchs and David Wentzlaff
ISCA 2018, Session 5A — June 5, 2018, Los Angeles, CA
Sources: "Cramming more components onto integrated circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016
Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; "Cloud TPU", Google, https://cloud.google.com/tpu/; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; "NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/
Transistor scaling stops. Chip specialization runs out of steam.
What’s Next?
Observation I: The Density of Emerging Memories is Projected to Increase
ITRS Logic Roadmap
Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Temporal locality introduces redundancy in video encoders (recurrent blocks in white)
t=0 sec: 0% recurrence | t=2 sec: 38% recurrence | t=4 sec: 61% recurrence
▪ Search term commonality retrieves similar content
Example queries: "intercontinental downtown los angeles" / "hotel in downtown los angeles near intercontinental"
Source: Google
▪ Power laws suggest highly recurrent processing of popular content
Source: Twitter
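Assuming a Zipf-style popularity distribution (all parameters here are hypothetical, chosen only for illustration), the power-law claim can be sanity-checked in simulation: a bounded reuse table covering a small fraction of distinct inputs captures a large majority of the request stream.

```python
import random
from collections import Counter

def zipf_hit_rate(n_items=100_000, n_requests=200_000,
                  table_size=10_000, alpha=1.1, seed=0):
    """Simulate a Zipf-distributed request stream and report the
    fraction of requests that would hit a reuse table holding only
    the table_size most popular items (an idealized upper bound)."""
    rng = random.Random(seed)
    # Zipf weights: the item of rank r gets weight 1 / r**alpha.
    weights = [1.0 / (r ** alpha) for r in range(1, n_items + 1)]
    requests = rng.choices(range(n_items), weights=weights, k=n_requests)
    # Idealized table: the observed top table_size items.
    top = {item for item, _ in Counter(requests).most_common(table_size)}
    return sum(1 for r in requests if r in top) / n_requests

rate = zipf_hit_rate()
# A table covering 10% of distinct items captures well over
# half of this heavily skewed request stream.
assert rate > 0.5
```

The exact hit rate depends on the skew exponent and table size, but the qualitative point matches the slide: popular content dominates the stream, so recurring computations dominate the work.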
Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.
COREx: Compute-Reuse Architecture For Accelerators
[Diagram: host processors connect through a shared LLC / NoC to the acceleration fabric (DMA engine, scratchpad memory, accelerator core). An input lookup checks the compute-reuse storage; on a hit, the fetched result is used in place of the core result, and on a miss the accelerator core computes the output from the input.]
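The hit/miss flow above can be sketched in software terms. This is a minimal illustration of the memoization idea, not the hardware design; the class and names are hypothetical.

```python
import hashlib

class ComputeReuseStore:
    """Toy model of the compute-reuse path: hash the input block,
    check the reuse table, and either return the stored output (hit)
    or run the accelerated kernel and record the result (miss)."""
    def __init__(self, kernel):
        self.kernel = kernel          # the accelerated function
        self.table = {}               # stands in for the reuse storage
        self.hits = 0
        self.misses = 0

    def run(self, input_block: bytes) -> bytes:
        key = hashlib.sha256(input_block).digest()  # hash the input
        if key in self.table:                       # lookup
            self.hits += 1
            return self.table[key]                  # fetched result
        self.misses += 1
        output = self.kernel(input_block)           # accelerator core
        self.table[key] = output                    # record for reuse
        return output

# Example kernel: reverse the bytes of a block.
store = ComputeReuseStore(lambda b: bytes(reversed(b)))
store.run(b"frame-0")   # miss: computed and stored
store.run(b"frame-0")   # hit: served from the table
```

A hash-keyed table is used here for brevity; hashing also previews the cost argument made later in the deck, since comparing fixed-size digests is much cheaper than comparing full input blocks.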
Architectural Guidelines
▪ Accelerator Memoization is Natural
o Little or no additional programming effort
o Built-in input-compute-output flow
▪ But Not Straightforward!
o High lookup costs
o Unnecessary accesses
o High access costs
▪ COREx Key Ideas:
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)
[Diagram: accelerator core with specialized compute lanes, scratchpad, and DMA engine alongside a general-purpose CMP and shared LLC; the input-compute-output loop is highlighted.]
Goal: Extend Specialization with Workload-Specific Memoization
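As a software analogy for the "lookup filtering" idea (this is an illustrative sketch, not the paper's hardware mechanism): a small, cheap membership filter can rule out inputs that were never memoized, so the large, slow reuse table is only accessed when a hit is at least possible.

```python
import hashlib

class BloomFilter:
    """Cheap probabilistic filter: may_contain() never returns a false
    negative, so a False answer lets us skip the expensive table access."""
    def __init__(self, size_bits=1 << 16, n_hashes=3):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive n_hashes independent bit positions from the key.
        for i in range(self.n_hashes):
            h = hashlib.blake2b(bytes([i]) + key, digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

f = BloomFilter()
f.add(b"seen-input")
assert f.may_contain(b"seen-input")   # definite: no false negatives
# For inputs never added, may_contain() is almost always False,
# letting the lookup skip the large table entirely.
```

Whether COREx uses exactly this structure is not stated on the slide; the point is only that a compact front-end check avoids most unnecessary accesses to the big reuse storage.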
Top Level Architecture
▪ New Modules:
o Input Hashing Unit (IHU)
o Input Lookup Unit (ILU)
o Computation History Table (CHT)
[Diagram: the accelerator core (specialized compute lanes, scratchpad, DMA engine) sits beside a general-purpose CMP and shared LLC on the SoC interconnect. On the COREx interconnect, the IHU produces input hashes; the ILU (an associative cache with its own cache controller) matches the input; on a match, the CHT (a RAM-array table with its own controller, built from memory chips with functional blocks, datapath, and control) fetches the stored output for use instead of recomputation.]
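In software terms, the ILU/CHT split resembles keeping a small index of input hashes that points into a large output store, so the big structure is only touched after a hash match. This is a rough analogy with hypothetical names, not the hardware design.

```python
import hashlib

class TwoLevelMemoStore:
    """Software analogy of the ILU/CHT split: a small dict of input
    hashes (the 'ILU') maps to slots in a large output list (the
    'CHT'); the large store is only read after a hash match."""
    def __init__(self):
        self.ilu = {}   # input-hash -> CHT slot (small, fast)
        self.cht = []   # stored outputs (large, dense)

    def lookup(self, input_block: bytes):
        key = hashlib.sha256(input_block).digest()   # IHU: hash input
        slot = self.ilu.get(key)                     # ILU: match
        if slot is None:
            return None                              # miss: core computes
        return self.cht[slot]                        # CHT: fetch output

    def insert(self, input_block: bytes, output):
        key = hashlib.sha256(input_block).digest()
        self.ilu[key] = len(self.cht)
        self.cht.append(output)

s = TwoLevelMemoStore()
s.insert(b"block-A", b"encoded-A")
assert s.lookup(b"block-A") == b"encoded-A"   # hit: reuse stored output
assert s.lookup(b"block-B") is None           # miss
```

The split matters because the slides size the two structures very differently (KB-scale ILU, MB-to-GB-scale CHT): the small, frequently probed index can live in fast memory while the bulky outputs sit in dense emerging memory.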
IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.
Building COREx
Case Study: Acceleration of Video Motion Estimation
▪ Optimization Goals:
o Runtime, Energy, and Energy-Delay Product (EDP)
▪ Baseline: highly tuned accelerators
o Sweep the space of design alternatives (Aladdin)
o Find the optimal accelerator design for each goal
Runtime OPT: 5.8 µs | Energy OPT: 6.2 µJ | EDP OPT: 148.7 pJ·s
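Energy-delay product is simply the product of the two metrics, which is why it rewards balanced designs. A tiny sketch with hypothetical design points (these are not the optima above, which each come from a different accelerator configuration):

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better. It penalizes designs
    that save energy only by running much slower, and vice versa."""
    return energy_joules * delay_seconds

# Hypothetical design points, for illustration only.
fast = edp(10e-6, 5e-6)    # 10 uJ in 5 us  -> 5.0e-11 J*s = 50 pJ*s
frugal = edp(6e-6, 12e-6)  # 6 uJ in 12 us  -> 7.2e-11 J*s = 72 pJ*s
assert fast < frugal       # here the faster design wins on EDP
```

This is why the deck sweeps three separate objectives: the design that minimizes runtime, the one that minimizes energy, and the one that minimizes EDP are generally three different points in the design space.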
Building COREx
▪ Memoization-Layer Specialization
o Extract input traces; examine hit and miss rates for different ILU/CHT sizes.
o Integrate accelerators with an emerging-memory-based ILU+CHT, and sweep the gains space.
▪ Example: Resistive-RAM-based COREx
o Energy Optimization: 56.6% energy saved (64KB ILU, 8MB CHT)
o EDP Optimization: 63.5% EDP saved (512KB ILU, 2GB CHT)
o Runtime Optimization: 2.7x speedup (512KB ILU, 32GB CHT)
Experimental Setup
Workloads:
Kernel | Domain | Use-Case | App Source | Input Source and Description
DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 videos, 10 seconds, 24 FPS.
SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench Snappy-C | Wikipedia Abstracts. 13 million search queries.
SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets. 10 million Zipfian transactions.
BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing. 10 million Zipfian transactions.
RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize. 10 million Zipfian transactions.
Redundancy types captured: Temporal Redundancy; Search Commonality; Content Popularity (75%, 90%, 95% Recurrence).
Methodology
o Evaluate the ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack memory (Destiny)
o Integrate with highly tuned accelerators (Aladdin)
Results
▪ Runtime-OPT: Avg. 6.0-6.4x Speedup
o Negligible differences between memories
▪ EDP-OPT: Avg. 50%-68% Savings
o PCM/Racetrack: high write energy
o Gains are lower for low-bias apps (frequent updates)
▪ Energy-OPT: Avg. 22%-50% Savings
o PCM is unbeneficial for 75%-bias SSSP/RBM
▪ General Trends:
o Large CHTs (MBs-TBs) for speedup; smaller (KBs-GBs) for EDP; smallest (KBs-MBs) for energy
Conclusions
▪ Memoization is Fit for Accelerators
o Memoization-ready programming environment and interface
▪ Memoization is Fit for Datacenters
o Temporal redundancy, search commonality, content popularity
▪ COREx Extends Hardware Specialization
o Memoization-layer specialization tailored to the workload
▪ COREx Opens New Opportunities for Future Architectures
o Shift compute from non-scaling CMOS to still-scaling memories
Adi Fuchs ([email protected])  David Wentzlaff ([email protected])