Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data...

33
H. Peter Hofstee, Ph.D. Distinguished Research Staff Member, IBM Austin Research Professor, Big Data Systems, TU Delft May 23, 2016 Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS )

Transcript of Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data...

Page 1: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

H. Peter Hofstee, Ph.D.Distinguished Research Staff Member, IBM Austin Research Professor, Big Data Systems, TU Delft May 23, 2016

Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS )

Page 2: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

2

Source: Sandisk IT Blog

Page 3: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

3

Page 4: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

4

Source: Sandisk IT Blog

PCIe standard

Page 5: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

5

P7 (Gen1) P7 +(Gen2) P8(Gen3) P9(Gen4)

2017

Page 6: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

6

Source: Sandisk IT Blog

PCIe bw/lanein POWER systems

Page 7: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

7

Page 8: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

8Source: Calypso Testers, 2013

Page 9: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

Today• PCIe NVMe commoditized

• Better BW/$ than HDD !

• Typical enterprise NVMe • 3.2 GB/s seq. read

• 3.2 TB

• 800K IOP ( 4K )

• 25W

• 50-100usec

• 2U with 24 SFF NVMe • 76.8 GB/s ( 96 lanes of PCIe Gen 3 )

• all available lanes in a dual-socket POWER8

• 76.8 TB

• 19.2 Million IOPs

• 600W(max) / 24 NVMe

Memory-class bandwidth even before storage-class memory!9

Logic-Case.com ( 32 x SFF )

SuperMicro

Page 10: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

Three Ways to Attach• As memory

• Real address space

• Any accelerator is highest privilege level only

• Not attractive if latencies are high

• As I/O

• I/O space

• Offload (including DMA) to pinned memory only

• Typically involves extra copy operation and coordination w.CPU

• On the SMP bus

• Full access to all system resources

• Best support for offload and near-memory compute

• Most efficient synchronization mechanisms

• Adapters can interact with other adapters w/o any CPU cycles

10

Page 11: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

11T. Roewer, IBM, 2016 OpenPOWER Summit

Attach as Memory: Example

Page 12: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

12T. Roewer, IBM, 2016 OpenPOWER Summit

Attach as Memory: Example

Page 13: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

© 2013 International Business Machines Corporation ‹#›

Custom Hardware

Application

POWER8

CAPP

Coherence Bus

PSL

FPGA or ASIC

Customizable Hardware Application Accelerator • Specific system SW, middleware, or user

application • Written to durable interface provided by

PSL

POWER8

PCIe Gen 3 Transport for encapsulated messages

Processor Service Layer (PSL) • Present robust, durable interfaces to applications • Offload complexity / content from CAPP

Virtual Addressing • Accelerator can work with same memory addresses that the

processors use • Pointers de-referenced same as the host application • Removes OS & device driver overhead

Hardware Managed Cache Coherence • Enables the accelerator to participate in “Locks” as a normal thread

Lowers Latency over IO communication model

Attach on the SMP bus: CAPI

Page 14: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

© 2014 IBM Corporation

14

strategy ( )

CAPI Attached Flash Optimization

– Attach TMS Flash to POWER8 via CAPI coherent Attach– Issues Read/Write Commands from applications to eliminate 97% of code pathlength– Saves 20-30 cores per 1M IOPs

Pin buffers, Translate, Map DMA, Start I/O

Application

Read/Write Syscall

Interrupt, unmap, unpin,Iodone scheduling

20K instructions reduced to

<500

Disk and Adapter DD

strategy ( ) iodone ( )

FileSystem

Application

User Library

Posix Async

I/O Style API

Shared Memory Work Queue

aio_read()

aio_write()1

iodone ( )

LVM

Attach on the SMP bus: CAPI example

Page 15: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

© 2014 IBM Corporation

1549IBM Confidential

Attach on the SMP bus: CAPI example

Page 16: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

© 2014 IBM Corporation16

CASSANDRA WRITES w. Bedri SendirAttach on the SMP bus: CAPI example

Page 17: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

© 2014 IBM Corporation

14

Attach on the SMP bus: CAPI example

Page 18: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

© 2014, 2015 International Business Machines Corporation

Randy Swanberg, 2/10/16 14

CAPI Flash for RDD Cache = 4X memory reduction at equal performance

Acceleration Use Case Example:

Fast and general engine for large-scale data processing

RDMA for Spark Shuffle = 30% Better Response Time, Lower CPU Utilization

• CAPI Flash and RDMA Leveraged Transparently to Spark Applications • Coming…. HDFS CAPI FPGA Erasure Code Acceleration, CAPI FPGA Compression Acceleration, ….

Attach on the SMP bus: CAPI example

Page 19: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

19

Attach on the SMP bus: CAPI example

In-the-box version of CAPI-attached Flash( Uses standard M.2 formfactor flash )

Page 20: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

Similar Story for Networking• Fast transition from 1Gb/s to 10Gb/s to 100Gb/s

• 25Gb, 50Gb really subsets of 100Gb

• 40Gb looks less attractive ( to me )

• Partly driven by in-memory paradigms like Spark

• No longer just trying to match HDD speed

• Transition to optical ( especially across racks )

• Opens the door for much more bandwidth in future

• Coherent attach allows:

• Most offload to network adapter

• Network adapter to reach all resources ( including storage ) without touching the CPU 20

Page 21: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

21

Usual Story for Compute

Heterogeneous/

Reminder: Multi-Core Drives Memory Pressure

Page 22: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

Conclusions so far …

A LOT of pressure on bandwidth

Strong desire to be SMP-attached

So what are we doing about it?

22

Page 23: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

23

1: This is a big challenge: don’t go it alone …

OpenPowerFoundation.org

Page 24: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

24source: B. McCredie 2016 OpenPOWER summit

2: Publish a roadmap …

Page 25: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

25

3: New standards for coherent attach: CAPI

Page 26: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

26

4: Increase Bandwidth

POWER8 + NVLinkSource: NVIDIA5-12x PCIe Gen3 x16

Page 27: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

27

POWER8 + NVLINK System ( shown @ OpenPOWER Summit April 2016 )

source: B. McCredie 2016 OpenPOWER summit

Page 28: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

28source: B. McCredie 2016 OpenPOWER summit

5: New high-bandwidth standard for coherent attach: OpenCAPI (CAPI 2.0)

Page 29: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

29

Get in the game: Free resources to try this out & develop your own …

Page 30: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

30

Get in the game: SuperVessel examples

Page 31: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

31

Get in the game: Free resources to try this out & develop your own …

https://wikis.utexas.edu/display/fabric/Home

FAbRIC: FPGA Accelerator Research Infrastructure Cloud

Page 32: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

32

Register NowContest:

June 1 - Sep 1, 2016

Page 33: Reconfigurable Accelerators for Big Data and Cloud RAW 2016Reconfigurable Accelerators for Big Data and Cloud RAW 2016 ( 23rd Reconfigurable Architectures Workshop @ IPDPS ) 2 ...

33