High-Performance Reconfigurable Computing Group
Department of Electrical and Computer Engineering, University of Toronto
Why Put FPGAs in your CPU Socket?
Paul Chow
FPT 2013, December 11, 2013
What are we talking about?
1. Start with a motherboard with multiple CPU sockets
2. Plug FPGAs into some of those sockets
Achieves the minimum latency between the FPGA and the CPU, such that FPGA-CPU communication is the same as CPU-CPU communication.
Why me?
• Not many have been able to touch in-socket accelerators
• Used in the Toronto Molecular Dynamics Machine project
• Worked closely with the group at Xilinx Labs who were developing the technology in collaboration with Intel
• Disclaimer – some fact checking via Google, but some recollections based on fading memories
Benefits
• Avoids a major problem of accelerators
  – Time to move the data takes longer than doing the computation on the host
• With lower latency, finer-grain tasks can be accelerated
• Easier path to certification for data centers
  – "Just" swapping a CPU chip for an FPGA chip
• New architectures for processor interconnection and moving data onto CPUs (stay tuned for more)
Why would AMD/Intel do this?
• Make platforms more open
• Adding FPGAs lets platforms address use cases not serviced by CPUs alone
  – Not replacing CPUs, but for applications where FPGAs are needed
  – Sell more CPUs
• Will still try to displace FPGAs eventually, but learn about new requirements
  – FPGA companies still happy with the short-term business!
THE 1ST GENERATION IN-SOCKET ACCELERATORS
AMD
• Torrenza initiative (2006) promoted accelerators using HyperTransport
• HyperTransport
  – AMD's processor bus
  – Point-to-point, so scalable
  – Cache coherency
[Figure: two CPUs and two FPGAs, each with its own memory, connected point-to-point over HyperTransport]
In a HyperTransport CPU socket
In a HyperTransport HTX Socket
• Not restricted to the form factor of the CPU
• Can build a board to connect to HT
• More area to put other stuff, like memory
[Figure: two CPUs with their own memory, each connected through an HTX slot to an FPGA board with its own memory]
FPGA in an HTX socket
Intel
• Still using the Front-Side Bus
  – Not scalable
  – Intel QuickAssist Technology for accelerators
[Figure: two CPUs and two FPGAs sharing the Front-Side Bus through the MCH (FSB switch), which connects to memory]
FPGAs in an FSB socket
Inside the Intel Caneland
How it works: Cache-based communication
[Figure: X86 and FPGA, each with a cache holding a few cachelines (GPRs), connected to host RAM; the numbered arrows mark steps 1-5]
• Five steps: X86-to-FPGA data transfer (i.e., X86 initiates the communication)
  – 1) X86 writes the data to memory
  – 2) GPR request: X86 writes into the FPGA's cache address range; the content is the memory address where the data in step 1 was placed
  – 3) FPGA receives the cache update and initiates a DMA read (from where the X86 put the data in step 1)
  – 4) Data from the host's main memory is transferred to the FPGA, where it is consumed
  – 5) GPR acknowledge: FPGA writes into the X86's cache address range; the content is a 1-bit flag that toggles every time data is written, signalling that the GPR request in step 2 has been processed
• FPGA-to-X86 data transfer is similar (i.e., FPGA initiates the communication)
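As a rough illustration of the host side of this handshake, the sketch below assumes the GPR cachelines are visible to the host as memory-mapped locations; the pointers fpga_gpr_req and host_gpr_ack are hypothetical stand-ins, not the actual FSB module's driver interface.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory-mapped GPR cachelines; in a real system these
 * pointers would come from the platform driver, not from constants. */
volatile uint64_t *fpga_gpr_req;   /* cacheline in the FPGA's address range */
volatile uint8_t  *host_gpr_ack;   /* cacheline in the host's address range */

/* Send one buffer to the FPGA using the five-step protocol above. */
void send_to_fpga(const void *buf, size_t len, void *shared_region)
{
    uint8_t ack_before = *host_gpr_ack;                  /* remember the current toggle state */

    memcpy(shared_region, buf, len);                     /* step 1: place data in host memory */
    *fpga_gpr_req = (uint64_t)(uintptr_t)shared_region;  /* step 2: GPR request carrying the address */

    /* steps 3 and 4 happen on the FPGA side: it sees the cache update
     * and DMA-reads the data from host memory */

    while (*host_gpr_ack == ack_before)                  /* step 5: wait for the toggled ack flag */
        ;
}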
QPI: THE NEXT GENERATION
Intel QPI
• FSB was a bus
  – What we played with – more later
• QuickPath Interconnect (QPI)
  – Point-to-point for scalability
[Figure: CPU and FPGA sockets connected by QPI; each socket has two memory banks and its own PCIe slot]
What’s Different?
[Figure: the FSB arrangement (CPUs and FPGAs sharing the MCH/FSB switch and one memory pool) beside the QPI arrangement (CPU and FPGA sockets, each with two memory banks and a PCIe slot)]
FSB:
• Form factor limits local memory for the FPGA
• Cannot provide other I/O easily – used another layer in the stack
• Smaller FPGAs – Virtex-5
QPI:
• Two memory banks per CPU socket – the FPGA can access DIMMs, lots of local memory
• Larger FPGAs – Virtex-7
• PCIe slot per socket can be used for an I/O card
How does it work?
• Caching Agent
  – Holds the cache and uses (consumes) cache lines
• Home Agent
  – Memory controller that serves up physical address space cache lines
• CPU is both Caching Agent and Home Agent
• FPGA can have either or both, depending on requirements
Compute Acceleration
• Utilize the coherency provided by the Caching Agent
• The FPGA application accesses the same address space as the host
• Easier programming using a shared-memory model (see the sketch below)
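A minimal sketch of why the shared-memory model is convenient: the host just publishes a pointer-visible buffer and the accelerator works on it in place, with no explicit DMA staging. Here a pthread stands in for the FPGA purely as an analogy; the flag-based handoff is an assumption for illustration, not the actual QPI programming interface.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1024

static double data[N];
static atomic_int ready = 0, done = 0;

/* Stand-in for the FPGA: with a coherent shared address space the
 * accelerator can read and write the host's data structures directly. */
static void *accelerator_task(void *arg)
{
    while (!atomic_load(&ready)) ;       /* wait until the host publishes work */
    for (int i = 0; i < N; i++)
        data[i] *= 2.0;                  /* operate on the host's buffer in place */
    atomic_store(&done, 1);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, accelerator_task, NULL);

    for (int i = 0; i < N; i++) data[i] = i;  /* host produces input */
    atomic_store(&ready, 1);                  /* handoff is just a flag, no copies */

    while (!atomic_load(&done)) ;             /* wait for the "accelerator" */
    pthread_join(t, NULL);
    printf("data[10] = %f\n", data[10]);
    return 0;
}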
Custom memories
[Figure: CPU and FPGA over QPI, with the FPGA's memory banks replaced by Flash]
Bring Flash memory – or some other funky memory type or behaviour – into the QPI memory space by including a Home Agent on the FPGA.
But there’s more…
[Figure: CPU and FPGA over QPI; the FPGA's PCIe slot carries an SFP card for high-speed I/O (N x 10G) over a high-speed cable]
Utilize the PCIe slot to build I/O for the FPGA.
Streaming Data Processing
• Data streaming in via network links is filtered in the FPGA
• FPGA transfers only the important data to the CPU for further processing (see the sketch below)
  – Do not have to transfer all data to CPU memory and then have the CPU filter it
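As a rough sketch of the kind of in-line filtering the FPGA could do, an HLS-style C function that forwards only records whose value exceeds a threshold; the record layout and the threshold test are illustrative assumptions, not part of the actual platform.

#include <stdint.h>

/* Illustrative packet/record format; a real design would match the
 * actual network protocol being filtered. */
typedef struct {
    uint32_t key;
    uint32_t value;
} record_t;

/* Stream filter: scan incoming records and copy only the "important"
 * ones to the output going to the CPU. Returns the number forwarded. */
int filter_stream(const record_t *in, int n_in, record_t *out, uint32_t threshold)
{
    int n_out = 0;
    for (int i = 0; i < n_in; i++) {
        if (in[i].value > threshold)     /* keep only data the CPU needs to see */
            out[n_out++] = in[i];
    }
    return n_out;
}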
Expand QPI across systems
[Figure: four QPI CPU+FPGA platforms, each with an SFP high-speed I/O card in its PCIe slot, linked to one another through the FPGAs]
Shared memory across QPI platforms
The “Inverted Cluster”
[Figure: the same four platforms, but now the network connections terminate in the FPGAs and the CPUs sit behind them]
Network connections are made via the FPGAs, and the CPUs are slaves to the FPGAs – a lower-latency network stack.
QPI vs PCIe Gen 3
             QPI                                     PCIe Gen 3
Latency      About half of PCIe Gen 3                500 ns
             for a ~1 KB transfer
Bandwidth    7 GB/s                                  x8 = 8 GB/s
Standard     Proprietary                             Open
• Use QPI if you really need minimum latency
• Risk with QPI is the proprietary bus
• Note that Convey started with FSB and now uses PCIe Gen 3
Where are they today? (a)
• 1st generation had several attempts at developing commodity systems
  – None exist today
  – Difficult technology to build
  – No easy programming model
• Intel developed AAL (Accelerator Abstraction Layer)
  – Provides virtual memory access from the FPGA
  – Large page table managed by the host AAL driver
  – Host processes can reserve the accelerator by first loading the page table
  – Available for QPI systems
Where are they today? (b)
• Xilinx
  – Not targeting commodity sales
  – Pursuing customers interested in customized QPI
• Altera (Pactron) announced at the April IDF
  – No longer on the Pactron web site!
In an Achronix FPGA!
http://www.achronix.com/applications/hpc.html
Heterogeneous Computing
• HSA Foundation
  – Heterogeneous System Architecture
  – Building a heterogeneous compute software ecosystem built on open, royalty-free industry standards and open-source software
  – Make processing elements work together seamlessly
USING THE XILINX INTEL FSB PLATFORM – A CASE STUDY
The Accelerated Computing Platform (ACP)
The Accelerated Computing Platform
• Developed by Xilinx
• Sold through Nallatech
• Commodity platform to drive down cost
• COTS server-grade motherboard
• FPGA in Xeon socket readily available
• FSB latency and bandwidth between FPGA, Xeon and Memory
FSB Configuration Options
[Figure: block diagram of Intel's Caneland MP Xeon platform – FSB links at 8.5 GB/s (peak) into the north bridge, 21 GB/s (peak) to system memory, a 10 GB/s link to the south bridge, 2x PCIe x8 (4 GB/s) through switches to 4x PCIe x8 slots and 1x PCIe x4 slot, plus 2x PCIe x4 slots and 4x SATA. Source: Nallatech]
Supported Xeon 7300 System Platforms
• ACP M2 is targeted to Intel 7300 MP server platforms
• Design mechanically validated for Intel SKU S7000FC4UR
ACP M2: A Flexible, Modular Architecture
• M2 Compute Module
  – Supports 2 large Virtex-5 FPGAs
    • Can accommodate any FF1738-packaged parts
    • Enables up to 660K LCs per compute module
  – Design allows two (2) compute modules to be combined in a single stack if desired
    • Enables up to 1,320K LCs per CPU socket
    • Subject to socket power limits
• M2 Base Module
  – The foundation module that attaches to the 7300 platform Socket 604
  – 1066 MHz design in an FPGA!!
  – Features a Virtex-5 LX110, which configures as a persistent FSB bridge
  – Configures and feeds the compute modules under program control
ACP M2 Stack Topology
[Figure: the ACP M2 compute stack – the M2 Base Module (FSB bridge FPGA) sits on the 1,066 MHz FSB (8.5 GB/s, 105 ns) and feeds one or two M2 Compute Modules over 500 MHz DDR LVDS links (10 GB/s, 5 ns); each compute module carries two FF1738 compute FPGAs with DDR2/SRAM banks at 300 MHz DDR (2.4 GB/s each, 5 ns), plus Flash/SRAM configuration memories]
Programming Models
That is great! But how do we program this?
[Figure: an Intel quad-core Xeon (four X86 cores) and the two Virtex-5 ACP stacks (ACP0 and ACP1) all sharing the FSB through the MCH and system memory]
The Flow
[Figure: the design flow; notes on the diagram: "Also a system simulation", "HLS can do this"]
Communication Middleware
[Figure: communication middleware inside the Xilinx FPGA – each line is two FSLs (one in each direction) connecting HW engines (through HW MPI blocks), a MicroBlaze running SW MPI, and packet bridges (MPI FSB Bridge, MPI LVDS Bridge, MPI MGT Bridge) to the FSB, LVDS, and MGT interfaces]
Achieving Portability with MPI
• Portability is achieved by using a middleware abstraction layer. MPI natively provides software portability.
• Provide a hardware middleware to enable hardware portability. The MPE (Message Passing Engine) provides the portable hardware interface to be used by a hardware accelerator.
[Figure: the software-only environment (SW application / SW middleware / SW OS / host-specific hardware) beside the heterogeneous environment, where the host runs SW application / SW middleware / SW OS on host-specific hardware and the FPGA runs HW application / HW middleware / HW OS]
MPI Ring Communication Pattern

#include <mpi.h>

int main(int argc, char **argv) {
    int x, my_rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (my_rank == 0) {
        x = 1;
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&x, 1, MPI_INT, size-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (my_rank == size-1) {
        MPI_Recv(&x, 1, MPI_INT, my_rank-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        x++;
        MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&x, 1, MPI_INT, my_rank-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        x++;
        MPI_Send(&x, 1, MPI_INT, my_rank+1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

[Figure: ranks R0 through R4 arranged in a ring (MPI size = 5), each passing the value to the next]
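Assuming a standard MPI installation on the host, this program could be compiled with mpicc and launched with five ranks (e.g., mpirun -np 5 ./ring) to match the figure; the point of the portability argument is that the same source is kept when some ranks are instead realized as hardware engines through the MPE, as the following slides show.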
Mapping Ranks to Heterogeneous Computing Elements
[Figure: the same ring of ranks R0-R4, but rank R4 is now mapped to a HW engine]
Ring Communication Example
[Figure: the ring mapped onto the platform – the four X86 cores of the quad-core Xeon and the ACP0 and ACP1 FPGA stacks on the FSB, with message hops 1 through 5 marked; FPGA-FPGA communication goes through the FSB without X86 intervention]
ACP0 – M2 Base FPGA
[Figure: the ACP0 M2 base FPGA (Virtex-5 LX110) – the Xilinx FSB interface to the Intel FSB feeds an MPI FSB bridge, which connects over FSLs (each line is two FSLs, one in each direction) to a MicroBlaze with GPIO driving LEDs]
ACP1 – M2 Base FPGA
[Figure: the ACP1 M2 base FPGA (Virtex-5 LX110) – the MPI FSB bridge connects over FSLs to a MicroBlaze (GPIO/LEDs), a router-init block, and two FSL-LVDS links to/from compute FPGAs 0 and 1]
ACP1 – M2 Compute 0 and 1 FPGAs
[Figure: each ACP1 compute FPGA (Virtex-5 LX330) – FSL-LVDS links to/from the base FPGA and the other compute FPGA, a MicroBlaze with GPIO driving LEDs, and a router-init block]
PERFORMANCE TESTING
Configurations
Send round-trip messages between two MPI tasks (the black squares in the figure). The X86 side uses Xeon cores running software MPI; the FPGA side uses hardware engines (HW) with the MPE.

Δt = round_trip_time / (2 * num_samples)
Latency = Δt for a small message size
BW = message_size / Δt

Measurements here are done using only the FSB base modules.
[Figure: the four measured configurations – Xeon-Xeon, Xeon-HW, intra-FPGA HW-HW, and inter-FPGA HW-HW]
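A minimal sketch of this kind of ping-pong measurement between two MPI ranks; the message size and sample count are illustrative, and a hardware rank would of course be timed through the MPE rather than with MPI_Wtime on a Xeon.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NUM_SAMPLES 1000
#define MSG_SIZE    64            /* bytes; small message to measure latency */

int main(int argc, char **argv) {
    char buf[MSG_SIZE];
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NUM_SAMPLES; i++) {
        if (rank == 0) {          /* rank 0 sends, then waits for the echo */
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* rank 1 echoes every message back */
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double round_trip_time = MPI_Wtime() - t0;

    if (rank == 0) {
        double dt = round_trip_time / (2.0 * NUM_SAMPLES);   /* one-way time per message */
        printf("latency = %g us, bandwidth = %g MB/s\n",
               dt * 1e6, MSG_SIZE / dt / 1e6);
    }
    MPI_Finalize();
    return 0;
}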
Preliminary Performance Numbers
                                  Xeon-Xeon   Xeon-HW   HW-HW (intra-FPGA)   HW-HW (inter-FPGA)
Latency [μs] (64-byte transfer)      1.9        2.78          0.39                 3.5
Bandwidth [MB/s]                    1000        410           531                  400
On-chip network using 32-bit channels, clocked at 133 MHz; MPI uses the Rendezvous protocol.
Xilinx driver performance numbers: latency = 0.5 μs (64-byte transfer), bandwidth = 2 GB/s.
The MPI Ready protocol achieves about 1/3 of the Rendezvous latency; for Xeon-HW it is 1 μs (only 2x slower than the Xilinx driver transfer latency).
128-bit on-chip channels will quadruple the HW bandwidth (to approximately 2 GB/s) and also reduce latency.
Other performance enhancements are possible.
Performance Improvements
• Ready protocol
  – No synchronization overhead as in Rendezvous
• Tiny-message protocol
  – Lower latency for small messages (40 bytes or less)
• From a 32-bit to a 128-bit-wide data path
  – 32 bits @ 133 MHz = 532 MB/s
  – 128 bits @ 133 MHz = 2.128 GB/s
• Zero-copy transfers
  – No intermediate copy to preallocated buffers (higher BW)
Latency (Point-to-Point)
[Plot: CPU-initiated ping-pong transfers (FPGA hardware: 128 bits @ 133 MHz)]
Bandwidth
[Plot: ping-pong bandwidth test, hardware at 128 bits @ 133 MHz]
BUILDING A LARGE HPC APPLICATION
Molecular Dynamics
• Simulate motion of molecules at atomic level
• Highly compute-intensive
• Understand protein folding
• Computer-aided drug design
The TMD Machine
• The Toronto Molecular Dynamics Machine
• Use multi-FPGA system to accelerate MD
• Built using an MPI programming model
• Principal algorithm developer: Chris Madill, Ph.D. candidate (now done!) in Biochemistry
  – Writes C++ using MPI, not Verilog/VHDL
• Have used three platforms – portability
• Plus scalability and maintainability
Platform Evolution
Network of five V2Pro PCI cards (2006)
• First to integrate hardware acceleration
• Simple LJ fluids only

Network of BEE2 multi-FPGA boards (2007)
• Added electrostatic terms
• Added bonded terms
FPGA portability and design abstraction facilitated ongoing migration.
2010 – Xilinx/Nallatech ACP
Stack of 5 large Virtex-5 FPGAs + 1 FPGA for the FSB PHY interface
Quad-socket Xeon server
Origin of Computational Complexity
Bonded terms, O(n):
  U_b = \sum_i k_i (r_i - r_{0,i})^2                      (bond stretch)
  U_a = \sum_i k_i (\theta_i - \theta_{0,i})^2            (angle bend)
  U_t = \sum_i k_i [1 + \cos(n_i \phi_i - \phi_{0,i})]    (torsion)

Nonbonded terms, O(n^2):
  V(r) = 4\epsilon [ (\sigma/r)^{12} - (\sigma/r)^{6} ]                      (Lennard-Jones)
  U_{elec} = \sum_n \sum_{i=1}^{N} \sum_{j=1}^{N} q_i q_j / r_{ij,n}         (electrostatics over periodic images n)
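To make the O(n^2) cost concrete, a naive sketch of the nonbonded (Lennard-Jones) energy loop over all atom pairs; this all-pairs double loop is the work the short-range nonbond engines offload. A single ε and σ for every pair is an assumption made here for brevity.

#include <math.h>

/* Naive all-pairs Lennard-Jones energy: the double loop over atom pairs
 * is the O(n^2) part that dominates MD run time. */
double lj_energy(const double x[][3], int n, double eps, double sigma)
{
    double U = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = x[i][0] - x[j][0];
            double dy = x[i][1] - x[j][1];
            double dz = x[i][2] - x[j][2];
            double r2 = dx*dx + dy*dy + dz*dz;
            double s6 = pow(sigma*sigma / r2, 3);   /* (sigma/r)^6 */
            U += 4.0 * eps * (s6*s6 - s6);          /* 4*eps*((sigma/r)^12 - (sigma/r)^6) */
        }
    }
    return U;
}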
Typical MD Simulator
[Figure: in a typical MD simulator, each CPU_i runs a Process_i that holds its Data_i and computes every force class itself – bonded, nonbonded, and PME]
TMD Machine Architecture
[Figure: the TMD machine decomposes the simulator into communicating MPI ranks – MPI::Send(&msg, size, dest, ...) – with input, scheduler, output, and visualizer processes, several atom managers, bond engines, long-range electrostatics engines, and many short-range nonbond engines]
Target Platform for MD
[Figure: a quad-socket FSB platform (each FSB at 8.5 GB/s @ 1066 MHz, 72.5 GB/s also annotated) – one socket holds the quad-core Xeon with system memory, and the other three sockets hold FPGA stacks of short-range nonbond engines (NBE), two of them also carrying a PME FPGA with its own memory]

Initial breakdown of CPU time:
• Short-range nonbonded: 12 nonbond FPGAs, 2-3 pipelines per NBE FPGA, each running 15-30x a CPU → NBE 360-1080x
• Long-range electrostatics: 2 PME FPGAs with fast memory and fibre-optic interconnects → PME 420x
• Bonds: on the quad-core Xeon server → Bonds 1x
Performance Modeling
Problem: It is difficult to mathematically predict the expected speedup a priori due to the contentious nature of the many-to-many communications.
Solution: Measure the non-deterministic behaviour using Jumpshot on the software version and back-annotate the deterministic behaviour.
• Make use of existing tools!
Single Timestep Profile
Timestep = 108 ms (327,506 atoms)
Performance
• Significant overlap between all force calculations.
• 108.02 ms is equivalent to between 80 and 88 InfiniBand-connected cores at U of T's supercomputer, SciNet
  – 160-176 hyperthreaded cores
• Can we do better?
  – 140 with hardware bond engines – change the engine from SW to HW, no architectural change
Final Performance Equivalent for MD
                               FPGA/CPU                 Supercomputer              Scaling Factor
Space                          5U                       17.5 * 2U                  1/7
Cooling                        N/A                      Share of 735-ton chiller   ∞?
Capital Cost                   $15,000*                 $120,000                   1/8
Annual Electricity Cost        $241 (assuming 500 W)    $6,758                     1/30
Performance (Core Equivalent)  140 cores                1 * 140 cores              140x

*The current system is a prototype. Cost is based on projections for the next-generation system.
TMD Perspective
• Still comparing apples to oranges.
• Individually, the hardware engines are able to sustain calculations hundreds of times faster than traditional CPUs.
• Communication costs degrade the overall performance.
• The FPGA platform is using older CPUs and older communication links than SciNet.
• Migrating the FPGA portion to a SciNet-compatible platform will further increase the relative performance and provide a more accurate CPU/FPGA comparison.
Conclusion
• In-socket accelerators
  – Use for absolute minimum latency
  – Cache coherency for easier programming
  – Proprietary bus, so at the mercy of the vendor
  – "Exotic" technology
  – Use only if you really, really need it!
Acknowledgements
SOCRN, emSYSCAN
Questions?