High-Performance Reconfigurable Computing Group
Department of Electrical and Computer Engineering, University of Toronto
Why Put FPGAs in your CPU Socket?
Paul Chow
FPT 2013, December 11, 2013
What are we talking about?
1. Start with a motherboard with multiple CPU sockets
2. Plug FPGAs into some of those sockets
Achieves the minimum latency between the FPGA and the CPU, such that FPGA-CPU communication is the same as CPU-CPU communication.
Why me?
• Not many have been able to touch in-socket accelerators
• Used in the Toronto Molecular Dynamics Machine project
• Worked closely with the group at Xilinx Labs who were developing the technology in collaboration with Intel
• Disclaimer – some fact checking via Google, but some recollections based on fading memories
Benefits
• Avoids a major problem of accelerators
  – Time to move the data takes longer than doing the computation on the host
• With lower latency, finer-grain tasks can be accelerated
• Easier path to certification for data centers
  – "Just" swapping a CPU chip for an FPGA chip
• New architectures for processor interconnection and moving data onto CPUs (stay tuned for more)
Why would AMD/Intel do this?
• Make platforms more open
• Adding FPGAs lets platforms address use cases not serviced by CPUs alone
  – Not replacing CPUs, but for applications where FPGAs are needed
  – Sell more CPUs
• Will still try to displace FPGAs eventually, but learn about new requirements
  – FPGA companies still happy with the short-term business!
THE 1ST GENERATION IN-SOCKET ACCELERATORS
AMD
• Torrenza initiative (2006) promoted accelerators using HyperTransport
• HyperTransport
  – AMD's processor bus
  – Point-to-point, so scalable
  – Cache coherency
[Figure: two CPUs and two FPGAs, each with its own memory, connected point-to-point over HyperTransport]
In a HyperTransport CPU socket
In a HyperTransport HTX Socket
• Not restricted to the form factor of the CPU
• Can build a board to connect to HT
• More area to put other stuff, like memory
[Figure: two CPUs with their own memory, each connected through an HTX slot to an FPGA board with its own memory]
FPGA in an HTX socket
Intel
• Still using the Front-Side Bus
  – Not scalable
  – Intel QuickAssist Technology for accelerators
[Figure: two CPUs and two FPGAs sharing the Front-Side Bus through the MCH (FSB switch), which connects to memory]
FPGAs in an FSB socket
Inside the Intel Caneland
How it works: Cache-based communication
[Figure: X86 and FPGA, each with a cache holding a few cachelines (GPRs), connected to host RAM; the numbered arrows mark steps 1-5]
• Five steps: X86-to-FPGA data transfer (i.e., X86 initiates the communication)
  – 1) X86 writes the data to memory
  – 2) GPR request: X86 writes into the FPGA's cache address range; the content is the memory address where the data in step 1 was placed
  – 3) FPGA receives the cache update and initiates a DMA read (from where the X86 put the data in step 1)
  – 4) Data from the host's main memory is transferred to the FPGA, where it is consumed
  – 5) GPR acknowledge: FPGA writes into the X86's cache address range; the content is a 1-bit flag that toggles every time data is written, signalling that the GPR request in step 2 has been processed
• FPGA-to-X86 data transfer is similar (i.e., FPGA initiates the communication)
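As a rough illustration of the host side of this handshake, the sketch below assumes the GPR cachelines are visible to the host as memory-mapped locations; the pointers fpga_gpr_req and host_gpr_ack are hypothetical stand-ins, not the actual FSB module's driver interface.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory-mapped GPR cachelines; in a real system these
 * pointers would come from the platform driver, not from constants. */
volatile uint64_t *fpga_gpr_req;   /* cacheline in the FPGA's address range */
volatile uint8_t  *host_gpr_ack;   /* cacheline in the host's address range */

/* Send one buffer to the FPGA using the five-step protocol above. */
void send_to_fpga(const void *buf, size_t len, void *shared_region)
{
    uint8_t ack_before = *host_gpr_ack;                  /* remember the current toggle state */

    memcpy(shared_region, buf, len);                     /* step 1: place data in host memory */
    *fpga_gpr_req = (uint64_t)(uintptr_t)shared_region;  /* step 2: GPR request carrying the address */

    /* steps 3 and 4 happen on the FPGA side: it sees the cache update
     * and DMA-reads the data from host memory */

    while (*host_gpr_ack == ack_before)                  /* step 5: wait for the toggled ack flag */
        ;
}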
QPI: THE NEXT GENERATION
Intel QPI
• FSB was a bus
  – What we played with – more later
• QuickPath Interconnect (QPI)
  – Point-to-point for scalability
[Figure: CPU and FPGA sockets connected by QPI; each socket has two memory banks and its own PCIe slot]
What’s Different?
[Figure: the FSB arrangement (CPUs and FPGAs sharing the MCH/FSB switch and one memory pool) beside the QPI arrangement (CPU and FPGA sockets, each with two memory banks and a PCIe slot)]
FSB:
• Form factor limits local memory for the FPGA
• Cannot provide other I/O easily – used another layer in the stack
• Smaller FPGAs – Virtex-5
QPI:
• Two memory banks per CPU socket – the FPGA can access DIMMs, lots of local memory
• Larger FPGAs – Virtex-7
• PCIe slot per socket can be used for an I/O card
How does it work?
• Caching Agent
  – Holds the cache and uses (consumes) cache lines
• Home Agent
  – Memory controller that serves up physical address space cache lines
• CPU is both Caching Agent and Home Agent
• FPGA can have either or both, depending on requirements
Compute Acceleration
• Utilize the coherency provided by the Caching Agent
• The FPGA application accesses the same address space as the host
• Easier programming using a shared-memory model (see the sketch below)
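A minimal sketch of why the shared-memory model is convenient: the host just publishes a pointer-visible buffer and the accelerator works on it in place, with no explicit DMA staging. Here a pthread stands in for the FPGA purely as an analogy; the flag-based handoff is an assumption for illustration, not the actual QPI programming interface.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1024

static double data[N];
static atomic_int ready = 0, done = 0;

/* Stand-in for the FPGA: with a coherent shared address space the
 * accelerator can read and write the host's data structures directly. */
static void *accelerator_task(void *arg)
{
    while (!atomic_load(&ready)) ;       /* wait until the host publishes work */
    for (int i = 0; i < N; i++)
        data[i] *= 2.0;                  /* operate on the host's buffer in place */
    atomic_store(&done, 1);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, accelerator_task, NULL);

    for (int i = 0; i < N; i++) data[i] = i;  /* host produces input */
    atomic_store(&ready, 1);                  /* handoff is just a flag, no copies */

    while (!atomic_load(&done)) ;             /* wait for the "accelerator" */
    pthread_join(t, NULL);
    printf("data[10] = %f\n", data[10]);
    return 0;
}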
Custom memories
[Figure: CPU and FPGA over QPI, with the FPGA's memory banks replaced by Flash]
Bring Flash memory – or some other funky memory type or behaviour – into the QPI memory space by including a Home Agent on the FPGA.
But there’s more…
[Figure: CPU and FPGA over QPI; the FPGA's PCIe slot carries an SFP card for high-speed I/O (N x 10G) over a high-speed cable]
Utilize the PCIe slot to build I/O for the FPGA.
Streaming Data Processing
• Data streaming in via network links is filtered in the FPGA
• FPGA transfers only the important data to the CPU for further processing (see the sketch below)
  – Do not have to transfer all data to CPU memory and then have the CPU filter it
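As a rough sketch of the kind of in-line filtering the FPGA could do, an HLS-style C function that forwards only records whose value exceeds a threshold; the record layout and the threshold test are illustrative assumptions, not part of the actual platform.

#include <stdint.h>

/* Illustrative packet/record format; a real design would match the
 * actual network protocol being filtered. */
typedef struct {
    uint32_t key;
    uint32_t value;
} record_t;

/* Stream filter: scan incoming records and copy only the "important"
 * ones to the output going to the CPU. Returns the number forwarded. */
int filter_stream(const record_t *in, int n_in, record_t *out, uint32_t threshold)
{
    int n_out = 0;
    for (int i = 0; i < n_in; i++) {
        if (in[i].value > threshold)     /* keep only data the CPU needs to see */
            out[n_out++] = in[i];
    }
    return n_out;
}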
Expand QPI across systems
[Figure: four QPI CPU+FPGA platforms, each with an SFP high-speed I/O card in its PCIe slot, linked to one another through the FPGAs]
Shared memory across QPI platforms
The “Inverted Cluster”
[Figure: the same four platforms, but now the network connections terminate in the FPGAs and the CPUs sit behind them]
Network connections are made via the FPGAs, and the CPUs are slaves to the FPGAs – a lower-latency network stack.
QPI vs PCIe Gen 3
             QPI                                     PCIe Gen 3
Latency      About half of PCIe Gen 3                500 ns
             for a ~1 KB transfer
Bandwidth    7 GB/s                                  x8 = 8 GB/s
Standard     Proprietary                             Open
• Use QPI if you really need minimum latency
• Risk with QPI is the proprietary bus
• Note that Convey started with FSB and now uses PCIe Gen 3
Where are they today? (a)
• 1st generation had several attempts at developing commodity systems
  – None exist today
  – Difficult technology to build
  – No easy programming model
• Intel developed AAL (Accelerator Abstraction Layer)
  – Provides virtual memory access from the FPGA
  – Large page table managed by the host AAL driver
  – Host processes can reserve the accelerator by first loading the page table
  – Available for QPI systems
Where are they today? (b)
• Xilinx
  – Not targeting commodity sales
  – Pursuing customers interested in customized QPI
• Altera (Pactron) announced at the April IDF
  – No longer on the Pactron web site!
In an Achronix FPGA!
http://www.achronix.com/applications/hpc.html
Heterogeneous Computing
• HSA Foundation
  – Heterogeneous System Architecture
  – Building a heterogeneous compute software ecosystem built on open, royalty-free industry standards and open-source software
  – Make processing elements work together seamlessly
USING THE XILINX INTEL FSB PLATFORM – A CASE STUDY
The Accelerated Computing Platform (ACP)
The Accelerated Computing Platform
• Developed by Xilinx
• Sold through Nallatech
• Commodity platform to drive down cost
• COTS server-grade motherboard
• FPGA in Xeon socket readily available
• FSB latency and bandwidth between FPGA, Xeon and Memory
FSB Configuration Options
[Figure: block diagram of Intel's Caneland MP Xeon platform – FSB links at 8.5 GB/s (peak) into the north bridge, 21 GB/s (peak) to system memory, a 10 GB/s link to the south bridge, 2x PCIe x8 (4 GB/s) through switches to 4x PCIe x8 slots and 1x PCIe x4 slot, plus 2x PCIe x4 slots and 4x SATA. Source: Nallatech]
Supported Xeon 7300 System Platforms
• ACP M2 is targeted to Intel 7300 MP server platforms
• Design mechanically validated for Intel SKU S7000FC4UR
ACP M2: A Flexible, Modular Architecture
• M2 Compute Module
  – Supports 2 large Virtex-5 FPGAs
    • Can accommodate any FF1738-packaged parts
    • Enables up to 660K LCs per compute module
  – Design allows two (2) compute modules to be combined in a single stack if desired
    • Enables up to 1,320K LCs per CPU socket
    • Subject to socket power limits
• M2 Base Module
  – The foundation module that attaches to the 7300 platform Socket 604
  – 1066 MHz design in an FPGA!!
  – Features a Virtex-5 LX110, which configures as a persistent FSB bridge
  – Configures and feeds the compute modules under program control
ACP M2 Stack Topology
[Figure: the ACP M2 compute stack – the M2 Base Module (FSB bridge FPGA) sits on the 1,066 MHz FSB (8.5 GB/s, 105 ns) and feeds one or two M2 Compute Modules over 500 MHz DDR LVDS links (10 GB/s, 5 ns); each compute module carries two FF1738 compute FPGAs with DDR2/SRAM banks at 300 MHz DDR (2.4 GB/s each, 5 ns), plus Flash/SRAM configuration memories]
Programming Models
That is great! But how do we program this?
[Figure: an Intel quad-core Xeon (four X86 cores) and the two Virtex-5 ACP stacks (ACP0 and ACP1) all sharing the FSB through the MCH and system memory]
The Flow
[Figure: the design flow; notes on the diagram: "Also a system simulation", "HLS can do this"]
Communication Middleware
[Figure: communication middleware inside the Xilinx FPGA – each line is two FSLs (one in each direction) connecting HW engines (through HW MPI blocks), a MicroBlaze running SW MPI, and packet bridges (MPI FSB Bridge, MPI LVDS Bridge, MPI MGT Bridge) to the FSB, LVDS, and MGT interfaces]
Achieving Portability with MPI
• Portability is achieved by using a middleware abstraction layer. MPI natively provides software portability.
• Provide a hardware middleware to enable hardware portability. The MPE (Message Passing Engine) provides the portable hardware interface to be used by a hardware accelerator.
[Figure: the software-only environment (SW application / SW middleware / SW OS / host-specific hardware) beside the heterogeneous environment, where the host runs SW application / SW middleware / SW OS on host-specific hardware and the FPGA runs HW application / HW middleware / HW OS]
MPI Ring Communication Pattern

#include <mpi.h>

int main(int argc, char **argv) {
    int x, my_rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (my_rank == 0) {
        x = 1;
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&x, 1, MPI_INT, size-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (my_rank == size-1) {
        MPI_Recv(&x, 1, MPI_INT, my_rank-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        x++;
        MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&x, 1, MPI_INT, my_rank-1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        x++;
        MPI_Send(&x, 1, MPI_INT, my_rank+1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

[Figure: ranks R0 through R4 arranged in a ring (MPI size = 5), each passing the value to the next]
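Assuming a standard MPI installation on the host, this program could be compiled with mpicc and launched with five ranks (e.g., mpirun -np 5 ./ring) to match the figure; the point of the portability argument is that the same source is kept when some ranks are instead realized as hardware engines through the MPE, as the following slides show.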
Mapping Ranks to Heterogeneous Computing Elements
[Figure: the same ring of ranks R0-R4, but rank R4 is now mapped to a HW engine]
Ring Communication Example
[Figure: the ring mapped onto the platform – the four X86 cores of the quad-core Xeon and the ACP0 and ACP1 FPGA stacks on the FSB, with message hops 1 through 5 marked; FPGA-FPGA communication goes through the FSB without X86 intervention]
ACP0 – M2 Base FPGA
[Figure: the ACP0 M2 base FPGA (Virtex-5 LX110) – the Xilinx FSB interface to the Intel FSB feeds an MPI FSB bridge, which connects over FSLs (each line is two FSLs, one in each direction) to a MicroBlaze with GPIO driving LEDs]
ACP1 – M2 Base FPGA
[Figure: the ACP1 M2 base FPGA (Virtex-5 LX110) – the MPI FSB bridge connects over FSLs to a MicroBlaze (GPIO/LEDs), a router-init block, and two FSL-LVDS links to/from compute FPGAs 0 and 1]
ACP1 – M2 Compute 0 and 1 FPGAs
[Figure: each ACP1 compute FPGA (Virtex-5 LX330) – FSL-LVDS links to/from the base FPGA and the other compute FPGA, a MicroBlaze with GPIO driving LEDs, and a router-init block]
PERFORMANCE TESTING
Configurations
Send round-trip messages between two MPI tasks (the black squares in the figure). The X86 side uses Xeon cores running software MPI; the FPGA side uses hardware engines (HW) with the MPE.

Δt = round_trip_time / (2 * num_samples)
Latency = Δt for a small message size
BW = message_size / Δt

Measurements here are done using only the FSB base modules.
[Figure: the four measured configurations – Xeon-Xeon, Xeon-HW, intra-FPGA HW-HW, and inter-FPGA HW-HW]
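A minimal sketch of this kind of ping-pong measurement between two MPI ranks; the message size and sample count are illustrative, and a hardware rank would of course be timed through the MPE rather than with MPI_Wtime on a Xeon.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NUM_SAMPLES 1000
#define MSG_SIZE    64            /* bytes; small message to measure latency */

int main(int argc, char **argv) {
    char buf[MSG_SIZE];
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NUM_SAMPLES; i++) {
        if (rank == 0) {          /* rank 0 sends, then waits for the echo */
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* rank 1 echoes every message back */
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double round_trip_time = MPI_Wtime() - t0;

    if (rank == 0) {
        double dt = round_trip_time / (2.0 * NUM_SAMPLES);   /* one-way time per message */
        printf("latency = %g us, bandwidth = %g MB/s\n",
               dt * 1e6, MSG_SIZE / dt / 1e6);
    }
    MPI_Finalize();
    return 0;
}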
Preliminary Performance Numbers
                                  Xeon-Xeon   Xeon-HW   HW-HW (intra-FPGA)   HW-HW (inter-FPGA)
Latency [μs] (64-byte transfer)      1.9        2.78          0.39                 3.5
Bandwidth [MB/s]                    1000        410           531                  400
On-chip network using 32-bit channels, clocked at 133 MHz; MPI uses the Rendezvous protocol.
Xilinx driver performance numbers: latency = 0.5 μs (64-byte transfer), bandwidth = 2 GB/s.
The MPI Ready protocol achieves about 1/3 of the Rendezvous latency; for Xeon-HW it is 1 μs (only 2x slower than the Xilinx driver transfer latency).
128-bit on-chip channels will quadruple the HW bandwidth (to approximately 2 GB/s) and also reduce latency.
Other performance enhancements are possible.
Performance Improvements
• Ready protocol
  – No synchronization overhead as in Rendezvous
• Tiny-message protocol
  – Lower latency for small messages (40 bytes or less)
• From a 32-bit to a 128-bit-wide data path
  – 32 bits @ 133 MHz = 532 MB/s
  – 128 bits @ 133 MHz = 2.128 GB/s
• Zero-copy transfers
  – No intermediate copy to preallocated buffers (higher BW)
Latency (Point-to-Point)
[Plot: CPU-initiated ping-pong transfers (FPGA hardware: 128 bits @ 133 MHz)]
Bandwidth
[Plot: ping-pong bandwidth test, hardware at 128 bits @ 133 MHz]
BUILDING A LARGE HPC APPLICATION
Molecular Dynamics
• Simulate motion of molecules at atomic level
• Highly compute-intensive
• Understand protein folding
• Computer-aided drug design
The TMD Machine
• The Toronto Molecular Dynamics Machine
• Use multi-FPGA system to accelerate MD
• Built using an MPI programming model
• Principal algorithm developer: Chris Madill, Ph.D. candidate (now done!) in Biochemistry
  – Writes C++ using MPI, not Verilog/VHDL
• Have used three platforms – portability
• Plus scalability and maintainability
Platform Evolution
Network of five V2Pro PCI cards (2006)
• First to integrate hardware acceleration
• Simple LJ fluids only

Network of BEE2 multi-FPGA boards (2007)
• Added electrostatic terms
• Added bonded terms
FPGA portability and design abstraction facilitated ongoing migration.
2010 – Xilinx/Nallatech ACP
Stack of 5 large Virtex-5 FPGAs + 1 FPGA for the FSB PHY interface
Quad-socket Xeon server
Origin of Computational Complexity
Bonded terms, O(n):
  U_b = \sum_i k_i (r_i - r_{0,i})^2                      (bond stretch)
  U_a = \sum_i k_i (\theta_i - \theta_{0,i})^2            (angle bend)
  U_t = \sum_i k_i [1 + \cos(n_i \phi_i - \phi_{0,i})]    (torsion)

Nonbonded terms, O(n^2):
  V(r) = 4\epsilon [ (\sigma/r)^{12} - (\sigma/r)^{6} ]                      (Lennard-Jones)
  U_{elec} = \sum_n \sum_{i=1}^{N} \sum_{j=1}^{N} q_i q_j / r_{ij,n}         (electrostatics over periodic images n)
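To make the O(n^2) cost concrete, a naive sketch of the nonbonded (Lennard-Jones) energy loop over all atom pairs; this all-pairs double loop is the work the short-range nonbond engines offload. A single ε and σ for every pair is an assumption made here for brevity.

#include <math.h>

/* Naive all-pairs Lennard-Jones energy: the double loop over atom pairs
 * is the O(n^2) part that dominates MD run time. */
double lj_energy(const double x[][3], int n, double eps, double sigma)
{
    double U = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = x[i][0] - x[j][0];
            double dy = x[i][1] - x[j][1];
            double dz = x[i][2] - x[j][2];
            double r2 = dx*dx + dy*dy + dz*dz;
            double s6 = pow(sigma*sigma / r2, 3);   /* (sigma/r)^6 */
            U += 4.0 * eps * (s6*s6 - s6);          /* 4*eps*((sigma/r)^12 - (sigma/r)^6) */
        }
    }
    return U;
}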
Typical MD Simulator
[Figure: in a typical MD simulator, each CPU_i runs a Process_i that holds its Data_i and computes every force class itself – bonded, nonbonded, and PME]
TMD Machine Architecture
[Figure: the TMD machine decomposes the simulator into communicating MPI ranks – MPI::Send(&msg, size, dest, ...) – with input, scheduler, output, and visualizer processes, several atom managers, bond engines, long-range electrostatics engines, and many short-range nonbond engines]
Target Platform for MD
[Figure: a quad-socket FSB platform (each FSB at 8.5 GB/s @ 1066 MHz, 72.5 GB/s also annotated) – one socket holds the quad-core Xeon with system memory, and the other three sockets hold FPGA stacks of short-range nonbond engines (NBE), two of them also carrying a PME FPGA with its own memory]

Initial breakdown of CPU time:
• Short-range nonbonded: 12 nonbond FPGAs, 2-3 pipelines per NBE FPGA, each running 15-30x a CPU → NBE 360-1080x
• Long-range electrostatics: 2 PME FPGAs with fast memory and fibre-optic interconnects → PME 420x
• Bonds: on the quad-core Xeon server → Bonds 1x
Performance Modeling
Problem: It is difficult to mathematically predict the expected speedup a priori due to the contentious nature of the many-to-many communications.
Solution: Measure the non-deterministic behaviour using Jumpshot on the software version and back-annotate the deterministic behaviour.
• Make use of existing tools!
Single Timestep Profile
Timestep = 108 ms (327,506 atoms)
Performance
• Significant overlap between all force calculations.
• 108.02 ms is equivalent to between 80 and 88 InfiniBand-connected cores at U of T's supercomputer, SciNet
  – 160-176 hyperthreaded cores
• Can we do better?
  – 140 with hardware bond engines – change the engine from SW to HW, no architectural change
Final Performance Equivalent for MD
                               FPGA/CPU                 Supercomputer              Scaling Factor
Space                          5U                       17.5 * 2U                  1/7
Cooling                        N/A                      Share of 735-ton chiller   ∞?
Capital Cost                   $15,000*                 $120,000                   1/8
Annual Electricity Cost        $241 (assuming 500 W)    $6,758                     1/30
Performance (Core Equivalent)  140 cores                1 * 140 cores              140x

*The current system is a prototype. Cost is based on projections for the next-generation system.
TMD Perspective
• Still comparing apples to oranges.
• Individually, the hardware engines are able to sustain calculations hundreds of times faster than traditional CPUs.
• Communication costs degrade the overall performance.
• The FPGA platform is using older CPUs and older communication links than SciNet.
• Migrating the FPGA portion to a SciNet-compatible platform will further increase the relative performance and provide a more accurate CPU/FPGA comparison.
Conclusion
• In-socket accelerators
  – Use for absolute minimum latency
  – Cache coherency for easier programming
  – Proprietary bus, so at the mercy of the vendor
  – "Exotic" technology
  – Use only if you really, really need it!
Acknowledgements
SOCRN, emSYSCAN
Questions?