HPS Switch and Adapter: Architecture, Design & Performance
Rama K Govindaraju (ramag@us.ibm.com)
HiPC Conference, Bangalore, India
December 19-22, 2004
Team
• Architecture
  – Peter Hochschild, Don Grice, Kevin Gildea, Rama Govindaraju
• Hardware
  – Carl A Bender, Jay Herring, Piyush Chaudhary, Steven Martin, Jason Goscinski, John Houston, …
• Software
  – Chulho Kim, Robert Blackmore, Rajeev Sivaram, Hanhong Xue, …
• And many others contributed to this effort
Outline
• What is HPS?
• Example HPS customers
• Interconnect Historical Performance
• HPS switch architecture
• HPS adapter architecture
• HPS software architecture
• Transport Modes
• HPS Performance
• Lessons Learned and Future Work
What is HPS?
• HPS (High Performance Switch)
  – 4th generation switch and adapter to interconnect IBM's Power processor based nodes (Power 4 and 5)
  – To be used in many of the world's fastest supercomputers
    • 20 of the top 100 today use HPS
  – Addressing the requirements of
    • HPC labs, DOE, and others
    • Weather forecasting, petroleum, automotive and aerospace sectors
    • NSA and DOD
  – Core infrastructure for the 100TF ASCI Purple system to be delivered in June 2005
Example HPS Customers
• More than 30 and growing
• Several over 1000 CPUs
• Total: over 200TF
Historical Interconnect Performance
                     1993      1996      1998          2000          2004
Adapter              TB2       TB3       TBMX          Colony        HPS
Switch               HPS       TBS       TBS           SP-Switch2    HPS
Processor            Power 2   Power 2   Power PC/3    Power 3       Power 4
Peak link bandwidth  40MB/s    150MB/s   150MB/s       500MB/s       2GB/s
MPI bandwidth        35MB/s    110MB/s   135MB/s       375MB/s       1.8-14GB/s
MPI latency          40us      24us      21us          17us          <4.2us
Links/node (server)  1         1         1             1, 2          2, 4, 6, 8

IBM-developed switch interconnects and adapters
HPS Switch Fabric
[Diagram: Power4 and Power5 based servers attach HPS adapters over the GX bus; each adapter carries RAM and an LDC link-driver chip, and connects through copper or fiber-optic link drivers (12 meter copper cables or 40-80 meter fiber cables, Agilent optics) to switch boards built from Canopus HPS switch chips.]
4K end points, 59ns latency, 2GB/s bandwidth per link per direction
HPS Adapter Microcode Model
[Diagram: the adapter microcode engine sits between a server interface and a fabric interface, with 8M bytes of SRAM. Channel buffers (512 byte) hold general parameters (256 byte); packets carry a 128 byte header and 256 byte - 2K byte of data (16K byte total). The engine comprises a Formatter (mask, rotate, merge) with a 256-entry Formatter RAM, an ALU (parallel mask, shift, arithmetic & branch), an Instruction RAM of 4K entries of 64 bits, a program counter, General Registers (16 entries of 64 bits), Task Registers (16 entries of 64 bits) and control/status logic, plus IAMover, PacketMover and DataMover units with ports for MMIO (32+64), memory fetch (16), memory store (16), SRAM (16), IA read/write (8/8) and packet movers PM0/PM1 (16).]
HPS Software Architecture
[Diagram: applications run over IBM's MPI, Parallel ESSL and LAPI in user space, layered on HAL and the IF_LS interface; sockets, TCP/UDP and IP, VSD and GPFS run in kernel space. All paths go through the device driver (DD), the hypervisor (HYP), the HPS adapter and the HPS switch fabric. Management components include the HMC, FNM, service processor, LL and CSM.]
FIFO versus RDMA models
[Diagram: in user space, MPI sits over LAPI over HAL; in kernel space, sockets, TCP/UDP and the IP interface sit over HAL buffers. Both stacks reach the Federation adapter through its interface layer. Three data paths from the user buffer are shown: FIFO with copy, FIFO with DMA, and RDMA directly between the user buffer and the adapter.]
Supported Communication Modes
• FIFO Mode
  – Message chopped into 2K packet chunks on the host and copied by the CPU (sketched after the diagram below)
  – Memory bus crossings depend on caching; at least 1 IO bus crossing
• RDMA enablement
  – No slave side protocol
  – CPU offload
  – Enhanced programming model
  – 1 IO bus crossing
[Diagram: paths between the user buffer and the adapter. In FIFO mode the CPU loads/stores data through a network FIFO that the adapter DMAs; in RDMA mode the adapter moves data directly to and from the user buffer.]
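To make the FIFO-mode description concrete, here is a minimal C sketch of host-side packetization. The helpers hal_next_send_slot() and hal_commit_slot() are hypothetical stand-ins for the adapter's send-FIFO interface, not part of the actual HPS software; the sketch only illustrates the 2K chopping and CPU copy described above.

    #include <stddef.h>
    #include <string.h>

    #define PACKET_PAYLOAD 2048                 /* 2K packet chunks, as above */

    /* Hypothetical stand-ins for the adapter's send-FIFO interface. */
    extern void *hal_next_send_slot(void);      /* next free 2K FIFO slot */
    extern void  hal_commit_slot(size_t len);   /* hand the slot to the adapter */

    static void fifo_mode_send(const char *msg, size_t len)
    {
        size_t off = 0;
        while (off < len) {
            size_t chunk = (len - off > PACKET_PAYLOAD) ? PACKET_PAYLOAD
                                                        : len - off;
            /* CPU copy into the network FIFO: the memory-bus crossing. */
            memcpy(hal_next_send_slot(), msg + off, chunk);
            /* The adapter then DMAs the packet across the IO bus. */
            hal_commit_slot(chunk);
            off += chunk;
        }
    }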
RDMA value proposition
• Possible overlap of computation and communication
  – Fragmentation/reassembly offloaded to the adapter
  – Minimizes packet-arrival interrupts
  – Requires the application to be written to take advantage of overlap
• One-sided programming model
• Zero-copy transport and reduced memory subsystem load
• Striping advantage
• KEY DIFFERENTIATOR: reliable RDMA protocol over an unreliable datagram transport (sketched below)
  – Allows striping across multiple paths
  – Out-of-order arrival
  – Reduces hot-spotting and contention
• Cons
  – Pinned memory usage
  – Resource management and fairness issues
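One way to picture the reliable-RDMA-over-unreliable-datagram point: if every packet carries the absolute offset it targets, packets striped across several links can be placed on arrival in any order. The struct and function below are purely illustrative and are not the HPS wire format or software.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative packet descriptor, not the actual HPS wire format. */
    struct rdma_packet {
        uint64_t msg_id;        /* which message this chunk belongs to */
        uint64_t offset;        /* absolute offset within the target buffer */
        uint32_t len;           /* payload length in bytes */
        const void *payload;
    };

    /* Receive side: placement does not depend on arrival order, so chunks
     * striped across several links may arrive in any order; the message is
     * complete once the received byte count equals the message length. */
    static void place_packet(char *target_buf, uint64_t *bytes_received,
                             const struct rdma_packet *p)
    {
        memcpy(target_buf + p->offset, p->payload, p->len);
        *bytes_received += p->len;
    }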
Federation Performance
• Summary:
  – Latency: Power 4, 1.9GHz, HPS
    • MPI latency: 4.34us
    • Interrupt latency: adds 10us
    • 8-task latency: adds 1us
  – Bandwidth: Power 4, 1.9GHz, HPS
    • FIFO mode:
      – Unidirectional bandwidth: ~1.8GB/s
      – Bidirectional bandwidth: 2.1GB/s
    • RDMA mode:
      – Unidirectional bandwidth: ~1.8GB/s
      – Bidirectional bandwidth: ~3.0GB/s
      – Linear striping performance up to 8 links
        » Unidirectional: 14GB/s, Bidirectional: 24GB/s
• These are preliminary measurements
HPS: MPI Latency

Machine Type     Latency Measurement
1.9GHz, p690+    4.34us
1.7GHz, p690+    4.72us
1.7GHz, p655+    4.70us
1.5GHz, p690+    5.15us
1.3GHz, p690     5.5us

All measurements made using IBM's thread-safe MPI libraries.
8-task latency adds approximately 1 additional microsecond.
Interrupt latency adds approximately 10-12 microseconds.
All measurements are preliminary.
Unidirectional Bandwidth Peak
Machine Type     Peak Uni-dir Bandwidth
1.9GHz, p690+    1.800GB/s
1.7GHz, p690+    1.686GB/s
1.7GHz, p655+    1.800GB/s
1.5GHz, p690+    1.470GB/s
1.3GHz, p690     1.170GB/s
All measurements are preliminary
Unidirectional Bandwidth Profile
[Chart: delivered bandwidth (MB/s, 0-2000) versus message size (bytes) on a P655, 1.7GHz based system; M1/2 = 32K, M3/4 = 128K.]
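As an aside, M1/2 and M3/4 are read here in the usual way (an interpretation, not stated on the slide): under a Hockney-style latency/bandwidth model they are the message sizes at which half and three-quarters of the asymptotic bandwidth are reached.

    % t_0: zero-byte latency, B_inf: asymptotic (peak) bandwidth
    \[
      t(m) = t_0 + \frac{m}{B_\infty}, \qquad
      B(m) = \frac{m}{t(m)} = \frac{B_\infty\, m}{m + t_0 B_\infty}
    \]
    \[
      B\!\left(M_{1/2}\right) = \tfrac{1}{2} B_\infty, \qquad
      B\!\left(M_{3/4}\right) = \tfrac{3}{4} B_\infty
    \]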
Bidirectional Bandwidth Profile
[Chart: delivered bandwidth (MB/s, 0-2500) versus message size (bytes) on a P655, 1.7GHz based system; M1/2 = 16K, M3/4 = 64K.]
Striping Options
[Diagram: timelines of communication time by thread/task (T1, T2, T3) under three models: a) Asynchronous Model, b) Synchronous Model, c) Aggregate Comm Thread Model.]
Striping Models
[Diagram: two ways to stripe across adapters. Left, the "multiple threads doing copies" model: the MPI layer drives several LAPI layers, each over its own HAL instance and adapter. Right, the "single thread with pipelined RDMA" model: the MPI layer drives one LAPI layer over several HAL instances and adapters.]
Second approach:
- More elegant failover model
- Fewer synchronization issues and less CPU contention, via RDMA
(A minimal sketch of the pipelined-RDMA striping idea follows.)
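The sketch below illustrates the single-thread pipelined-RDMA idea only: one thread walks the message and posts successive stripes to the adapters in round-robin order, so all links stay busy without per-link copy threads. rdma_post() and the stripe size are assumptions for illustration, not the actual LAPI/HAL interface or a tuned value.

    #include <stddef.h>

    #define NUM_LINKS   8
    #define STRIPE_SIZE (256 * 1024)    /* illustrative stripe size only */

    /* Hypothetical: ask the adapter on 'link' to RDMA msg[off .. off+len). */
    extern void rdma_post(int link, const char *msg, size_t off, size_t len);

    static void stripe_message(const char *msg, size_t len)
    {
        int link = 0;
        for (size_t off = 0; off < len; off += STRIPE_SIZE) {
            size_t chunk = (len - off > STRIPE_SIZE) ? (size_t)STRIPE_SIZE
                                                     : len - off;
            rdma_post(link, msg, off, chunk);   /* adapter DMAs this stripe */
            link = (link + 1) % NUM_LINKS;      /* next stripe, next link */
        }
    }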
RDMA Unidirectional Bandwidth (preliminary)
[Chart: bandwidth (MB/s, 0-16000) versus message size (16 KB to 256 MB) for a single link, two links, four links and eight links.]
RDMA Bidirectional Bandwidth (preliminary)
[Chart: bandwidth (MB/s, 0-25000) versus message size (16 KB to 256 MB) for a single link, two links, four links and eight links.]
How can users exploit RDMA?
• Overlap computation and communication
  – Non-blocking calls (see the sketch after this list)
  – Reuse communication buffers if possible
  – User-exposed RDMA in 11/05
• Minimize interrupts for large transfers
• Reduce contention for memory
• Better raw bandwidth for messages over 80KB
• Possibility of overlapping collectives better (via striping)
• IP transport much more efficient (translates to improved GPFS performance)
• Select striping when sending large messages
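A hedged example of the "non-blocking calls" guidance, using only standard MPI (nothing HPS-specific is assumed): the RDMA-capable adapter can move data while the CPU computes, and reusing the same buffers across iterations presumably lets their pinned/translated state be reused.

    #include <mpi.h>

    extern void do_local_work(void);     /* application compute phase */

    /* Exchange n doubles with 'peer' while computing, then complete both. */
    void exchange_and_compute(double *sendbuf, double *recvbuf, int n, int peer)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        do_local_work();                 /* overlaps with adapter-side
                                            fragmentation/reassembly */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }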
Future Work
• Enabling HPS for Power 5 based nodes
• Exploit SMT in the Power 5 processor for FIFO mode
• Further attack MPI latency
• Use RDMA to improve MPI collectives performance
• Further exploitation of IP over RDMA by parallel file systems (GPFS)
• Take lessons learned into the PERCS project