Using HPS Switch on Bassi
Jonathan Carter
User Services Group Lead
jtcarter@lbl.gov
NERSC User Group Meeting
June 12, 2006
IBM Switch Evolution
| Year | Name | Peak BW | Latency | Processor |
|------|------|---------|---------|-----------|
| 1996 | SP Switch | 300 MB/s per node (2x150 MB/s channel) | 20-35 us | Power2/Power3 |
| 2000 | SP Switch2 (Colony) | 2 GB/s per node (2x500 MB/s per port) | ~17 us | Power3/Power4 |
| 2003 | HPS (Federation) | 2 GB/s per port | 5-14 us | Power4/Power5 |
HPS Switch Configuration
Bassi Switch Configuration
B0101 B0201 B0301 B0401 B0501 B0601 B0701 B0801 B0901 B1001 B1101 B1201
B0102 B0202 B0302 B0402 B0502 B0602 B0702 B0802 B0902 B1002 B1102 B1202
B0103 B0203 B0303 B0403 B0503 B0603 B0703 B0803 B0903 B1003 B1103 B1203
B2904 B0304 B0404 B0504 B0704 B0804 B0904 B1004 B1104 B1204
B0205 B0305 B0405 B0505 B0705 B8905 B0905 B1005 B1105 B1205
B0206 B0306 B0406 B0506 B0706 B0806 B0906 B1006 B1106 B1206
B0207 B0307 B0407 B0507 B0707 B8907 B0907 B1007 B1107 B1207
B2908 B0308 B0408 B0508 B0708 B0808 B0908 B1008 B1108 B1208
B0209 B0309 B0409 B0709 B0809 B0909 B1009 B1109 B1209
B0210 B0310 B0410 B0710 B0810 B0910 B1010 B1110 B1210
B0211 B0311 B0411 B0711 B0811 B0911 B1011 B1111 B1211
B0212 B0312 B0412 B0712 B0812 B0912 B1012 B1112 B1212
IBM Software
• Parallel Environment (PE 4.2.2), which contains poe and MPI, remains unchanged
• Parallel System Support Programs (PSSP 3.5.0), which contains LAPI, has been absorbed into the Reliable Scalable Cluster Technology (RSCT 2.4.2) software stack
IBM Software
• MPI 4.2.2
  – Uses LAPI as reliable transport layer
  – Uses threads, not signals, for asynchronous activities
• Binary compatible
• New performance characteristics
  – Eager
  – Bulk transfer
  – Collectives
IBM Software Stack
[Diagram: layered software stack. Labels as extracted: HPS, SMA3+ Adapter, HAL, LAPI, IF_LS, IP, MPI, Application, ESSL, PESSL, GPFS, Sockets, VSD, TCP, UDP]
Communication Modes
• FIFO mode
  – Chopped into 2KB chunks on host, copied by CPU
• Remote Direct Memory Access (RDMA)
  – CPU offload
  – One I/O bus crossing
[Diagram: data paths between the user buffer and the adapter. Labels: CPU, User Buffer, FIFO, RDMA, DMA, Adapter]
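The host-side packetization described for FIFO mode can be sketched as below. This is only a toy model of the 2KB chunking named on the slide, not the LAPI or HAL implementation; the function names `fifo_chunks` and `cpu_copies_needed` are hypothetical.

```python
from typing import List

# 2KB chunk size taken from the slide; everything else here is illustrative.
FIFO_CHUNK_BYTES = 2 * 1024


def fifo_chunks(message: bytes, chunk_bytes: int = FIFO_CHUNK_BYTES) -> List[bytes]:
    """Split a message into in-order chunks of at most chunk_bytes."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return [message[i:i + chunk_bytes] for i in range(0, len(message), chunk_bytes)]


def cpu_copies_needed(message_bytes: int, chunk_bytes: int = FIFO_CHUNK_BYTES) -> int:
    """Chunk copies the CPU performs in this toy model (ceiling division)."""
    return -(-message_bytes // chunk_bytes)


if __name__ == "__main__":
    chunks = fifo_chunks(bytes(5000))
    print([len(c) for c in chunks])   # [2048, 2048, 904]
    print(cpu_copies_needed(5000))    # 3
```

In this model the CPU work grows with the number of chunks, which is the cost that RDMA is meant to offload.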
RDMA (Bulk transfer)
• Overlap of communication and computation possible
  – Asynchronous-messaging applications
  – One-sided communications
• Reduce CPU work
  – Offload fragmentation and reassembly
  – Minimize packet arrival interrupts
• Reduce memory subsystem load
  – Zero copy transport
• Striping across adapters
RDMA vs. Packet
[Figure: PingPong bandwidth (MB/s, 0 to 3500) versus MSG Size (0 to 8.4E+06); two PingPong series comparing RDMA and packet transfers]
MPI Transfer Protocols
• Eager: send data immediately; store in remote buffer
  – No synchronization
  – Only one message sent
  – Uses memory for buffering (less for application)
• Rendezvous: send message header; wait for recv to be posted; send data
  – No data copy may be required
  – No memory required for buffering (more for application)
  – More messages required
  – Synchronization (standard send blocks until recv posted)
[Diagram: message exchange between P0 and P1. Eager: data, ack. Rendezvous: req, ack, data, ack]
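The two protocols can be contrasted with a small sketch that lists the messages exchanged for one standard send. It is a simplified model rather than the IBM MPI implementation: the function name `wire_messages`, the direction of each arrow, the 32KB threshold, and the choice of `<=` at the boundary are all assumptions made for illustration.

```python
from typing import List

# Hypothetical threshold for illustration; the real limit is set by the MPI library.
EAGER_LIMIT_BYTES = 32 * 1024


def wire_messages(size_bytes: int, eager_limit: int = EAGER_LIMIT_BYTES) -> List[str]:
    """Return the modeled P0/P1 exchange for one standard send of size_bytes."""
    if size_bytes <= eager_limit:
        # Eager: data goes out immediately and is buffered at the receiver.
        return ["P0->P1 data", "P1->P0 ack"]
    # Rendezvous: header first; data moves only after the receive is posted.
    return ["P0->P1 req", "P1->P0 ack", "P0->P1 data", "P1->P0 ack"]


if __name__ == "__main__":
    for size in (1024, 1024 * 1024):
        print(size, wire_messages(size))
```

The extra req/ack round trip in the rendezvous branch is the synchronization cost listed above; in exchange, the receiver needs no buffer space for the payload.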
Eager vs. Rendezvous
[Figure: transfer time (us, 0 to 120) versus MSG Size (0 to 1.3E+05) for the Eager and Rendezvous protocols]
Latency
| System | Intranode (us) | Internode (us) |
|--------|----------------|----------------|
| Seaborg | 10.5 | 24.5 |
| Jacquard | 0.6 | 4.7 |
| Bassi | 1.1 | 4.5 |
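To relate these latencies to message size, a common first-order model is time = latency + size / bandwidth. The sketch below applies it to Bassi's 4.5 us internode latency; the 2000 MB/s bandwidth is an assumption taken from the peak per-port figure quoted for HPS in the switch evolution table, not a measured value, and the function names are hypothetical.

```python
def transfer_time_us(size_bytes: float, latency_us: float, bandwidth_mb_s: float) -> float:
    """Linear cost model: time = latency + size / bandwidth."""
    # 1 MB/s equals 1 byte/us when MB means 10**6 bytes.
    return latency_us + size_bytes / bandwidth_mb_s


def half_bandwidth_size_bytes(latency_us: float, bandwidth_mb_s: float) -> float:
    """Message size at which the model delivers half the asymptotic bandwidth."""
    return latency_us * bandwidth_mb_s


if __name__ == "__main__":
    latency_us = 4.5          # Bassi internode latency from the table
    bandwidth_mb_s = 2000.0   # assumed: peak per-port figure, not a measurement
    n_half = half_bandwidth_size_bytes(latency_us, bandwidth_mb_s)
    print(n_half)                                                 # 9000.0
    print(transfer_time_us(n_half, latency_us, bandwidth_mb_s))   # 9.0
```

Under these assumptions, messages much smaller than about 9 KB are dominated by latency, and only much larger messages approach the peak rate.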
Internode Comparison
[Figure: internode bandwidth (MB/s, 0 to 3500) versus MSG Size (0 to 8.4E+06) for bassi, seaborg, and jacquard]
Internode Comparison
[Figure: internode bandwidth (MB/s, 0 to 400) versus MSG Size, small-message detail (0 to 4.1E+03), for bassi, seaborg, and jacquard]
Intranode Comparison
[Figure: intranode bandwidth (MB/s, 0 to 8000) versus MSG Size (0 to 2.1E+06) for bassi, seaborg, and jacquard]
Intranode Comparison
[Figure: intranode bandwidth (MB/s, 0 to 1400) versus MSG Size, small-message detail (0 to 4.1E+03), for bassi, seaborg, and jacquard]
Packed-node Comparison
[Figure: packed-node bandwidth (MB/s, 0 to 600) versus MSG Size (0 to 8.4E+06) for bassi, seaborg, and jacquard]
Packed-node Comparison
[Figure: packed-node bandwidth (MB/s, 0 to 300) versus MSG Size, small-message detail (0 to 4.1E+03), for bassi, seaborg, and jacquard]
POE environment variables
• MP_SINGLE_THREAD
  – Set to Yes for a slight latency decrease; set to No for MPI I/O, OpenMP, etc.
• MP_USE_BULK_XFER
  – Defaults to Yes
• MP_BULK_MIN_MSG_SIZE
  – Defaults to ~150KB
POE environment variables
• MP_BUFFER_MEM
  – Default is 64MB
• MP_EAGER_LIMIT
  – Varies from 32KB to 1KB depending on job size; can be increased in conjunction with MP_BUFFER_MEM
• LAPI parameters for apps with many blocking sends of small messages:
  – MP_REXMIT_BUF_SIZE
    • Default is 128 bytes
  – MP_REXMIT_BUF_CNT
    • Default is 128 buffers
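As a rough illustration of how these variables might be collected for a job script, the sketch below validates names against the variables listed on these slides and renders sh-style `export` lines. The helpers `poe_settings` and `export_lines` are hypothetical, the sh-style syntax is an assumption about the login shell, and the values shown are examples only; consult the Parallel Environment manuals for the accepted value formats.

```python
from typing import Dict, List

# Variable names taken from the slides; nothing here is queried from POE itself.
KNOWN_VARIABLES = (
    "MP_SINGLE_THREAD",
    "MP_USE_BULK_XFER",
    "MP_BULK_MIN_MSG_SIZE",
    "MP_BUFFER_MEM",
    "MP_EAGER_LIMIT",
    "MP_REXMIT_BUF_SIZE",
    "MP_REXMIT_BUF_CNT",
)


def poe_settings(**requested: str) -> Dict[str, str]:
    """Return the requested settings, rejecting names not listed on the slides."""
    unknown = sorted(set(requested) - set(KNOWN_VARIABLES))
    if unknown:
        raise KeyError("unknown POE variable(s): " + ", ".join(unknown))
    return dict(requested)


def export_lines(settings: Dict[str, str]) -> List[str]:
    """Render settings as sh-style export lines, sorted by variable name."""
    return [f"export {name}={value}" for name, value in sorted(settings.items())]


if __name__ == "__main__":
    # Example: a hybrid MPI/OpenMP job keeps threading on and bulk transfer enabled.
    settings = poe_settings(MP_SINGLE_THREAD="no", MP_USE_BULK_XFER="yes")
    print("\n".join(export_lines(settings)))
```

Rejecting unknown names catches typos such as a misspelled variable, which POE would otherwise silently ignore as an unrelated environment entry.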
IBM Documentation
• RSCT for AIX 5L LAPI Programming Guide (SA22-7936-03)
  – LAPI programming
• Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 1 (SA22-7948-04)
  – Running jobs
• Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 2 (SA22-7949-04)
  – Performance tools
• Parallel Environment for AIX 5L V4.2.2 MPI Programming Guide (SA22-7945-04)
  – IBM MPI implementation