Comparison of Communication and I/O of the Cray T3E and IBM SP
NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Jonathan Carter
NERSC User Services
Overview
• Node Characteristics
• Interconnect Characteristics
• MPI Performance
• I/O Configuration
• I/O Performance
T3E Architecture
• Distributed memory, single CPU processing elements
[Figure: processing elements, each with one CPU and its memory, attached to the interconnect]
T3E Communication Network
• Processing Elements (PE) are connected by a 3D torus.
T3E Communication Network
• The peak bandwidth of the torus is about 600 Mbyte/sec per link, bidirectional
• Sustainable bandwidth is about 480 Mbyte/sec bidirectional
• Latency is 1 microsec.
• shmem API gives latency of 1 microsec., bandwidth 350 Mbyte/sec bidirectional
SP Architecture
• Cluster of SMP nodes
[Figure: SMP node with CPUs sharing one memory, attached to the interconnect]
SP Communication Network
• Nodes are connected via adapters to the SP Switch. The switch is composed of boards which link 16 nodes. Boards are linked to form a larger network.
[Figure: nodes attached to a switch board]
SP Communication Network
• The peak bandwidth per node is 300 Mbyte/sec bidirectional
• Latency of the switch is about 2 microsec.
• Sustainable bandwidth is about 185 Mbyte/sec bidirectional
MPI Performance
                      T3E    SP (intra-node)    SP (inter-node)
Latency (μs)          12     10                 22
Bandwidth (Mbyte/s)   270    300                150

Point-to-point, single transmitter, single receiver
Intra-node is 1 MPI process per node; 2 MPI processes (typical) will halve bandwidth
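
To get a feel for what the latency and bandwidth figures above mean for a given message size, the sketch below applies a simple "latency plus size over bandwidth" estimate. The linear model and the assumption that 1 Mbyte = 10^6 bytes are illustrative choices, not something stated in the slides, and real timings will depend on message size and contention.

```python
# Rough per-message time estimate from the MPI figures quoted above.
# Assumptions (not from the slides): a linear cost model and 1 Mbyte = 10**6 bytes.

MPI_FIGURES = {
    # name: (latency in microseconds, bandwidth in Mbyte/s)
    "T3E": (12.0, 270.0),
    "SP intra-node": (10.0, 300.0),
    "SP inter-node": (22.0, 150.0),
}


def transfer_time_us(message_bytes, latency_us, bandwidth_mbyte_s):
    """Estimated one-way time in microseconds for a single message."""
    # 1 Mbyte/s is 1 byte per microsecond, so bytes / (Mbyte/s) is in microseconds.
    return latency_us + message_bytes / bandwidth_mbyte_s


for name, (latency, bandwidth) in MPI_FIGURES.items():
    small = transfer_time_us(1_000, latency, bandwidth)
    large = transfer_time_us(1_000_000, latency, bandwidth)
    print(f"{name:14s} 1000 bytes: {small:8.1f} us   10^6 bytes: {large:10.1f} us")
```

Under this model small messages are dominated by latency and large messages by bandwidth, which is why both rows of the table matter.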
MPI Performance
[Figure: MPI_Bcast performance versus number of processes (16, 32, 64, 128) for the T3E and SP, with 256-byte and 1024-byte messages]
T3E I/O Configuration
• PEs do not have local disk
• All PEs access all filesystems equivalently
• Path for (optimum) I/O generally looks like:
  – PE to I/O node via torus
  – I/O node to Fibre Channel Node (FCN) via Gigaring
  – FCN to Disk Array via Fibre loop
• In some cases data on an APP PE must be transferred to a system buffer on an OS PE, then out to an FCN
T3E I/O Configuration
[Figure: I/O nodes and FCNs connected by the Gigaring, with disk arrays attached to the FCNs]
SP I/O Configuration
• Nodes have local disk. One SCSI disk for all local filesystems. Non-optimal.
• All nodes access Global Parallel File System (GPFS) filesystems equivalently
• Path for GPFS I/O looks like:
  – Node to GPFS Node via IP over the switch
  – GPFS Node to Disk Array via SSA (Serial Storage Architecture) loop
SP I/O Configuration
[Figure: nodes connected through the switch to GPFS nodes, which attach to the disk array]
T3E Filesystems
• /usr/tmp
  – fast
  – subject to 14 day purge, not backed up
  – check quota with quota -s /usr/tmp (usually 75 Gbyte and 6000 inodes)
• $TMPDIR
  – fast
  – purged at end of job or session
  – shares quota with /usr/tmp
• $HOME
  – slower
  – permanent, backed up
  – check quota with quota (usually 2 Gbyte and 3500 inodes)
SP Filesystems
• /scratch and $SCRATCH
  – global
  – fast (GPFS)
  – subject to 14 day purge (or at session end for $SCRATCH), not backed up
  – check quota with myquota (usually 100 Gbyte and 6000 inodes)
• $TMPDIR
  – local (created in /scr), only 2 Gbyte total
  – slower
  – purged at end of job or session
• $HOME
  – global
  – slower (GPFS)
  – permanent, not backed up yet
  – check quota with myquota (usually 4 Gbyte and 5000 inodes)
Types of I/O
• Bewildering number of choices on both machines:
  – Standard Language I/O: Fortran or C (ANSI or POSIX)
  – Vendor extensions to language I/O
  – MPI I/O
  – Cray FFIO library (can be used from Fortran or C)
  – IBM MIO library, requires code changes
Standard Language I/O
• Fortran direct access is slightly more efficient than sequential access on both the T3E (see comments on FFIO later) and the SP. It also allows file transferability.
• C language I/O (fopen, fwrite, etc.) is inefficient on both machines.
• POSIX standard I/O (open, read, etc.) can be efficient on the T3E, but requires care (see comments on FFIO later). Works well on the SP.
Vendor Extensions to Language I/O
• Cray has a number of I/O routines (aqopen, etc.) which are legacies from the PVP systems. Non-portable.
• IBM has extended Fortran syntax to provide asynchronous I/O. Non-portable.
MPI I/O
• Part of MPI-2
• Interface for High Performance Parallel I/O
  – data partitioning
  – collective I/O
  – asynchronous I/O
  – portability and interoperability between T3E and SP
• Different subsets implemented on T3E and SP
Summary of access routines for T3E
Positioning   Synchronism    Coordination
                             Non-collective   Collective
Explicit      Blocking       READ_AT          READ_AT_ALL
              Non-blocking   IREAD_AT         READ_AT_ALL_BEGIN
                             WAIT             READ_AT_ALL_END
Individual    Blocking       READ             READ_ALL
              Non-blocking   IREAD            READ_ALL_BEGIN
                             WAIT             READ_ALL_END
Shared        Blocking       READ_SHARED      READ_ORDERED
              Non-blocking   IREAD_SHARED     READ_ORDERED_BEGIN
                             WAIT             READ_ORDERED_END
Summary of access routines for SP
Positioning   Synchronism    Coordination
                             Non-collective   Collective
Explicit      Blocking       READ_AT          READ_AT_ALL
              Non-blocking   IREAD_AT         READ_AT_ALL_BEGIN
                             WAIT             READ_AT_ALL_END
Individual    Blocking       READ             READ_ALL
              Non-blocking   IREAD            READ_ALL_BEGIN
                             WAIT             READ_ALL_END
Shared        Blocking       READ_SHARED      READ_ORDERED
              Non-blocking   IREAD_SHARED     READ_ORDERED_BEGIN
                             WAIT             READ_ORDERED_END
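
The access-routine summaries above classify the read calls along three axes: positioning, synchronism and coordination. As a rough aid, the sketch below encodes that classification as a lookup table that returns the corresponding C binding names from the MPI-2 standard. The `MPI_File_` prefixes and the function `mpi_read_routines` are additions for illustration; the slides do not say which of these routines each machine actually implements.

```python
# Lookup from (positioning, blocking?, collective?) to MPI-2 read routine names.
# The short names come from the summary tables; the full MPI_File_* spellings
# are the MPI-2 C bindings. Which subset each machine supports is not recorded here.

READ_ROUTINES = {
    ("explicit", True, False): ["MPI_File_read_at"],
    ("explicit", True, True): ["MPI_File_read_at_all"],
    ("explicit", False, False): ["MPI_File_iread_at", "MPI_Wait"],
    ("explicit", False, True): ["MPI_File_read_at_all_begin", "MPI_File_read_at_all_end"],
    ("individual", True, False): ["MPI_File_read"],
    ("individual", True, True): ["MPI_File_read_all"],
    ("individual", False, False): ["MPI_File_iread", "MPI_Wait"],
    ("individual", False, True): ["MPI_File_read_all_begin", "MPI_File_read_all_end"],
    ("shared", True, False): ["MPI_File_read_shared"],
    ("shared", True, True): ["MPI_File_read_ordered"],
    ("shared", False, False): ["MPI_File_iread_shared", "MPI_Wait"],
    ("shared", False, True): ["MPI_File_read_ordered_begin", "MPI_File_read_ordered_end"],
}


def mpi_read_routines(positioning, blocking, collective):
    """Return the routine(s) to call for one cell of the summary table."""
    return READ_ROUTINES[(positioning, blocking, collective)]


print(mpi_read_routines("explicit", blocking=True, collective=True))
print(mpi_read_routines("shared", blocking=False, collective=False))
```

Non-blocking entries list two names because the operation is started by the first call and completed by the second.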
Cray FFIO library
• FFIO is a set of I/O layers tuned for different I/O characteristics
• Buffering of data (configurable size)
• Caching of data (configurable size)
• Available to regular Fortran I/O without reprogramming
• Available for C through POSIX-like calls, e.g. ffopen, ffwrite
FFIO - The assign command
• controls program behavior at runtime
• the assign command controls
  – which FFIO layer is active
  – striping across multiple partitions
  – lots more
• scope of assign
  – File name
  – Fortran unit number
  – File type (e.g. all sequential unformatted files)
IBM MIO library
• User interface based on POSIX I/O routines, so requires program modification
• Useful trace module to collect statistics
• Not much experience with using on GPFS filesystem
• Coming soon
I/O Strategies - Exclusive access files
• Each process reads and writes to a separate file
  – Language I/O
    • Increase language I/O performance with the FFIO library (for example, specify a large buffer with the bufa layer) on the T3E. For Fortran direct access, the default buffer is only the maximum of the record length or 32 Kbytes
    • read/write large amounts of data per request on the SP
  – MPI I/O
    • read/write large amounts of data per request
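
The advice to move large amounts of data per request can be illustrated with a small sketch that writes the same buffer using many small requests and then one large one, counting how many requests are issued. The function `write_in_requests`, the 1 Mbyte test buffer and the request sizes are hypothetical choices for illustration; the sketch only counts requests and makes no claim about timings on either machine.

```python
import os
import tempfile


def write_in_requests(path, data, request_bytes):
    """Write data with unbuffered writes of at most request_bytes each.

    Returns the number of logical write requests issued.
    """
    view = memoryview(data)
    requests = 0
    with open(path, "wb", buffering=0) as f:
        for start in range(0, len(view), request_bytes):
            chunk = view[start:start + request_bytes]
            requests += 1
            while len(chunk):  # finish the request if the OS does a short write
                written = f.write(chunk)
                chunk = chunk[written:]
    return requests


data = bytes(1024 * 1024)  # 1 Mbyte of zeros (hypothetical payload)
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "out.dat")
    small = write_in_requests(path, data, 4 * 1024)
    large = write_in_requests(path, data, 1024 * 1024)
    size = os.path.getsize(path)

print(f"4 Kbyte requests: {small}, 1 Mbyte requests: {large}, file size: {size}")
```

Each request carries fixed overhead, so the same file written with fewer, larger requests spends less time per byte in the I/O path.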
bufa FFIO layer Overview
• bufa is an asynchronous buffering layer
• performs read-ahead, write-behind
• specify buffer size in assign call, with -F bufa:bs:nbufs where bs is the buffer size in units of 4 Kbyte blocks, and nbufs is the number of buffers
• buffer space increases your application’s memory requirements
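
Because the buffer space comes out of the application's memory, it helps to work out the cost of a given bufa setting before using it. The sketch below builds the `bufa:bs:nbufs` specification and computes the buffer space as bs × 4 Kbyte × nbufs. The helper names and the values 256 and 4 are hypothetical, and a Kbyte is assumed to be 1024 bytes.

```python
# Buffer-space arithmetic for a bufa layer specification.
# Assumption: 1 Kbyte = 1024 bytes, so one block is 4096 bytes.

BLOCK_BYTES = 4 * 1024


def bufa_spec(bs, nbufs):
    """Format the layer specification passed to assign -F."""
    return f"bufa:{bs}:{nbufs}"


def bufa_memory_bytes(bs, nbufs):
    """Total buffer space: bs blocks of 4 Kbyte in each of nbufs buffers."""
    return bs * BLOCK_BYTES * nbufs


spec = bufa_spec(256, 4)        # hypothetical setting
mem = bufa_memory_bytes(256, 4)
print(f"-F {spec} uses {mem // 1024**2} Mbyte of buffer space")
```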
I/O Strategies - Shared files
• All PEs read and write the same file simultaneously
  – Language I/O (requires FFIO library global layer for T3E)
  – MPI I/O
  – On T3E, language I/O with FFIO library global layer and Cray extensions for additional flexibility
  – This can sequentialize your reads and ruin your I/O performance
Positioning with a shared file
• Positioning of a read or write is your responsibility
• File pointers are private
• Fortran
  – Use a direct access file, and read/write(rec=num)
  – Use Cray T3E extensions setpos and getpos to position file pointer (not portable)
• C
  – Use ffseek
• MPI I/O
  – MPI I/O fileview generally takes care of this. Positioning routines also available.
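
The positioning rule above amounts to each process computing its own byte offset before every read or write. The sketch below mimics Fortran direct access, where record number rec with record length RECL starts at byte (rec - 1) × RECL. It is only an illustration: the "processes" are simulated one after another in a single Python program, each opening the file separately so that it has a private file pointer, and `RECL`, `write_record` and `read_record` are hypothetical names.

```python
import os
import tempfile

RECL = 16  # record length in bytes (hypothetical)


def write_record(path, rec, payload):
    """Write one fixed-length record; rec is 1-based, like Fortran rec=num."""
    assert len(payload) == RECL
    with open(path, "r+b") as f:  # separate open, so a private file pointer
        f.seek((rec - 1) * RECL)
        f.write(payload)


def read_record(path, rec):
    """Read back one fixed-length record by record number."""
    with open(path, "rb") as f:
        f.seek((rec - 1) * RECL)
        return f.read(RECL)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "shared.dat")
    nprocs = 4
    with open(path, "wb") as f:
        f.truncate(nprocs * RECL)  # pre-size the shared file
    # Simulated ranks write in reverse order to show that order does not matter.
    for rank in reversed(range(nprocs)):
        write_record(path, rank + 1, f"rank {rank}".encode().ljust(RECL, b"."))
    records = [read_record(path, rec) for rec in range(1, nprocs + 1)]

print(records)
```

Because every writer positions explicitly, the final file layout depends only on the record numbers, not on which process happened to write first.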
global FFIO layer Overview
• global is a caching and buffering layer which enables multiple PEs to read and write to the same file
• if one PE has already read the data, an additional read request from another PE will result in a remote memory copy
• file open is a synchronizing event
• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
• specify buffer size with assign -F global:bs:nbufs where bs is the buffer size in units of 4 Kbyte blocks, and nbufs is the number of buffers per PE
GPFS and shared files
• On the T3E the global FFIO layer takes care of updates to a file from multiple PEs by tracking the state of the file across all PEs.
• On the SP, GPFS implements a safe update scheme via tokens and a token manager.
  – If two processes access the same block of a GPFS file (256 Kbytes), a negotiation is conducted between the nodes and the token manager to determine the order of updates. This can slow down I/O considerably.
  – MPI I/O merges requests from different processes to alleviate this problem
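
One way to anticipate the token negotiation described above is to check whether the byte ranges written by two processes fall into the same 256 Kbyte block. The sketch below does that arithmetic. It is a simplified model that assumes blocks start at multiples of 256 Kbytes from the beginning of the file and that a Kbyte is 1024 bytes; the function names are hypothetical and no GPFS interface is involved.

```python
# Which 256 Kbyte blocks does a byte range touch, and do two ranges share any?
# Simplified model: blocks are aligned to multiples of the block size.

GPFS_BLOCK = 256 * 1024  # bytes, assuming 1 Kbyte = 1024 bytes


def blocks_touched(offset, length):
    """Indices of the blocks covered by bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // GPFS_BLOCK
    last = (offset + length - 1) // GPFS_BLOCK
    return range(first, last + 1)


def shared_blocks(range_a, range_b):
    """Blocks that both (offset, length) ranges touch."""
    return sorted(set(blocks_touched(*range_a)) & set(blocks_touched(*range_b)))


# Two processes writing adjacent 100 Kbyte pieces both land in block 0.
unaligned = shared_blocks((0, 100 * 1024), (100 * 1024, 100 * 1024))
# Block-aligned, block-sized pieces do not share a block.
aligned = shared_blocks((0, GPFS_BLOCK), (GPFS_BLOCK, GPFS_BLOCK))

print("unaligned pieces share blocks:", unaligned)
print("aligned pieces share blocks:", aligned)
```

Under this model, sizing and aligning each process's piece to a multiple of the block size avoids shared blocks, while small adjacent pieces do not.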
I/O Performance Comparison
• Each process writes a 200 Mbyte file. 2 processes per node on SP.
Further Information
• I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials
• Cray Publication - Application Programmer’s I/O Guide
• Cray Publication - Cray T3E Fortran Optimization Guide
• man assign
• XL Fortran User’s Guide