Early X1 Experiences at Boeing - CUG
Early X1 Experiences at Boeing
Jim Glidewell
Information Technology Services
Boeing Shared Services Group
Page 2 X1 Experiences - Cray User Group - May 2004
Early X1 Experiences at Boeing
• HPC computing environment
• X1 configuration
• Hardware and OS
• Applications
• Support
• Summary
Current HPC systems
• Two Cray T-90s
• A 384-CPU Origin 3800
• Three 256-CPU Linux clusters
• Cray X1
Cray X1
• Fully populated liquid-cooled chassis
• 64 MSPs
• 512 GB of memory, 32GB per node
• Additional Java Server
• Total of 26 terabytes of LSI RAID disk, managed by ADIC StorNext software
• X1 is partitioned into two systems
• 14-node production partition
• 2-node test partition
• Allows testing of weekly OS updates
• Adds complexity to the network
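The configuration figures above are mutually consistent, which a quick check makes explicit. This is a minimal sketch that assumes the standard X1 packaging of four MSPs per node, a hardware detail the slide does not state; the variable names are illustrative only.

```python
# Consistency check for the configuration figures on the slide.
MSPS_TOTAL = 64                  # from the slide
MSPS_PER_NODE = 4                # assumption: standard Cray X1 node packaging, not on the slide
MEM_PER_NODE_GB = 32             # from the slide
PROD_NODES, TEST_NODES = 14, 2   # from the slide

nodes = MSPS_TOTAL // MSPS_PER_NODE
total_mem_gb = nodes * MEM_PER_NODE_GB
prod_msps = PROD_NODES * MSPS_PER_NODE
test_msps = TEST_NODES * MSPS_PER_NODE

print(f"nodes={nodes}, memory={total_mem_gb} GB")
print(f"production: {prod_msps} MSPs, test: {test_msps} MSPs")
```

Under that assumption the 64 MSPs form 16 nodes, which matches both the 512 GB total and the 14 + 2 node split between the production and test partitions.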
Timeline
• Early 2003 - Continuing discussions with Cray on X1
• August 2003 - Detailed plan for X1 transition
• September 19, 2003 - Final approval for X1 acquisition
• November, 2003 - Factory Visit
• January 2, 2004 - System Delivered
• January 15, 2004 - Early User Access
• March 1, 2004 - Limited X1 Production
• March 8, 2004 - Full X1 Production
• April 24, 2004 - StorNext managed SAN put into production
Hardware experiences
• Fluorinert pumps
• Four of four pumps replaced
• One pump replaced twice, due to install error which ran it dry
• Bad batch has been identified by Cray
• First system shipped with 32GB per node
• Memory errors seen at factory
• 1GB DIMMs were out of timing spec
• Cray developed new memory test to validate DIMMs
• Resulted in a one-month shipping delay
• CPU and memory reliability have been excellent
UNICOS/mp experiences
• System stability
• Data compare errors
• Administration and Operations
System Stability
• Kernel panics
• We've seen very few since start of production in March
• Application migration is still disabled
• Job aborts due to node memory oversubscription
• Overall the OS has been stable
• Better than our expectation of one kernel panic per week
• Improvements continue
Data compare errors
• Problems arose during acceptance testing of disks
• Errors were extremely infrequent (one failure every 72 hours of testing)
• Concerns about data integrity
• Cray dedicated multiple systems to replicating the problem
• Root cause was lack of "I/O cache coherency"
• I/O started before data was flushed from cache to memory
• DMA picked up stale data
• Problem resolved by ensuring all data flushed before writes
• Similar timing windows closed in other I/O related code
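The root cause described above can be illustrated with a toy model. The sketch below is not Cray's code and does not model the X1 hardware; the class and method names are hypothetical. It only shows the general mechanism: a DMA engine that reads main memory directly sees stale data if dirty cache lines have not been written back first. The real problem was a rare timing window, whereas this model makes the failure deterministic for clarity.

```python
class ToyNode:
    """Toy CPU with a write-back cache and a DMA engine that reads main memory."""

    def __init__(self):
        self.memory = {}   # main memory: address -> value
        self.cache = {}    # dirty cache lines not yet written back

    def cpu_write(self, addr, value):
        # Write-back cache: the store lands in the cache only.
        self.cache[addr] = value

    def flush(self):
        # Write every dirty line back to main memory.
        self.memory.update(self.cache)
        self.cache.clear()

    def dma_read(self, addr):
        # DMA bypasses the cache and sees only main memory.
        return self.memory.get(addr)

    def disk_write(self, addr, flush_first):
        # Returns the value the DMA engine would send to disk.
        if flush_first:
            self.flush()
        return self.dma_read(addr)


def run(flush_first):
    node = ToyNode()
    node.cpu_write(0x100, "old")
    node.flush()                    # "old" reaches main memory
    node.cpu_write(0x100, "new")    # "new" is still only in the cache
    return node.disk_write(0x100, flush_first)


buggy = run(flush_first=False)   # I/O started before the flush
fixed = run(flush_first=True)    # all data flushed before the write
print(buggy, fixed)
```

Without the flush the model writes the old value to disk; with the flush it writes the new one, which mirrors the fix of ensuring all data is flushed before writes.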
Administration and Operations
• Experience with IRIX really helpful
• Disk configuration & backups
• Network
• PBSPro
• Accounting
Disk configuration
• All disks are LSI RAID
• Complex
• Leaned heavily on Cray support
• Limited guidance on performance tuning
• Delayed StorNext managed SAN for months
• Backups
• Weekly backups to tape, daily disk to disk copies
• Long term plan to rely heavily on StorNext
Network
• Lots of components - X1, CNS, CPES, X1-JS, backup CNS
• Very complicated - hope nothing breaks!
Boeing X1 Nets - PAM, RAID, StorNext
[Network diagram; layout not recoverable from the transcript. Labels that survive are listed below.]
• Hosts shown: X1, X1-JS, CPES, CWS, FSS0 (primary), FSS1 (alternate), CNS0 (primary), CNS1 (backup), CNS2 (test), SFM0, SFM1, ACSLS
• Each CNS connects to the X1 over IP/FC and to the customer network over GbE
• PAM networks: 10.0.104.0, 10.0.106.0, 10.0.109.0, 10.0.116.0
• StorNext network: 192.168.10.0
• SNFS net addresses
• 192.168.10.2 FSS-0
• 192.168.10.3 FSS-1
• 192.168.10.10 CPES
• 192.168.10.11 X1-JS
• 192.168.10.20-22 CNS
• /etc/hosts
• xxx.xxx.214.242 on fss0
• xxx.xxx.214.243 on fss1
• Establishes cvfsid (license) and ACSLS client identity
• fsroutes on all nodes to redirect MD traffic to SNFS net
PBSPro
• Solid and reliable for us
• Excellent documentation
• User documentation needed (What's an MPPE?)
• Good interaction with psched
• But PBS knows nothing about flexible vs. accelerated mode
• Disabled migration can leave applications stuck in posted queue
• Expect improvements at 2.4
Accounting
• No Cray System Accounting...
• But project accounting is supported
• Using a locally written program to sum usage by user, project
• Session id will be included in UNICOS 2.4 process records
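The locally written summing program is not shown in the slides. As a rough idea of what such a tool does, here is a minimal sketch that totals CPU time by user and project. The record layout (user, project, CPU seconds), the function name, and the sample data are all hypothetical; this is not the UNICOS/mp process record format.

```python
from collections import defaultdict


def sum_usage(records):
    """Sum CPU seconds per (user, project) pair.

    `records` is an iterable of (user, project, cpu_seconds) tuples.
    This layout is hypothetical, not the UNICOS/mp process record format.
    """
    totals = defaultdict(float)
    for user, project, cpu_seconds in records:
        totals[(user, project)] += cpu_seconds
    return dict(totals)


# Made-up sample records, for illustration only.
sample = [
    ("alice", "cfd-wing", 1200.0),
    ("bob", "cfd-wing", 300.0),
    ("alice", "cfd-wing", 800.0),
    ("alice", "nacelle", 50.0),
]

usage = sum_usage(sample)
for (user, project), secs in sorted(usage.items()):
    print(f"{user:8s} {project:10s} {secs:10.1f}")
```

Keying the totals on the (user, project) pair keeps one pass over the records sufficient for both per-user and per-project reports, since either can be derived by summing over the other field.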
Applications and Tuning
• Compilation time is still an issue
• Cray has provided significant help in getting our key applications performing well on the X1
• Overflow
• Tranair
• Other applications
Overflow
• CFD code developed at NASA Ames
• Parallel CFD code using MLP (Multi-Level Parallelism)
• Scales well to large number of nodes (256 or more)
• Runs very well on our NUMA-based system
• Initial performance was roughly 16X that of a 400 MHz CPU on our existing Overflow system
• After tuning by Cray personnel, performance is now 25X
• Code changes have been integrated to NASA's version
Tranair
• Boeing developed CFD code
• Adaptive grid single-CPU FORTRAN code
• Out-of-core solver - made heavy use of SSD on T-90s
• Highly optimized for T-90 series
• Large memory requirement - 7-12GB per job, desire larger
• Forced to use MSP mode due to memory needs
• Initial speed ratio was 0.9-1.18 relative to T-90, both MSP and SSP
• Current MSP speed is 2.0 times that of the T-90
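The speed ratios above translate directly into tuning gain and job runtime. The short calculation below uses only the ratios from the slide; the 10-hour job length and the helper name `x1_runtime` are hypothetical and chosen purely for illustration.

```python
INITIAL_RATIO = (0.9, 1.18)   # from the slide: initial X1 speed relative to T-90
CURRENT_RATIO = 2.0           # from the slide: MSP mode after tuning


def x1_runtime(t90_hours, ratio):
    """Runtime on the X1 for a job that takes `t90_hours` on the T-90."""
    return t90_hours / ratio


T90_HOURS = 10.0  # hypothetical job length, for illustration only

# Gain from tuning, measured against both ends of the initial range.
gain_low = CURRENT_RATIO / INITIAL_RATIO[1]
gain_high = CURRENT_RATIO / INITIAL_RATIO[0]

print(f"tuning gain: {gain_low:.2f}x to {gain_high:.2f}x")
print(f"a {T90_HOURS:.0f} h T-90 job takes {x1_runtime(T90_HOURS, CURRENT_RATIO):.1f} h on the X1")
```

In other words, tuning improved Tranair by roughly 1.7x to 2.2x over its initial X1 performance, depending on which end of the initial range is taken as the baseline.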
Support
• Cray has provided excellent support throughout the process
• Initial hardware and software setup
• Local technical support
• Technical folks in Chippewa Falls and Mendota
• Hardware resources to reproduce and debug problems
• Excellent training & documentation
• Cray’s assistance was essential to the success of our installation on a very aggressive schedule
Summary
• Significant hardware and software issues were encountered and resolved
• Memory DIMM problems
• Data compare errors
• Kernel panics
• StorNext teething pains
• Cray has provided great support throughout the process
• Users are happy with the turnaround and overall reliability of the X1
• The X1 is already a key part of our CFD design process
Coming soon…