![Page 1: A Study of Network Congestion in Two Supercomputing High-Speed Interconnects](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/1.jpg)
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
Saurabh Jha*, Archit Patke*, Jim Brandt^, Ann Gentile^, Mike Showerman^, Eric Roman^^, Zbigniew Kalbarczyk*, Bill Kramer*, Ravi Iyer*
* UIUC/NCSA, ^ Sandia National Lab, ^^ NERSC
Presented on August 16, 2019 at HOTI 26
Email: sjha8@Illinois.edu
![Page 2: DOE HPC Facilities](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/2.jpg)
DOE HPC Facilities
Source: hpcwire
HPC interconnect technologies:
• Cray Gemini (3D Torus), Aries (DragonFly), Slingshot (DragonFly)
• Mellanox InfiniBand (Non-blocking Fat Tree)
• IBM BG/Q proprietary technology (5D Torus)
![Page 3: Studied HPC Systems](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/3.jpg)
Studied HPC Systems
• NCSA Blue Waters: Cray Gemini (3-D Torus) interconnect
• NERSC Edison: Cray Aries (DragonFly) interconnect
![Page 4: Mystery of Application Performance Variation](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/4.jpg)
Mystery of Application Performance Variation
MILC Completion Time (NERSC Edison); NAMD Completion Time (NCSA Blue Waters)
Variation caused by:
• Interference from other applications
• Non-optimal configuration settings (too many knobs!), e.g., huge pages, placement, message size, node sharing
[Plots: runtime vs. run number for each application. NAMD (NCSA Blue Waters): mean runtime 318 min, standard deviation 103 min. MILC (NERSC Edison): mean runtime 576 s, standard deviation 100 s.]
![Page 5: Monet: End-to-End Interconnect Monitoring Workflow](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/5.jpg)
Monet: End-to-End Interconnect Monitoring Workflow
![Page 6: Data Measurement and Metrics](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/6.jpg)
Data Measurement and Metrics
• Congestion measured in terms of Percent Time Stalled (PTS):
  $P_{TS} = 100 \times \frac{T_s}{T_I}$
  where $T_s$ is the time spent in the stalled state and $T_I$ is the measurement interval.
• On Blue Waters: $T_I$ is 60 s; data gathered: ~700 MB/minute
• On Edison: $T_I$ is 1 s; data gathered: ~7.5 GB/minute
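To make the metric concrete, here is a minimal sketch (not the Monet pipeline itself) of computing PTS for one link from a stall-time counter sampled over one measurement interval; the function and argument names are hypothetical.

```python
# Minimal sketch: Percent Time Stalled (PTS) for one link over one interval.
# PTS = 100 * T_s / T_I, where T_s is the time the link spent stalled and
# T_I is the measurement interval.

def percent_time_stalled(stalled_seconds: float, interval_seconds: float) -> float:
    """Return PTS (%) for one link over one measurement interval."""
    if interval_seconds <= 0:
        raise ValueError("measurement interval must be positive")
    # Clamp to [0, 100] to guard against counter jitter at interval boundaries.
    return max(0.0, min(100.0, 100.0 * stalled_seconds / interval_seconds))

# Example: a Blue Waters-style 60 s interval in which a link was stalled for 9 s.
print(percent_time_stalled(stalled_seconds=9.0, interval_seconds=60.0))  # 15.0
```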
![Page 7: Characterizing Congestion On Toroidal Networks](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/7.jpg)
Characterizing Congestion On Toroidal Networks
[Plot: correlation between NAMD completion time and network congestion (NCSA Blue Waters): execution time (min) vs. max of average PTS across all regions overlapping the application topology (%), with runs labeled Neg/Low/Med/High. Also shown: congestion cloud visualization (NCSA Blue Waters) with Low/Medium/High regions.]
Performance variation depends on the duration, size, and intensity of congestion clouds.
Neg: 0-5%, Low: 5-15%, Medium: 15-25%, High > 25%
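As a small illustration of this binning (the function name is ours, not from the Monet tooling), a sketch that maps a PTS value to the severity classes defined above:

```python
# Map a link's PTS value to the congestion severity classes used in the study:
# Neg: 0-5%, Low: 5-15%, Medium (Med): 15-25%, High: > 25%.
def congestion_class(pts: float) -> str:
    if pts < 5:
        return "Neg"
    if pts < 15:
        return "Low"
    if pts < 25:
        return "Med"
    return "High"

print([congestion_class(p) for p in (2.0, 8.5, 20.0, 40.0)])  # ['Neg', 'Low', 'Med', 'High']
```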
![Page 8: Characterizing Congestion On Toroidal Networks](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/8.jpg)
Characterizing Congestion On Toroidal Networks
[Plot: Hotspot link duration (NCSA Blue Waters): duration (minutes) vs. PTS threshold (%), showing the median, 99th percentile, and 99.9th percentile.]
• Continuous presence of highly congested links
• 95% of the operation time contained a region with a minimum size of 20 links
• Maximum region size: 6,904 links (17%)
• Average congestion duration: 16 minutes; 95th percentile: 16 hours
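A minimal sketch of the hotspot-duration measurement underlying the plot above: how long a link stays continuously at or above a PTS threshold, given one PTS sample per measurement interval. The data layout and names are assumptions, not the actual analysis code.

```python
# Sketch: longest continuous run (in minutes) during which a link's PTS stays
# at or above a threshold, with one PTS sample per measurement interval.
# Assumes a 60 s interval, as on Blue Waters.

def longest_hotspot_minutes(pts_series, threshold_pct, interval_s=60):
    longest = current = 0
    for pts in pts_series:
        current = current + 1 if pts >= threshold_pct else 0
        longest = max(longest, current)
    return longest * interval_s / 60.0

# Example: a link at >= 25% PTS for three consecutive 60 s intervals.
print(longest_hotspot_minutes([30, 28, 26, 4, 2], threshold_pct=25))  # 3.0
```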
[Diagram: congestion-cloud state transitions among the Neg, Low, Med, and High states, annotated with transition probabilities and the percentage of total system time spent in each state: Neg 98.29%, Low 1.36%, Med 0.25%, High 0.10%.]
![Page 9: Comparing with DragonFly Network – Link Hotspot Characterization](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/9.jpg)
Comparing with DragonFly Network – Link Hotspot Characterization
[Plot: Hotspot link duration on Aries (NERSC Edison): duration (minutes) vs. PTS threshold (%), showing the median, 99th percentile, and 99.9th percentile.]
• Duration of continuous congestion on links is significantly reduced on the DragonFly network
• Hotspots continue to exist
[Plot: Hotspot link duration (NCSA Blue Waters), repeated for comparison: duration (minutes) vs. PTS threshold (%), median, 99th, and 99.9th percentiles.]
![Page 10: Measuring Load Imbalance: Equivalence Groups](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/10.jpg)
Measuring Load Imbalance: Equivalence Groups
Example: N1 sending data to N2 over links L1 and L2, which belong to the same equivalence group.
• Congestion Imbalanced Scenario: L1 at 20% PTS, L2 at 10% PTS
• Congestion Balanced Scenario (Ideal): L1 at 15% PTS, L2 at 15% PTS
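A minimal sketch of the imbalance measure this example implies: the spread (range) of PTS across the links of one equivalence group, which is the quantity whose distribution appears on the next slide. The names and data layout are illustrative only.

```python
# Sketch: quantify load imbalance within an equivalence group as the range
# (max - min) of PTS across its links. A perfectly balanced group has range 0.
def pts_range(group_pts):
    values = list(group_pts)
    return max(values) - min(values)

imbalanced = {"L1": 20.0, "L2": 10.0}   # imbalanced scenario from the slide
balanced = {"L1": 15.0, "L2": 15.0}     # balanced (ideal) scenario

print(pts_range(imbalanced.values()))   # 10.0
print(pts_range(balanced.values()))     # 0.0
```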
![Page 11: Comparing with DragonFly Network – Load Imbalance Characterization](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/11.jpg)
Comparing with DragonFly Network – Load Imbalance Characterization
[Plot: inverse CDF of the range of PTS within equivalence groups (%), shown separately for local and global links.]
• Existence of load imbalance on local links
• Load balance significantly improved on global links
Congestion clouds exist for long durations (NCSA Blue Waters); the DragonFly measurements shown here are from a Cray testbed system.
![Page 12: Demo – 5 minutes of congestion viz on Edison](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/12.jpg)
Demo – 5 minutes of congestion visualization on Edison
[Visualization hierarchy: Cabinet > Chassis > Blade > Aries switch. Color legend: >15% PTS is dark blue; 0%-15% is a green gradient.]
• Duration of congestion hotspots is short
• Hotspots move rapidly, and applications continue to experience congestion
![Page 13: Going Forward: Taking a Systems Approach for Minimizing Performance Variation](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/13.jpg)
Going Forward: Taking a Systems Approach for Minimizing Performance Variation
![Page 14: Sources of Contention](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/14.jpg)
Sources of Contention
Multiple sources of contention/congestion along the path of traffic flowing from P1 to P4 (contention points 1-5 in the diagram):
• <src> processor and <dst> processor: processor not ready to send/receive data; reasons include cache conflicts, TLB, and OS resources
• <src>/<dst> NIC: NIC buffers busy
• Network switches: head-of-line blocking due to busy buffers
(Stall metrics labeled in the diagram: proc2nic, nic2proc, hl2hl.)
![Page 15: ML-driven Performance Prediction in Aries](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/15.jpg)
ML-driven Performance Prediction in Aries
| Feature Type | Feature List | Avg R² |
| --- | --- | --- |
| Application and Scheduler Specific | app, hugepages, placement, balance, avg hops | 0.64 |
| Network Specific | nic2proc, proc2nic, hl2hl PTS & SPF | 0.86 |
| All features | All of the above | 0.91 |
• Diagnosis for understanding performance variation
• Understanding sensitivity to each configuration parameter
• Scheduling policy
Method: ensemble models with k-fold cross-validation (a minimal sketch follows the list below).
1. Increase observability into system and applications
2. Backpressure model for contention using probabilistic methods
3. Use explainable machine-learning methods
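A hedged sketch of the modeling setup named above: an ensemble regressor evaluated with k-fold cross-validation to predict runtime from application-, scheduler-, and network-specific features. The feature names mirror the table, but the data file, exact columns, and model choice here are assumptions, not the authors' implementation.

```python
# Sketch: predict application runtime from scheduler- and network-specific
# features using an ensemble regressor with k-fold cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("runs.csv")  # hypothetical per-run dataset
features = ["hugepages", "placement", "balance", "avg_hops",
            "nic2proc_pts", "proc2nic_pts", "hl2hl_pts"]
X, y = df[features], df["runtime_s"]

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, scoring="r2",
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Average R^2 across folds:", scores.mean())
```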
![Page 16: Conclusion and Future Work](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/16.jpg)
Conclusion and Future Work
Conclusion
• Congestion studies across generations of production systems help improve understanding of:
  • Application performance variation: diagnostic models
  • Parameter tuning: selecting optimal parameters
  • Network design: load imbalance continues to be a problem

Future work
• Characterize and understand upcoming QoS features in Slingshot
• ML-driven scheduling for HPC kernels
![Page 17: References](https://reader035.fdocuments.in/reader035/viewer/2022071019/5fd274bb08993c21022ecc48/html5/thumbnails/17.jpg)
References