An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments
G. Narayanaswamy, P. Balaji, and W. Feng
Dept. of Computer Science, Virginia Tech
Mathematics and Computer Science, Argonne National Laboratory
High-end Computing Trends
• High-end Computing (HEC) systems continue to increase in scale and capability
• Multicore architectures
  – A significant driving force for this trend
  – Quad-core processors from Intel/AMD
  – IBM Cell, Sun Niagara, Intel Terascale processor
• High-speed network interconnects
  – 10-Gigabit Ethernet (10GE), InfiniBand, Myrinet, Quadrics
  – Different stacks use different amounts of hardware support
• How do these two components interact with each other?
Multicore Architectures
• Multi-processor vs. multicore systems
  – Not all of the processor hardware is replicated for multicore systems
  – Hardware units such as caches might be shared between the different cores
  – Multiple processing units embedded on the same processor die → inter-core communication is faster than inter-processor communication
• On most architectures (Intel, AMD, Sun), all cores are equally powerful → makes scheduling easier
Interactions of Protocols with Multicores
• Depending on how the stack works, different protocols have different interactions with multicore systems
• Study based on host-based TCP/IP and iWARP
• TCP/IP has significant interaction with multicore systems
  – Large impact on application performance
• The iWARP stack itself does not interact directly with multicore systems
  – Software libraries built on top of iWARP DO interact (buffering of data, copies)
  – Interaction similar to other high-performance protocols (InfiniBand, Myrinet MX, QLogic PSM)
TCP/IP Interaction vs. iWARP Interaction
[Diagram: with TCP/IP, packets arrive from the network into the in-kernel TCP/IP stack, which handles packet processing for multiple application processes; with iWARP, packet processing is offloaded to the network and each application process links against its own library.]
• TCP/IP is in some ways more asynchronous, or "centralized," with respect to host processing than iWARP (or other high-performance software stacks)
• TCP/IP: host processing is independent of the application process (statically tied to a single core)
• iWARP: host processing is closely tied to the application process
Presentation Layout
• Introduction and Motivation
• Treachery of Multicore Architectures
• Application Process to Core Mapping Techniques
• Conclusions and Future Work
MPI Bandwidth over TCP/IP
[Figure: two panels, Intel Platform and AMD Platform. x-axis: Message Size (bytes), 1 B to 4 MB; y-axis: Bandwidth (Mbps). Four series: Core 0 through Core 3.]
MPI Bandwidth over iWARP
[Figure: two panels, Intel Platform and AMD Platform. x-axis: Message Size (bytes), 1 B to 4 MB; y-axis: Bandwidth (Mbps). Four series: Core 0 through Core 3.]
TCP/IP Interrupts and Cache Misses
[Figure: two panels. Left, Hardware Interrupts: x-axis Message Size (bytes), 1 B to 4 MB; y-axis Interrupts per Message on a log scale (0.01 to 100000). Right, L2 Cache Misses: same x-axis; y-axis Percentage Difference (-50 to 250). Four series in each: Core 0 through Core 3.]
MPI Latency over TCP/IP (Intel Platform)
[Figure: two panels, Small Message Latency (message sizes 1 B to 4 KB) and Large Message Latency (128 KB to 4 MB), each with four series: Core 0 through Core 3.]
Presentation Layout
• Introduction and Motivation
• Treachery of Multicore Architectures
• Application Process to Core Mapping Techniques
• Conclusions and Future Work
Application Behavior Pre-analysis
• A four-core system is effectively a 3.5-core system
  – A part of a core has to be dedicated to communication
  – Interrupts, cache misses
• How do we schedule 4 application processes on 3.5 cores?
• If the application is exactly synchronized, there is not much we can do
• Otherwise, we have an opportunity!
• Study with GROMACS and LAMMPS
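On Linux, the building block for this kind of scheduling is CPU affinity: an application process can be pinned to a chosen core so that another core stays free for the kernel's communication processing. A minimal sketch (illustrative only, not the mechanism used in the paper):

```python
import os

def pin_to_core(core_id):
    """Pin the calling process to a single core (Linux-specific)."""
    os.sched_setaffinity(0, {core_id})   # 0 = the calling process
    return os.sched_getaffinity(0)       # the set of cores we may now run on

print(pin_to_core(0))  # → {0}
```

`taskset -c <core> <prog>` achieves the same thing from the command line at launch time.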
GROMACS Overview
• Developed by Groningen University
• Simulates the molecular dynamics of biochemical particles
• The root distributes a "topology" file corresponding to the molecular structure
• Simulation time is broken down into a number of steps
  – Processes synchronize at each step
• Performance reported as the number of nanoseconds of molecular interactions that can be simulated each day

| | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 |
|---|---|---|---|---|---|---|---|---|
| Combination A | 0 | 4 | 2 | 6 | 7 | 3 | 5 | 1 |
| Combination B | 0 | 2 | 4 | 6 | 5 | 1 | 3 | 7 |
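Reading the table as "the process rank placed on each core" (an assumption about the slide's notation), one way to realize such a combination is to invert it into a rank→core map and pin each rank at launch. A sketch; the helper names and `./app` are hypothetical:

```python
# Combination A from the table: core -> process rank placed on it (assumed reading).
combination_a = {0: 0, 1: 4, 2: 2, 3: 6, 4: 7, 5: 3, 6: 5, 7: 1}

def rank_to_core(mapping):
    """Invert a core->rank table into rank->core for launching pinned processes."""
    return {rank: core for core, rank in mapping.items()}

def launch_cmd(rank, mapping, prog="./app"):
    # Hypothetical launch line: pin this rank to its assigned core via taskset.
    return f"taskset -c {rank_to_core(mapping)[rank]} {prog}"

print(launch_cmd(4, combination_a))  # → taskset -c 1 ./app
```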
GROMACS: Random Scheduling
[Figure: left panel, GROMACS LZM application performance (ns/day) over TCP/IP and iWARP for Combinations A and B; right panel, per-core time breakdown (Computation, MPI_Wait, other MPI calls) for cores 0–3 on Machine 1 and Machine 2.]
GROMACS: Selective Scheduling
[Figure: left panel, GROMACS LZM application performance (ns/day) over TCP/IP and iWARP for Combinations A, B, A', and B'; right panel, per-core time breakdown (Computation, MPI_Wait, other MPI calls) for cores 0–3 on Machine 1 and Machine 2.]
LAMMPS Overview
• Molecular dynamics simulator developed at Sandia
• Uses spatial decomposition techniques to partition the simulation domain into smaller 3-D subdomains
  – Each subdomain is allotted to a different process
  – Interaction is required only between neighboring subdomains, which improves scalability
• Used the Lennard-Jones liquid simulation within LAMMPS
[Diagram: two four-core machines (Cores 0–3 each) connected by a network.]
LAMMPS: Random Scheduling
[Figure: left panel, LAMMPS communication time (seconds) over TCP/IP and iWARP for Combinations A and B; right panel, per-core breakdown (MPI_Wait, MPI_Send, other MPI calls) for cores 0–3 on Machine 1 and Machine 2.]
LAMMPS: Intended Communication Pattern
[Diagram: two processes proceeding in lockstep. In each step, both compute, then each calls MPI_Send() to its neighbor, posts MPI_Irecv() for the neighbor's data, and calls MPI_Wait() for the receive to complete, repeating every step.]
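The step above (send, post a nonblocking receive, then wait on it) can be sketched with Python threads and queues standing in for the MPI calls; all names here are illustrative, not LAMMPS code:

```python
import threading
import queue

def exchange_step(my_data, inbox, outbox):
    """One step of the slide's pattern: MPI_Send(), MPI_Irecv(), MPI_Wait()."""
    outbox.put(my_data)                 # MPI_Send() to the neighbor
    result = {}
    recv_t = threading.Thread(target=lambda: result.update(msg=inbox.get()))
    recv_t.start()                      # MPI_Irecv(): receive proceeds in the background
    recv_t.join()                       # MPI_Wait(): block until the receive completes
    return result["msg"]

# Two "ranks" exchanging halo data through one queue per direction.
q01, q10 = queue.Queue(), queue.Queue()
out = {}
t0 = threading.Thread(target=lambda: out.update(r0=exchange_step("halo-from-0", q10, q01)))
t1 = threading.Thread(target=lambda: out.update(r1=exchange_step("halo-from-1", q01, q10)))
t0.start(); t1.start(); t0.join(); t1.join()
# out == {"r0": "halo-from-1", "r1": "halo-from-0"}
```

Because the receive is posted without blocking, each process can overlap it with other work; the next slide shows what happens when the two sides drift out of step.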
LAMMPS: Actual Communication Pattern
[Diagram: each MPI_Send() copies data through the MPI buffer and socket send buffer into the peer's socket receive buffer and application receive buffer. The "slower" core falls behind the faster core, whose process finishes its computation first and blocks in MPI_Wait(), producing "out-of-sync" communication between processes.]
LAMMPS: Selective Scheduling
[Figure: left panel, LAMMPS communication time (seconds) over TCP/IP and iWARP for Combinations A, B, A', and B'; right panel, per-core breakdown (MPI_Wait, MPI_Send, other MPI calls) for cores 0–3 on Machine 1 and Machine 2.]
Presentation Layout
• Introduction and Motivation
• Treachery of Multicore Architectures
• Application Process to Core Mapping Techniques
• Conclusions and Future Work
Concluding Remarks and Future Work
• Multicore architectures and high-speed networks are becoming prominent in high-end computing systems
  – The interaction of these components is important and interesting!
  – For TCP/IP, scheduling order drastically impacts performance
  – For iWARP, scheduling order has no overhead
  – Scheduling processes in a more intelligent manner allows significantly improved application performance
  – Does not impact iWARP and other high-performance stacks, making the approach portable as well as efficient
• Future work: dynamic process-to-core scheduling!
Thank You
Contacts:
Ganesh Narayanaswamy: [email protected]
Pavan Balaji: [email protected]
Wu-chun Feng: [email protected]
For More Information:
http://synergy.cs.vt.edu
http://www.mcs.anl.gov/~balaji
Backup Slides
MPI Latency over TCP/IP (AMD Platform)
[Figure: two panels, Small Message Latency (message sizes 1 B to 4 KB) and Large Message Latency (128 KB to 4 MB), each with four series: Core 0 through Core 3.]