Ethernet and TCP optimizations
jeff-squyres -
Transcript of Ethernet and TCP optimizations
![Page 1: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/1.jpg)
© 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
Ethernet: Hidden Secrets
Jeff Squyres
![Page 2: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/2.jpg)
First: some background information…
![Page 3: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/3.jpg)
Jeff’s work: Parallel computing at Cisco
Using lots and lots and lots of servers simultaneously to solve one computational problem
![Page 4: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/4.jpg)
Supercomputing applications
Racks of 36 1U servers
Tend to send lots and lots and lots of small messages across the network to stay in sync with each other
![Page 5: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/5.jpg)
Network message traversal
Underlying network
Send a message
Receive the message
A B
![Page 6: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/6.jpg)
Network message traversal
Underlying network
Send a message
Receive the message
Today’s fastest networks: 1-3μs (!)
A B
![Page 7: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/7.jpg)
Today’s fastest networks
• Typically not Ethernet networks
• Usually have supercomputer-specific networks
  • Example: highly tuned for short message latency
• …but that is changing
Ethernet Ethernot
![Page 8: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/8.jpg)
Cisco’s ultra low latency Ethernet
• Userspace NIC (“USNIC”)
  • Expose Cisco NIC hardware directly to Linux userspace
  • Bypass the OS
  • Bypass the TCP stack
• Send raw Ethernet frames directly from user applications
  • Much, much faster than traditional TCP-based networking
  • Especially for latency of short messages
![Page 9: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/9.jpg)
Normal TCP software architecture
Userspace: Application, MPI library, userspace sockets library
Kernel: TCP / IP stack, Cisco VIC driver
Cisco VIC hardware
![Page 10: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/10.jpg)
Cisco USNIC software
Userspace: Application, MPI library, userspace verbs library
Kernel: Verbs IB core, Cisco USNIC driver
Cisco VIC hardware
Bootstrapping and setup
Send and receive fast path
![Page 11: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/11.jpg)
With all that background…
![Page 12: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/12.jpg)
Doing some performance testing last week…
Two servers
![Page 13: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/13.jpg)
Doing some performance testing last week…
Two servers
Each with a 2 x 10Gb NIC
![Page 14: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/14.jpg)
Doing some performance testing last week…
Two servers
Each with a 2 x 10Gb NIC
Connected back-to-back
![Page 15: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/15.jpg)
“Ping pong” latency test
Send a message from here
Receive the message here
Ping!
![Page 16: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/16.jpg)
“Ping pong” latency test
Get the message back
Send the message back
Pong!
![Page 17: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/17.jpg)
“Ping pong” latency test
Because each ping and pong are soooo short, do this ping-pong exchange N times
Ping! / Pong!
![Page 18: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/18.jpg)
“Ping pong” latency test
$$\text{Time for one ping-pong} = \frac{\text{Total time for } N \text{ ping-pongs}}{N}$$
![Page 19: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/19.jpg)
“Ping pong” latency test
$$\text{Time for one ping-pong} = \frac{\text{Total time for } N \text{ ping-pongs}}{N}$$
$$\text{Time for one ping} = \frac{\text{Time for one ping-pong}}{2}$$
![Page 20: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/20.jpg)
Time for one ping = half-round trip (HRT) ping-pong latency
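As a rough illustration of that arithmetic, here is a minimal Python sketch of a ping-pong test. It is not the benchmark used for these slides: both sides run on one machine over a local `socket.socketpair()` as a stand-in for two servers, and the names `recv_exact`, `pong_side`, and `half_round_trip` are made up for this example.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket (reads may be partial)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def pong_side(sock, msg_size, iterations):
    """Side B: receive each ping and send it straight back."""
    for _ in range(iterations):
        sock.sendall(recv_exact(sock, msg_size))


def half_round_trip(msg_size=1, iterations=1000):
    """Side A: time N ping-pongs, divide by N, then divide by 2."""
    # Assumption: a local socket pair stands in for the two servers.
    side_a, side_b = socket.socketpair()
    responder = threading.Thread(
        target=pong_side, args=(side_b, msg_size, iterations)
    )
    responder.start()

    payload = b"x" * msg_size
    start = time.perf_counter()
    for _ in range(iterations):
        side_a.sendall(payload)          # ping
        recv_exact(side_a, msg_size)     # pong
    total = time.perf_counter() - start

    responder.join()
    side_a.close()
    side_b.close()

    time_per_ping_pong = total / iterations
    return time_per_ping_pong / 2        # HRT latency, in seconds


if __name__ == "__main__":
    print(f"HRT latency: {half_round_trip() * 1e6:.1f} us")
```

The numbers this prints say nothing about a real network; the point is only the shape of the measurement: many iterations, one timer around the whole loop, then two divisions.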
![Page 21: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/21.jpg)
Results: using 1x10G Ethernet port
1 byte: ~60μs
8MB: ~150ms
![Page 22: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/22.jpg)
Results: using 2x10G Ethernet ports
1 port: 1 byte ~60μs, 8MB ~150ms
2 ports: 1 byte ~30μs (!), 8MB ~8.3ms
![Page 23: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/23.jpg)
Results: using 2x10G Ethernet ports
1 port: 1 byte ~60μs, 8MB ~150ms
2 ports: 1 byte ~30μs (!), 8MB ~8.3ms
WHOA!
![Page 24: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/24.jpg)
Results: just the small messages
The facts:
• From 1-1024 bytes: flat latency
• Using 1 interface: ~60μs
• Using 2 interfaces: ~30μs
![Page 25: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/25.jpg)
Results: just the small messages
The facts:
• From 1-1024 bytes: flat latency
• Using 1 interface: ~60μs
• Using 2 interfaces: ~30μs
WHY?
![Page 26: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/26.jpg)
Must look at how TCP works…
1. Ethernet frame arrives
![Page 27: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/27.jpg)
Must look at how TCP works…
1. Ethernet frame arrives
2. NIC sends interrupt to OS Ethernet driver
![Page 28: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/28.jpg)
Must look at how TCP works…
1. Ethernet frame arrives
2. NIC sends interrupt to OS Ethernet driver
3. OS Ethernet driver copies the packet to RAM
![Page 29: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/29.jpg)
Must look at how TCP works…
1. Ethernet frame arrives
2. NIC sends interrupt to OS Ethernet driver
3. OS Ethernet driver copies the packet to RAM
4. OS TCP stack hands packet off to (whatever)
![Page 30: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/30.jpg)
The Costco Rule
It’s always better in bulk
![Page 31: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/31.jpg)
Why copy one packet at a time?
Let’s optimize this part
![Page 32: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/32.jpg)
Two (commonly used) optimizations
1. Copy a bunch of packets across PCI at one time
![Page 33: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/33.jpg)
Two (commonly used) optimizations
1. Copy a bunch of packets across PCI at one time
2. Only raise one interrupt for all of those packet copies
![Page 34: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/34.jpg)
Two (commonly used) optimizations
1. Copy a bunch of packets across PCI at one time
2. Only raise one interrupt for all of those packet copies
A.k.a. “Interrupt Coalescing”
![Page 35: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/35.jpg)
Interrupt coalescing
1. Ethernet frame arrives
![Page 36: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/36.jpg)
Interrupt coalescing
1. Ethernet frame arrives
2. Has N time passed since we sent an interrupt to the OS?
![Page 37: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/37.jpg)
Interrupt coalescing
1. Ethernet frame arrives
2. Has N time passed since we sent an interrupt to the OS?
✖ No: queue up the frame
![Page 38: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/38.jpg)
Interrupt coalescing
1. Ethernet frame arrives
2. Has N time passed since we sent an interrupt to the OS?
✖ No: queue up the frame
✔ Yes: Send all queued frames and interrupt
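The decision above can be sketched as a tiny Python model. This is a toy, not NIC firmware: the class name `CoalescingNic` and its fields are invented for this example, and it only reacts when a frame arrives, exactly as the two steps on the slide describe (real hardware also flushes when the timer itself fires).

```python
class CoalescingNic:
    """Toy model of the slide's rule: queue frames until N time has passed."""

    def __init__(self, timer_us):
        self.timer_us = timer_us          # "N time" from the slide
        self.last_interrupt_us = 0.0      # when we last interrupted the OS
        self.queue = []                   # frames waiting on the NIC
        self.interrupts = 0               # how many interrupts were raised

    def frame_arrives(self, now_us, frame):
        """Return the frames handed to the OS by this arrival (maybe none)."""
        self.queue.append(frame)
        if now_us - self.last_interrupt_us < self.timer_us:
            return []                     # No: queue up the frame
        # Yes: send all queued frames and raise a single interrupt
        delivered, self.queue = self.queue, []
        self.last_interrupt_us = now_us
        self.interrupts += 1
        return delivered


if __name__ == "__main__":
    nic = CoalescingNic(timer_us=125)
    for now_us, frame in [(10, "f1"), (50, "f2"), (130, "f3"), (140, "f4"), (260, "f5")]:
        print(now_us, nic.frame_arrives(now_us, frame))
    print("interrupts:", nic.interrupts)
```

With a 125μs timer, five frames cost two interrupts in this model; with `timer_us=0`, every frame raises its own interrupt.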
![Page 39: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/39.jpg)
Ok… So what?
![Page 40: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/40.jpg)
The key: NIC interrupt coalescing timers
![Page 41: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/41.jpg)
Timeline of a ping pong
NIC A
NIC B
1. A sends ping frame
2. B receives ping frame
Periodic interrupt coalescing timeout: 125μs
![Page 42: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/42.jpg)
Timeline of a ping pong
NIC A
NIC B
3. Coalesce timer expires; B sends interrupt
4. B sends pong frame
![Page 43: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/43.jpg)
Timeline of a ping pong
NIC A
NIC B
5. Coalesce timer expires; A sends interrupt
6. A sends ping frame
7. Rinse, repeat
![Page 44: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/44.jpg)
Timeline of a ping pong
NIC A
NIC B
4 ping-pongs in ~8x timer duration
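The timeline can be replayed with a small Python model. It is a sketch under stated assumptions, not a description of the actual hardware: each NIC has a strictly periodic timer, a received frame is only handed to the OS at the next timer expiry, the reply is sent at that same instant, and the wire time is a made-up 1μs. The function names are invented for this example.

```python
import math


def next_expiry(t_us, period_us, phase_us):
    """First expiry at or after t_us of a timer firing at phase + k * period."""
    k = math.ceil((t_us - phase_us) / period_us)
    return phase_us + k * period_us


def simulate_ping_pong(n, period_us, phase_a_us, phase_b_us, wire_us=1.0):
    """Return the average half-round-trip time (in us) over n ping-pongs."""
    start = t = phase_a_us  # A sends the first ping at one of its expiries
    for _ in range(n):
        # Ping reaches B, waits for B's timer; B then sends the pong.
        t = next_expiry(t + wire_us, period_us, phase_b_us)
        # Pong reaches A, waits for A's timer; A then sends the next ping.
        t = next_expiry(t + wire_us, period_us, phase_a_us)
    return (t - start) / n / 2


if __name__ == "__main__":
    # Timers lined up as drawn on the slide: 4 ping-pongs take 8 periods.
    print(simulate_ping_pong(4, 125.0, 0.0, 0.0))       # 125.0
    # Timers half a period apart: one ping-pong per period.
    print(simulate_ping_pong(1000, 125.0, 0.0, 62.5))   # 62.5
```

In this model the aligned case gives 4 ping-pongs in 8 timer periods, as drawn. Shifting B's timer by half a period gives an average HRT of 62.5μs, which is in the same range as the ~60μs measured on one port.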
![Page 45: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/45.jpg)
Timeline of a ping pong
NIC A
NIC B
In general, coalescing interrupts is a very Very Good Thing
![Page 46: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/46.jpg)
Timeline of a ping pong
NIC A
NIC B
But it definitely hurts low-latency traffic
![Page 47: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/47.jpg)
How do we reduce those artificial delays?
![Page 48: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/48.jpg)
Two Ethernet ports with out-of-sync timers
Port 0: NIC A, NIC B
Port 1: NIC A, NIC B
![Page 49: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/49.jpg)
Get more round trips in same amount of time
Port 0: NIC A, NIC B
Port 1: NIC A, NIC B
![Page 50: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/50.jpg)
Get more round trips in same amount of time
Port 0: NIC A, NIC B
Port 1: NIC A, NIC B
In reality, the sender and receiver timers on each port are wholly unrelated; they don’t line up as nicely as in these examples.
Meaning: in general, you usually get better overlap
![Page 51: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/51.jpg)
Results: just the small messages
~60μs
~30μs
In this case, we got such good asymmetry that the 2-port case is ~2x as fast (i.e., roughly twice as many interrupts in the same amount of time)
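A toy Python model shows how out-of-sync timers on two ports could produce roughly that 2x. Everything specific here is an assumption for illustration: messages are striped round-robin across the ports (the slides do not say how traffic was spread), each port on each NIC has its own strictly periodic 125μs timer, the wire time is a made-up 1μs, and the timer phases are hand-picked to be favorable.

```python
import math


def next_expiry(t_us, period_us, phase_us):
    """First expiry at or after t_us of a timer firing at phase + k * period."""
    k = math.ceil((t_us - phase_us) / period_us)
    return phase_us + k * period_us


def simulate_striped(n, period_us, phases_a, phases_b, wire_us=1.0):
    """Average half-round-trip time (us) with ping-pongs striped over ports.

    phases_a[p] / phases_b[p] are the timer phases of port p on NIC A / NIC B.
    Ping-pong number i uses port i % len(phases_a) (hypothetical round-robin).
    """
    ports = len(phases_a)
    start = t = phases_a[0]
    for i in range(n):
        port = i % ports
        t = next_expiry(t + wire_us, period_us, phases_b[port])  # B interrupts
        t = next_expiry(t + wire_us, period_us, phases_a[port])  # A interrupts
    return (t - start) / n / 2


if __name__ == "__main__":
    one_port = simulate_striped(1000, 125.0, [0.0], [62.5])
    two_ports = simulate_striped(1000, 125.0, [0.0, 62.5], [93.75, 31.25])
    print(f"1 port:  {one_port:.1f} us")
    print(f"2 ports: {two_ports:.1f} us")
```

With these phases the model fits two ping-pongs into each timer period instead of one, so the average HRT drops from about 62.5μs to about 31μs. Less favorable phases give a smaller gain, which is the "your mileage may vary" part.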
![Page 52: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/52.jpg)
Lies, damn lies, and statistics
Remember: these are AVERAGE latencies!
Individual ping-pong times are the same as the 1 port case (from the network)
…but you get higher throughput because we’re reducing the gaps between each ping-pong
![Page 53: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/53.jpg)
Now let’s try something else…
![Page 54: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/54.jpg)
Set the coalesce timer at 0
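The slides do not say which tool was used to change the timer. On Linux, one common way is `ethtool -C <interface> rx-usecs <N>`, provided the NIC driver supports it; the Python sketch below only wraps that command. The interface name `eth0` and the helper names are placeholders, and actually applying the setting needs root.

```python
import subprocess


def coalesce_command(interface, rx_usecs):
    """Build (but do not run) an ethtool command for the RX coalescing timer."""
    if rx_usecs < 0:
        raise ValueError("rx_usecs must be non-negative")
    return ["ethtool", "-C", interface, "rx-usecs", str(rx_usecs)]


def set_coalesce_timer(interface, rx_usecs):
    """Apply the setting; assumes root and a driver that supports rx-usecs."""
    subprocess.run(coalesce_command(interface, rx_usecs), check=True)


if __name__ == "__main__":
    # "eth0" is a placeholder interface name, not one taken from the slides.
    print(" ".join(coalesce_command("eth0", 0)))
```

Running `ethtool -c <interface>` (lowercase) shows the current values, which is worth recording before changing anything.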
![Page 55: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/55.jpg)
New ping-pongs much faster!
1 port: ~10.5μs
2 ports: ~10.6μs
1 port: ~7.2ms
2 ports: ~5.5ms
![Page 56: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/56.jpg)
What are the tradeoffs?
Pros
• (Much) faster TCP latency …without changing the app!
• Faster speeds seem to scale up to large messages, too
• Great for low-latency, sparse comms apps
• Best for NICs that are dedicated to MPI comms
Cons
• May not scale well when an MPI process is running on every core
• Lots and lots of interrupts going to socket:0.core:0
• May need to run (N-1) MPI processes…?
  • May also want to avoid socket:0.core:0, or move IRQ affinity
![Page 57: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/57.jpg)
Your mileage may vary
![Page 58: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/58.jpg)
But it’s interesting, nonetheless!
• Some experimentation might be worth trying with real-world HPC apps:
  • Allow TCP to wholly utilize core 0 (i.e., run MPI processes only on cores 1-15)
  • Set the coalesce timer to something more than 0μs, but less than 125μs – there’s a whole spectrum with which to play
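For the first experiment, Linux expresses CPU sets as hexadecimal bitmasks, for example in `/proc/irq/<irq>/smp_affinity` or as the mask argument to `taskset`. The Python sketch below only computes such a mask; the helper name `affinity_mask` is made up for this example, and it assumes a machine small enough that the mask fits in one group (very large core counts use comma-separated 32-bit groups, which this does not produce).

```python
def affinity_mask(cores):
    """Return a hex CPU bitmask with one bit set per core number."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core
    return format(mask, "x")


if __name__ == "__main__":
    # Leave core 0 to interrupt handling; MPI processes get cores 1-15.
    print("MPI cores 1-15:", affinity_mask(range(1, 16)))   # fffe
    print("IRQs on core 0:", affinity_mask([0]))            # 1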
![Page 59: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/59.jpg)
My overall points:
• Many in HPC have Ethernot networks …but as HPC continues to commoditize itself, lots of HPC users have Ethernet-based environments
• Today’s Ethernet switches and NICs are actually quite a bit faster and more advanced than what we old-time-HPCers grew up with
• Even good ol’ TCP is amazingly fast and optimized today
• You may be able to tune your NIC and/or fabric to extract pretty darn good MPI TCP performance
The default settings on your Ethernet NIC / fabric are likely tuned for general TCP traffic, which yields very different performance characteristics than what HPC applications typically need
![Page 60: Ethernet and TCP optimizations](https://reader036.fdocuments.in/reader036/viewer/2022062405/555151c0b4c905e1708b45cc/html5/thumbnails/60.jpg)
Thank you.