Internetworking: Hardware/Software Interface
CS 213, LECTURE 16 – L.N. Bhuyan
Protocols: HW/SW Interface
• Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently
– Enabling technologies: SW standards that allow reliable communication without reliable networks
– A hierarchy of SW layers, each responsible for a portion of the overall communications task; such layered sets are called protocol families or protocol suites
• Transmission Control Protocol/Internet Protocol (TCP/IP)
– This protocol family is the basis of the Internet
– IP makes a best effort to deliver; TCP guarantees delivery
– TCP/IP is used even when communicating locally: NFS uses IP even across a homogeneous LAN
TCP/IP packet
• Application sends a message
• TCP breaks the message into segments of up to 64 KB and adds a 20 B header to each
• IP adds a 20 B header and sends the packet to the network
• If Ethernet, the data is broken into 1500 B packets with headers and trailers
• Headers and trailers carry fields such as length, destination, window number, version, ...
[Figure: encapsulation of TCP data (≤ 64 KB) within a TCP header, an IP header and IP data, and an Ethernet frame]
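As a rough check on the encapsulation overhead above, here is a minimal C sketch. It assumes the minimal 20 B TCP and IP headers and the 1500 B Ethernet MTU quoted on the slide (real stacks may add options), and computes the TCP payload per packet and the packet count for a 64 KB send.

/* Sketch: per-packet overhead for a TCP segment carried over Ethernet,
 * using the header sizes quoted on the slide (20 B TCP + 20 B IP,
 * 1500 B Ethernet MTU). Assumes no TCP/IP options or VLAN tags. */
#include <stdio.h>

#define ETH_MTU  1500   /* Ethernet payload limit (bytes) */
#define IP_HDR   20     /* minimal IPv4 header */
#define TCP_HDR  20     /* minimal TCP header */

int main(void)
{
    int mss  = ETH_MTU - IP_HDR - TCP_HDR;   /* TCP payload per packet */
    int msg  = 64 * 1024;                    /* 64 KB application write */
    int pkts = (msg + mss - 1) / mss;        /* packets on the wire */
    printf("MSS = %d bytes, %d packets for a 64 KB send\n", mss, pkts);
    return 0;
}

With these numbers the payload per packet (MSS) works out to 1460 bytes, so a 64 KB write becomes roughly 45 wire packets.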
Communicating with the Server: The O/S Wall
[Figure: path of a packet between the NIC, the PCI bus, the kernel, and the user-level application on the CPU]
Problems:
• O/S overhead to move a packet between the network and the application level => protocol stack (TCP/IP) processing
• O/S interrupts
• Data copying from kernel space to user space and vice versa
• Oh, the PCI bottleneck!
The Send/Receive Operation
• The application writes the transmit data to the TCP/IP sockets interface for transmission in payload sizes ranging from 4 KB to 64 KB.
• The data is copied from user space to kernel space.
• The OS segments the data into maximum transmission unit (MTU)–size packets, and then adds TCP/IP header information to each packet.
• The OS copies the data onto the network interface card (NIC) send queue.
• The NIC performs the direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts CPU activities to indicate completion of the transfer.
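For reference, a minimal sketch of the application's side of this path using the standard POSIX sockets API; the server address and port are placeholders, and everything below the send() call (the user-to-kernel copy, MTU segmentation, and the NIC DMA) happens inside the OS as described above.

/* Minimal sketch of the application's view of the send path: one send()
 * call hands a 64 KB payload to the TCP/IP socket layer. Address and
 * port are placeholders; error handling is minimal. */
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    static char payload[64 * 1024];            /* 64 KB application buffer */
    struct sockaddr_in srv = {0};

    int fd = socket(AF_INET, SOCK_STREAM, 0);  /* TCP socket */
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(9000);              /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* placeholder address */

    if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0)
        return 1;

    /* One call: the data is copied user -> kernel, segmented into
     * MTU-sized packets, and DMA'd to the NIC below this line. */
    ssize_t sent = send(fd, payload, sizeof payload, 0);
    printf("queued %zd bytes\n", sent);

    close(fd);
    return 0;
}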
Transmitting data across the memory bus using a standard NIC
http://www.dell.com/downloads/global/power/1q04-her.pdf
Timing Measurement in UDP Communication
X. Zhang, L. Bhuyan, and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005
I/O Acceleration Techniques
• TCP Offload: offload TCP/IP checksum and segmentation to interface hardware or a programmable device (e.g., TOEs)
– A TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.
• O/S Bypass: user-level software techniques to bypass the protocol stack
– Zero-copy protocols (need a programmable device in the NIC for direct user-level memory access, i.e., virtual-to-physical memory mapping; e.g., VIA)
• Architectural Techniques: instruction set optimization, multithreading, copy engines, onloading, prefetching, etc.
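As one illustration of removing the user/kernel copy on an ordinary Linux TCP socket (this is not the NIC-assisted VIA/RDMA scheme above, only the copy-avoidance idea), a hedged sketch using sendfile(2):

/* Sketch: avoiding the user-space copy on a stock Linux TCP socket with
 * sendfile(2). File pages go page cache -> NIC without passing through
 * user space. 'sock_fd' is assumed to be a connected TCP socket. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

ssize_t send_file_zero_copy(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t off = 0;
    /* Kernel moves data from the page cache to the socket directly. */
    ssize_t n = sendfile(sock_fd, file_fd, &off, st.st_size);

    close(file_fd);
    return n;
}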
Comparing standard TCP/IP and TOE enabled TCP/IP stacks
(http://www.dell.com/downloads/global/power/1q04-her.pdf)
Chelsio 10 Gbps TOE
Cluster (Network) of Workstations/PCs
Myrinet Interface Card
InfiniBand Interconnection
• Zero-copy mechanism. The zero-copy mechanism enables a user-level application to perform I/O on the InfiniBand fabric without being required to copy data between user space and kernel space.
• RDMA. RDMA facilitates transferring data from remote memory to local memory without the involvement of host CPUs.
• Reliable transport services. The InfiniBand architecture implements reliable transport services so the host CPU is not involved in protocol-processing tasks like segmentation, reassembly, NACK/ACK, etc.
• Virtual lanes. The InfiniBand architecture provides 16 virtual lanes (VLs) to multiplex independent logical channels onto the same physical link, including a dedicated VL for management operations.
• High link speeds. InfiniBand architecture defines three link speeds, which are characterized as 1X, 4X, and 12X, yielding data rates of 2.5 Gbps, 10 Gbps, and 30 Gbps, respectively.
Reprinted from Dell Power Solutions, October 2004, by Onur Celebioglu, Ramesh Rajagopalan, and Rizwan Ali.
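As an illustration of the zero-copy point above, the sketch below (assuming a libibverbs protection domain has already been allocated with ibv_alloc_pd(), and omitting error handling) registers a user buffer so the HCA can DMA to and from it directly without a kernel copy on the data path.

/* Sketch: pinning and registering a user buffer with libibverbs so the
 * InfiniBand HCA can DMA to/from it directly. The returned memory region
 * carries lkey/rkey values used to address the buffer in RDMA operations. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_rdma_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);       /* user-space application buffer */

    /* Register the buffer: the pages are pinned and the HCA can read and
     * write them directly, so no kernel copy is needed on the data path. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    return mr;
}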
InfiniBand system fabric
UDP Communication – Life of a Packet
X. Zhang, L. Bhuyan and W. Feng, “Anatomy of UDP and M-VIA for Cluster Communication” Journal of Parallel and Distributed Computing (JPDC), Special issue on Design and Performance of Networks for Super-, Cluster-, and Grid-Computing, Vol. 65, Issue 10, October 2005, pp. 1290-1298.
Timing Measurement in UDP Communication
X. Zhang, L. Bhuyan, and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005
Network Bandwidth is Increasing
[Figure: network bandwidth (Gbps) and CPU frequency (GHz) on a log scale from 0.01 to 1000, plotted over time from 1990 to 2010, with 10, 40, and 100 Gbps links arriving around 2006/7 and beyond]
• Network bandwidth outpaces Moore's Law.
• TCP processing requirement, rule of thumb: 1 GHz of CPU for 1 Gbps of bandwidth.
• The gap between the rate at which network applications can be processed and the fast-growing network bandwidth keeps increasing.
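As a worked example of that rule of thumb (an approximation, not a measured figure): a 10 Gbps link would call for on the order of 10 GHz of conventional protocol processing, several times what a single core of that era provided, which is the widening gap the figure illustrates.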
Profile of a Packet
[Figure: per-packet processing profile for a 1 KB receive, split between compute and memory time across system overheads, descriptor and header accesses, TCP processing, TCB accesses, IP processing, and memory copy]
Total average clocks per packet: ~21K. Effective bandwidth: 0.6 Gb/s (1 KB receive).
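As a rough consistency check, assuming a core clock near the 1.7 GHz Banias part cited later in these slides: 1.7 x 10^9 cycles/s divided by ~21,000 cycles/packet gives about 81,000 packets/s, and at 1 KB (8,192 bits) per receive that is roughly 0.66 Gb/s, in line with the 0.6 Gb/s effective bandwidth quoted above.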
Five Emerging Technologies
• Optimized Network Protocol Stack (ISSS+CODES, 2003)
• Cache Optimization (ISSS+CODES, 2003, ANCHOR, 2004)
• Network Stack Affinity Scheduling
• Direct Cache Access
• Lightweight Threading
• Memory Copy Engine (ICCD 2005 and IEEE TC)
Stack Optimizations (Instruction Count)
• Separate data & control paths
– TCP data-path focused
– Reduce # of conditionals
– NIC assist logic (L3/L4 stateless logic)
• Basic memory optimizations
– Cache-line aware data structures
– SW prefetches
• Optimized computation
– Standard compiler capability
Result: 3X reduction in instructions per packet
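The two basic memory optimizations above can be made concrete with a small C sketch (GCC extensions; the descriptor fields are illustrative, not taken from any particular NIC): a receive descriptor padded to one 64-byte cache line, plus a software prefetch of the next descriptor while the current packet is still being processed.

/* Sketch: cache-line aware descriptor layout and SW prefetch.
 * Field names are hypothetical placeholders. */
#include <stdint.h>

#define CACHE_LINE 64

struct rx_desc {
    uint64_t buf_addr;                 /* DMA address of the packet buffer */
    uint32_t len;                      /* bytes received */
    uint32_t status;                   /* done / error bits */
    uint8_t  pad[CACHE_LINE - 16];     /* keep one descriptor per cache line */
} __attribute__((aligned(CACHE_LINE)));

void process_ring(struct rx_desc *ring, int n)
{
    for (int i = 0; i < n; i++) {
        /* Prefetch the next descriptor so its cache miss overlaps with
         * processing of the current packet. */
        __builtin_prefetch(&ring[i + 1], 0, 3);
        /* ... TCP/IP processing of ring[i] would go here ... */
        (void)ring[i].status;
    }
}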
Network Stack Affinity
[Figure: multi-core CPUs with their memories, the chipset, and the I/O interfaces, with a group of cores dedicated to network I/O; Intel calls this onloading]
• Assigns network I/O workloads to designated devices; separates network I/O from application work
• Reduces scheduling overheads
• More efficient cache utilization
• Increases pipeline efficiency
Direct Cache Access (DCA)
[Figure: normal DMA writes vs. Direct Cache Access, showing the CPU, cache, memory controller, memory, and NIC]
• Normal DMA write: Step 1, DMA write; Step 2, snoop invalidate; Step 3, memory write; Step 4, CPU read (from memory).
• Direct Cache Access: Step 1, DMA write; Step 2, cache update; Step 3, CPU read (from the cache).
• Eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
Lightweight Threading
[Figure: a single-core pipeline with a single hardware context, where a thread manager switches between S/W-controlled thread 1 and thread 2]
• On a memory informing event (e.g., a cache miss), the thread manager switches to the other S/W-controlled thread, so the single pipeline continues computing in the shadow of the cache miss.
• Builds on helper threads; reduces CPU stall.
Potential Efficiencies (10X)
• On-CPU, multi-gigabit, line-speed network I/O is possible.
[Chart panels: benefits of affinity; benefits of architectural techniques]
Greg Regnier, et al., "TCP Onloading for Data Center Servers," IEEE Computer, vol. 37, Nov. 2004
I/O Acceleration – Problem Magnitude
[Chart: I/O processing overheads, shown as cycles per operation on a 1.7 GHz Banias core (0 to 50,000) and as achievable data rate in Gbps (0 to 4), for TCP (original), TCP (optimized), iSCSI, SSL, and XML protocol processing]
• Networking (TCP original and optimized): memory copies and effects of streaming
• Storage over IP (iSCSI): CRCs
• Security (SSL): crypto
• Services (XML): parsing, tree construction
I/O processing rates are significantly limited by the CPU in the face of data movement and transformation operations.