Software-based Failure Detection in Programmable
Network Interfaces
Yizheng Zhou, Vijay Lakamraju, Israel Koren, C.M. Krishna
Architecture and Real-Time Systems (ARTS) Lab
Department of Electrical & Computer Engineering
University of Massachusetts at Amherst
Introduction
Complex network interfaces
• Typical Ethernet controller: 10 thousand gates
• IXP1200: 5 million gates
Transient faults: a major reliability concern
• Neutrons from cosmic rays
• Alpha particles from packaging material
Software-based fault tolerance approaches
• Pros: Less expensive than
  • Custom hardware
  • Massive hardware redundancy
• Cons: Overhead
  • Performance degradation
  • Increased code size
Software-Based Failure Detection
Network interface failures
• Hardware failures
• Software failures
  • Corruption of the instructions and data of the Network Control Program (NCP) in the local memory
Requirements for failure detection of network interfaces
• Limited performance impact
  • Performance is critical for high-speed network interfaces
• Good failure coverage
Myrinet: An Example High-speed Network Interface
A cost-effective local area network technology
High bandwidth: ~2 Gb/s
Low latency: ~6.5 μs
Components in an example Myrinet LAN: [figure not reproduced]
Simplified Block Diagram of the Myrinet Network Interface
• Instruction-interpreting RISC processor
• DMA interface
• Link interface
• Fast local memory (SRAM)
Network Interface Failures
Transient faults in the form of random bit flips in the network interface
Failures observed:
• Unusually long latency
• Corrupted messages
• Corrupted control information
• DMA failures
• Send/Receive failures
• Network interface hangs
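The fault model on this slide, a random single-bit flip, can be sketched in a few lines. This is a minimal illustration only: the byte buffer stands in for a region of the interface's local memory, and the name `flip_random_bit` is hypothetical, not part of the Myrinet software or of the authors' fault injector.

```python
import random
from typing import Tuple

def flip_random_bit(memory: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one randomly chosen bit in place; return (byte_offset, bit_index)."""
    if not memory:
        raise ValueError("memory must not be empty")
    offset = rng.randrange(len(memory))  # which byte to corrupt
    bit = rng.randrange(8)               # which bit inside that byte
    memory[offset] ^= 1 << bit           # XOR toggles exactly one bit
    return offset, bit

# Hypothetical stand-in for a small region of the interface's SRAM.
sram = bytearray(16)
offset, bit = flip_random_bit(sram, random.Random(0))
```

Passing in a seeded `random.Random` keeps an injection campaign reproducible, which helps when a particular fault needs to be replayed.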
Failure Detection Strategy
Interface hangs
• Software watchdog timer
Other failures
• A useful observation: applications generally use only a small portion of the NCP
  • Directed Delivery: used for tightly-coupled systems, allows direct remote memory access
  • Normal Delivery: used for general systems, allows reliable ordered message delivery
  • Datagram Delivery: delivery is not guaranteed
• Adaptive Concurrent Self-Testing (ACST)
  • Tests only part of the NCP
  • Avoids testing & signaling benign faults
  • Can detect hardware & software failures
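The slides do not say how the software watchdog timer is built. One common design, sketched here as an assumption, is a heartbeat: the monitored program records a timestamp on every pass of its main loop, and a checker declares a hang if that timestamp grows too old. The class and method names are hypothetical, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time

class SoftwareWatchdog:
    """Declares a hang when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored program on each pass of its main loop."""
        self.last_kick = self.clock()

    def expired(self):
        """Return True if the monitored program appears to be hung."""
        return self.clock() - self.last_kick > self.timeout

# Usage with a fake clock (a one-element list holding the current time).
now = [0.0]
wd = SoftwareWatchdog(timeout=1.0, clock=lambda: now[0])
now[0] = 0.5
wd.kick()                      # heartbeat at t = 0.5
now[0] = 1.2
alive_check = wd.expired()     # 0.7 s since last heartbeat: not expired
now[0] = 2.0
hung_check = wd.expired()      # 1.5 s since last heartbeat: expired
```

A watchdog of this kind only catches hangs, which is why the slide pairs it with self-testing for the other failure types.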
Logical modules
Identify the “active” parts
• Logical module: the collection of all basic blocks that might participate in providing a service
• To test a logical module: trigger several requests/events so that control flow passes through all of its basic blocks
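Testing a logical module means choosing requests whose execution paths together touch every one of its basic blocks. The slides do not describe how those requests are chosen; the sketch below assumes a simple greedy selection over known per-request traces. The block ids, request names, and `select_test_requests` are all hypothetical.

```python
def select_test_requests(module_blocks, request_traces):
    """Greedily pick requests whose traces together cover every basic block.

    module_blocks: set of basic-block ids belonging to the logical module.
    request_traces: dict mapping request name -> set of block ids it executes.
    Returns the chosen request names in selection order.
    """
    uncovered = set(module_blocks)
    chosen = []
    while uncovered:
        # Pick the request that covers the most still-uncovered blocks.
        best = max(
            request_traces,
            key=lambda r: len(request_traces[r] & uncovered),
            default=None,
        )
        gained = request_traces[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError(f"blocks not reachable by any request: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical logical module with four basic blocks and three candidate requests.
blocks = {"b1", "b2", "b3", "b4"}
traces = {
    "send_small": {"b1", "b2"},
    "send_large": {"b1", "b3", "b4"},
    "send_error": {"b2"},
}
plan = select_test_requests(blocks, traces)
```

Greedy selection does not guarantee the smallest possible set of requests, but it keeps the number of test requests low, which matters when the tests run concurrently with normal traffic.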
Experimental Results: Failure Coverage
• Exhaustive fault injection into a single routine: send_chunk
• Exhaustive fault injection into special registers
• Random fault injection into the entire code segment

| Fault injection target | Coverage | No impact |
|---|---|---|
| Routine: send_chunk | 99.3% | 60.3% |
| Registers | 99.2% | 32.3% |
| Entire code segment | 95.6% | 93.9% |
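The two quantities reported for these experiments, coverage and the no-impact fraction, can be computed from per-injection outcomes as sketched below. The slides do not state the exact definition of coverage; this sketch assumes it is the share of detected failures among the injected faults that actually caused a failure. The outcome labels and sample data are made up for illustration and are not the experiment's data.

```python
from collections import Counter

def summarize(outcomes):
    """Return (no_impact_fraction, coverage) for a list of outcome labels.

    Each label is one of "no_impact", "detected", or "undetected".
    Coverage is computed only over faults that caused a failure
    (an assumed definition, see the lead-in).
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    failures = counts["detected"] + counts["undetected"]
    no_impact = counts["no_impact"] / total if total else 0.0
    coverage = counts["detected"] / failures if failures else 1.0
    return no_impact, coverage

# Made-up outcomes: 6 benign faults, 3 detected failures, 1 missed failure.
sample = ["no_impact"] * 6 + ["detected"] * 3 + ["undetected"] * 1
no_impact, coverage = summarize(sample)
```

Under this definition a high no-impact fraction does not inflate coverage, since benign faults are excluded from the denominator.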
Performance Impact
• The original Myrinet software: GM
• The modified Failure Detection GM: FDGM
• The MCP-level self-testing interval is set to 5 seconds
Performance Impact For Different Self-Testing Intervals
Message length is 2 KB
For the half-second interval
• Bandwidth is reduced by 3.4%
• Latency is increased by 1.6%
Conclusion
• The proposed ACST tests only active logical modules
• Failure coverage: over 95%
• No appreciable performance degradation
• Transparent to applications
• The basic idea is generic – applicable to other fast network interfaces