The network is Reliable An informal survey of real-world communications failures BY PETER BAILIS AND KYLE KINGSBURY

CONTENTS

• Abstract

• Various survey reports of network reliability under different circumstances

• Conclusion

ABSTRACT

• “The network is reliable” is one of the fallacies of distributed computing.

• The degree of network reliability is critical for systems to function robustly.

• It is hard to determine the degree of network reliability.

VARIOUS SURVEY REPORTS OF NETWORK RELIABILITY UNDER DIFFERENT CIRCUMSTANCES

LARGE DEPLOYMENTS & ISSUES

• What are large deployments? A large deployment is a distributed system that runs globally on distributed infrastructure with hundreds of thousands of servers.

• What issue is considered serious in large deployments?

Partitions: A network partition is a split of the network into groups of nodes that cannot communicate with each other, typically caused by the failure of a network device.
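To make the definition concrete, here is a minimal sketch that groups nodes by which peers they can still reach. The node names, the link list, and the `find_partitions` helper are all hypothetical, chosen only for illustration; more than one group in the result means the network is partitioned.

```python
from collections import defaultdict, deque

def find_partitions(nodes, working_links):
    """Group nodes into sets whose members can still reach each other."""
    neighbors = defaultdict(set)
    for a, b in working_links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    partitions = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for peer in neighbors[node]:
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
        partitions.append(group)
    return partitions

# Hypothetical four-node network in which the only link between
# {"a", "b"} and {"c", "d"} has failed.
nodes = ["a", "b", "c", "d"]
working_links = [("a", "b"), ("c", "d")]
groups = find_partitions(nodes, working_links)
print(len(groups), "group(s):", groups)
```

In a real deployment no single node has this global view of the links, which is part of why partitions are hard to detect from the inside.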

LARGE DEPLOYMENTS & ISSUES (CONTD.): EXAMPLES

BEHAVIOR OF NETWORK FAILURES IN MICROSOFT DATACENTERS

• Average failure rate: 5.2 devices/day and 40.8 links/day.

• Median loss of about 59,000 packets per failure.

• Median time to repair is approximately five minutes.

• Redundancy improves median traffic by 43%.

[Figure: bar chart of per-day failures for devices and links]
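As a rough back-of-the-envelope illustration, the daily rates on this slide can be scaled to a year. The sketch below assumes the rates stay constant over 365 days, which the slide does not claim; the variable names are ours.

```python
# Daily failure rates taken from the slide.
DEVICE_FAILURES_PER_DAY = 5.2
LINK_FAILURES_PER_DAY = 40.8

# Assumption for illustration only: the rates are constant all year.
DAYS_PER_YEAR = 365
MINUTES_PER_DAY = 24 * 60

device_failures_per_year = DEVICE_FAILURES_PER_DAY * DAYS_PER_YEAR
link_failures_per_year = LINK_FAILURES_PER_DAY * DAYS_PER_YEAR

# Average gap between any two failures (device or link).
failures_per_day = DEVICE_FAILURES_PER_DAY + LINK_FAILURES_PER_DAY
minutes_between_failures = MINUTES_PER_DAY / failures_per_day

print(round(device_failures_per_year), "device failures per year")
print(round(link_failures_per_year), "link failures per year")
print(round(minutes_between_failures, 1), "minutes between failures on average")
```

Under that assumption this comes to roughly 1,900 device failures and 14,900 link failures per year, or one failure about every half hour.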

LARGE DEPLOYMENTS & ISSUES (CONTD.): EXAMPLES

NETWORK FAILURES IN HP’S MANAGED NETWORKS

Analysis of support ticket data:

• Connectivity-related tickets accounted for 11.4% of support tickets.

• 14% of those were of the highest priority level.

• Median incident duration was 2 hours 45 minutes for the highest-priority tickets and 4 hours 18 minutes for all tickets.

[Figure: bar chart of trouble tickets, connectivity-related vs. high priority]
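The two percentages on this slide are nested, so it can help to work out what share of all tickets they imply. A small sketch of that arithmetic, using only the slide's figures (the variable names are ours):

```python
# Figures from the slide.
connectivity_share = 0.114            # share of all support tickets
highest_priority_within_conn = 0.14   # share of connectivity tickets

# Share of ALL tickets that are both connectivity-related and highest priority.
highest_priority_conn_share = connectivity_share * highest_priority_within_conn

# Median incident durations from the slide, converted to minutes.
median_highest_priority_minutes = 2 * 60 + 45
median_all_tickets_minutes = 4 * 60 + 18

print(f"{highest_priority_conn_share:.2%} of all tickets")
print(median_highest_priority_minutes, "vs", median_all_tickets_minutes, "minutes")
```

So about 1.6% of all tickets were highest-priority connectivity incidents, with a median duration of 165 minutes.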

LARGE DEPLOYMENTS & ISSUES (CONTD.): EXAMPLES

THE FIRST YEAR OF A NEW GOOGLE CLUSTER INVOLVES

• Five faulty racks (40–80 machines seeing 50% packet loss)

• Eight network maintenance events (four of which might cause 30-minute random connectivity losses)

• Three router failures (traffic has to be pulled immediately for an hour)

LARGE DEPLOYMENTS & ISSUES (CONTD.)

How do these companies try to cope with network partitions?

Google (per Dean): “easy-to-use” abstractions

PNUTS: weaker consistency alternatives

DATACENTER NETWORK FAILURES

[Image: a Google datacenter]

Main causes of failures:

1) Power failure

2) Misconfiguration

3) Firmware bugs

4) Topology changes

5) Cable damage

6) Malicious traffic

CLOUD NETWORKS

What are cloud networks?

Key issues:

1) Transient latency

2) Dropped packets

3) Full network partitions
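One common way for a client to ride out the first two issues is to retry with exponential backoff. The following is a minimal sketch under our own assumptions: `call_with_retries`, `flaky_request`, and the retry parameters are hypothetical and not taken from the survey.

```python
import random
import time

def call_with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on timeouts and connection errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # Out of retries: surface the failure to the caller.
            # Exponential backoff with jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))

# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated dropped packet")
    return "ok"

# The sleep function is stubbed out so the demonstration runs instantly.
result = call_with_retries(flaky_request, sleep=lambda seconds: None)
print(result, "after", calls["count"], "calls")
```

Retries only help with transient problems. Under a full partition every attempt fails, so the caller still has to decide what to do when the final exception is raised.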

CLOUD NETWORKS (CONTD.)

What happens when two nodes are both connected to the Internet but unable to see each other?

What can we learn from this case?

HOST PROVIDERS

Can hosting providers offer reliable networks?

E.g., Freistil IT: a specific datacenter had 50%–100% packet loss, which led the GlusterFS distributed file system to enter an undetected split-brain.

Why?

What is the main issue?
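A general safeguard against split-brain is a majority quorum rule: a node accepts writes only while it can reach more than half of the cluster, so two disjoint sides can never both accept writes. The sketch below shows that generic rule only. It is not a description of how GlusterFS is configured or behaves, and the function and node names are hypothetical.

```python
def has_quorum(reachable_members, cluster_size):
    """True if this side of a partition holds a strict majority."""
    return len(reachable_members) > cluster_size // 2

def may_accept_writes(node, reachable_peers, cluster_size):
    # A node always counts itself plus the peers it can currently reach.
    members = set(reachable_peers) | {node}
    return has_quorum(members, cluster_size)

# Hypothetical 5-node cluster split into a 3-node side and a 2-node side.
majority_side = may_accept_writes("n1", {"n2", "n3"}, cluster_size=5)
minority_side = may_accept_writes("n4", {"n5"}, cluster_size=5)
print(majority_side, minority_side)

# With an even split of a 4-node cluster, neither side has a majority.
even_split = may_accept_writes("n1", {"n2"}, cluster_size=4)
print(even_split)
```

The trade-off is availability: the minority side, or both sides of an even split, must refuse writes until the partition heals.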

WIDE AREA NETWORKS (WAN)

• Why are WAN failures particularly interesting?

• Example: CENIC, average partition duration (over 5 years): 6 minutes for software-related failures, 8.2 hours for hardware-related failures

Conclusion: Graceful degradation under partition or increased latency is especially important for WANs.
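One simple form of graceful degradation is to serve a recent cached value, flagged as stale, when the remote side cannot be reached. This is a minimal sketch under our own assumptions: the `DegradingReader` class, the five-minute staleness bound, and the simulated remote are hypothetical, not something the survey prescribes.

```python
import time

class DegradingReader:
    """Reads from a remote source, falling back to a bounded-staleness cache."""

    def __init__(self, fetch_remote, max_staleness_seconds=300.0,
                 clock=time.monotonic):
        self._fetch_remote = fetch_remote
        self._max_staleness = max_staleness_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, time the value was stored)

    def read(self, key):
        """Return (value, is_stale); prefer the remote copy."""
        try:
            value = self._fetch_remote(key)
        except (TimeoutError, ConnectionError):
            if key in self._cache:
                cached_value, stored_at = self._cache[key]
                if self._clock() - stored_at <= self._max_staleness:
                    return cached_value, True
            raise  # Nothing usable in the cache: report the failure.
        self._cache[key] = (value, self._clock())
        return value, False

# Hypothetical remote that can be switched off to simulate a WAN partition.
remote = {"up": True}

def fetch_remote(key):
    if not remote["up"]:
        raise ConnectionError("simulated WAN partition")
    return f"value-of-{key}"

reader = DegradingReader(fetch_remote)
fresh = reader.read("profile")
remote["up"] = False
degraded = reader.read("profile")
print(fresh, degraded)
```

Whether stale reads are acceptable depends on the application, so the staleness flag is returned to the caller rather than hidden.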

GLOBAL ROUTING FAILURES

• Can an Internet system with a high level of redundancy be safe?

1) Firewall configuration error: e.g., CloudFlare

2) Firmware bug: e.g., Juniper Networks

3) BGP misconfiguration: e.g., Pakistan Telecom

NICS AND DRIVERS

Firmware bug: a NIC problem

e.g., BCM5709 (chip model)

Misconfiguration: a driver problem

e.g., bnx2

APPLICATION-LEVEL FAILURES

What causes messages to be dropped or delayed?

1) Crashes

2) Program errors

3) Scheduler latency

4) Overloaded processes
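From the point of view of a remote peer, all four causes look the same: messages stop arriving on time. The sketch below shows a simple timeout-based failure detector to illustrate this. The `HeartbeatDetector` class, the two-second timeout, and the fake clock are hypothetical choices made for the example.

```python
import time

class HeartbeatDetector:
    """Suspects a peer once it has been silent for longer than a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}  # peer -> time of the most recent heartbeat

    def heartbeat(self, peer):
        self._last_seen[peer] = self._clock()

    def suspected(self, peer):
        """True if `peer` was never heard from or has been silent too long."""
        last = self._last_seen.get(peer)
        return last is None or self._clock() - last > self._timeout

# A fake clock makes the example deterministic.
now = [0.0]
detector = HeartbeatDetector(timeout_seconds=2.0, clock=lambda: now[0])

detector.heartbeat("worker-1")
now[0] = 1.0
alive = not detector.suspected("worker-1")

# Five seconds of silence: a crash, an overloaded process, and a network
# partition are indistinguishable to the detector.
now[0] = 5.0
suspected = detector.suspected("worker-1")
print(alive, suspected)
```

The detector cannot tell a dead process from a slow one, so a peer that is merely overloaded may be declared failed while it is still running.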

CONCLUSION

Where do communication failures occur?

• Processes

• Servers

• NICs, switches

• Local and wide area networks

• Etc.

CONCLUSION (CONTD.)

• Does a reliable network exist?

• It depends on:

1) Cautious engineering

2) Aggressive network advances

3) Lots of investment

CONCLUSION (CONTD.)

• What can we do? Consider the risk before a partition occurs.

QUESTION TIME!

THANK YOU FOR YOUR PATIENCE