One Network Engineer - ausnog.net · 2015. 8. 25. · 844 million mobile daily active users on...
One Network Engineer
James Paussa, network infrastructure engineer
What Do We Do?
• A single engineer runs the network at a time
• Network monitoring and alarming
• Responsible for the entire production network
Facebook Scale
968 million daily active users on average
844 million mobile daily active users on average
1.31 billion mobile monthly active users
1.49 billion monthly active users
Machine to machine
Machine to user
Automation
What We Don’t Do
• On-site work (cleaning fibres, LCs, etc.)
• Working with remote hands
• Device deployment
Myths
• Automation fixes everything
• We’ve fixed everything
• Doesn’t apply
Link Imbalance
[Chart: Interface Utilization, scale 0 to 100]
srsly wtf.
Link Imbalance
• Intercluster links only
• >80% utilisation
• Migratory
What Now?
• Detection
• Mitigation
Imbalance Detection
• Aggregated interfaces
• Member statistics
• Compare utilisation
• Check fails: notify oncall
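The detection flow above (aggregated interfaces, member statistics, compare utilisation, notify oncall) can be sketched roughly as follows. This is an illustration only, not the tooling described in the talk: the function names, the input shape (bundle name mapped to per-member utilisation percentages), the 20-point spread threshold and the sample data are all assumptions.

```python
from typing import Dict, List

# Hypothetical threshold: largest tolerated gap, in percentage points,
# between the busiest and the idlest member of one aggregated interface.
MAX_SPREAD_PCT = 20.0


def find_imbalanced_bundles(
    member_util: Dict[str, List[float]],
    max_spread_pct: float = MAX_SPREAD_PCT,
) -> List[str]:
    """Return the bundles whose member utilisation spread exceeds the threshold."""
    flagged = []
    for bundle, utils in member_util.items():
        if len(utils) < 2:
            continue  # a single member cannot be imbalanced
        if max(utils) - min(utils) > max_spread_pct:
            flagged.append(bundle)
    return flagged


def notify_oncall(bundles: List[str]) -> str:
    """Stand-in for a real paging hook: it only builds the alert text."""
    return "link imbalance on: " + ", ".join(sorted(bundles))


# Made-up sample data: utilisation (%) of each member link.
sample = {
    "ae0": [48.0, 52.0, 50.0, 49.0],
    "ae1": [95.0, 30.0, 28.0, 31.0],
}
flagged = find_imbalanced_bundles(sample)
if flagged:
    print(notify_oncall(flagged))
```

Comparing the spread between the busiest and idlest member is only one possible check; a ratio against the bundle mean would work equally well for this sketch.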
Imbalance Mitigation
• Hash key: SIP, DIP, SPORT, DPORT
• Roll hash
• Detect → FBAR → Roll Hash → Resolved
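To show why rolling the hash can move traffic, here is a small sketch of flow-to-member selection keyed on SIP, DIP, SPORT and DPORT. It assumes that "rolling" means changing a seed mixed into the hash; real devices use their own hardware hash functions, so the SHA-256 based `pick_member`, the seed values and the synthetic flows below are purely illustrative.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

Flow = Tuple[str, str, int, int]  # (SIP, DIP, SPORT, DPORT)


def pick_member(flow: Flow, seed: int, n_members: int) -> int:
    """Map a flow to a member link index using a seeded hash (illustrative only)."""
    sip, dip, sport, dport = flow
    key = f"{seed}|{sip}|{dip}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_members


def distribution(flows: List[Flow], seed: int, n_members: int) -> List[int]:
    """Count how many flows land on each member link for a given seed."""
    counts = Counter(pick_member(f, seed, n_members) for f in flows)
    return [counts.get(i, 0) for i in range(n_members)]


# Synthetic flows, invented for the example.
flows = [
    (f"10.0.0.{i % 50}", f"10.1.0.{i % 7}", 32768 + i, 11211)
    for i in range(1000)
]
before = distribution(flows, seed=1, n_members=4)
after = distribution(flows, seed=2, n_members=4)  # "roll": same flows, new seed
print("seed 1:", before)
print("seed 2:", after)
```

Each flow always maps to the same member for a given seed, so packet order within a flow is preserved; changing the seed reshuffles which flows share a link.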
Link Imbalance
ZOMG!! WTF!! NO, NO, NO! &(#$&*(
[Chart: RX Window, during]
[Chart: Open TCP connections, during and after]
It Isn’t (always) The Network
[Diagram: Cache, DB]
Link Imbalance - Lessons Learned
• Resolved issues
• Root Cause
• Software helps
• Service owner identified
• Resolution time
• Small loss, significant impact
3 Months In A Leaky Boat
!Capacity
Health
Errors
Memory Issues
DIP      Loss  Latency
1.1.1.1  0.1   10
2.2.2.2  0     10
[Diagram: groups of servers]
Detection
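The DIP / Loss / Latency rows suggest per-destination probing, with any destination that shows loss being flagged. A minimal sketch of that idea follows; the `ProbeResult` type, the zero-loss threshold and the reading of the Loss column as a percentage are assumptions, since the slide gives no units.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    dip: str          # destination IP that was probed
    loss_pct: float   # measured loss (assumed to be a percentage)
    latency: float    # measured latency (unit not given on the slide)


def lossy_destinations(
    results: List[ProbeResult], max_loss_pct: float = 0.0
) -> List[str]:
    """Return destination IPs whose loss exceeds the (hypothetical) threshold."""
    return [r.dip for r in results if r.loss_pct > max_loss_pct]


# The two rows shown on the slide.
rows = [
    ProbeResult("1.1.1.1", 0.1, 10),
    ProbeResult("2.2.2.2", 0, 10),
]
print(lossy_destinations(rows))
```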
Loss Effects on Throughput
[Chart: throughput (Mbps, 0 to 1000) against packet loss % (0.00000% to 1.00000%)]
[Chart: throughput (Mbps, 0 to 350) against packet loss % (0.00000% to 1.00000%), with RTT annotated]
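One way to get a feel for why tiny loss rates matter is the well-known Mathis approximation for steady-state TCP Reno throughput, rate ≈ (MSS / RTT) · (C / √p) with C = √(3/2). This is a textbook model swapped in for illustration, not the data behind the charts; the 1460-byte MSS, 25 ms RTT and 1000 Mbps link cap are assumed values.

```python
import math


def mathis_throughput_mbps(
    mss_bytes: int, rtt_s: float, loss: float, link_mbps: float
) -> float:
    """Approximate steady-state TCP Reno throughput, capped at the link rate.

    `loss` is a probability (0.01 means 1% packet loss).
    """
    if loss <= 0:
        return link_mbps  # the model has no loss-imposed limit at zero loss
    c = math.sqrt(3.0 / 2.0)
    bps = (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss))
    return min(link_mbps, bps / 1e6)


# Same loss percentages as the chart's x-axis; other inputs are assumptions.
for loss_pct in (0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0):
    mbps = mathis_throughput_mbps(1460, 0.025, loss_pct / 100, 1000)
    print(f"{loss_pct:.5f}% loss -> {mbps:8.1f} Mbps")
```

Because the estimate scales with 1/RTT and 1/√p, the same loss rate hurts long-RTT paths far more than short ones.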
Different algos?
Reno, Cubic, Vegas, Illinois
Recovery time
[Chart: throughput (Mbps, 0 to 350) over time (0 to 120 sec), labelled 1% Packet Loss, for Reno, Cubic, Vegas and Illinois]
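As a rough guide to why recovery can take tens of seconds, the sketch below uses the idealised Reno rule: after a loss the window halves, then grows by one segment per RTT, so regaining the full window W takes about W/2 round trips. It models Reno only (Cubic, Vegas and Illinois grow their windows differently), and the 300 Mbps rate, the RTT values and the 1460-byte MSS are assumptions.

```python
def reno_recovery_seconds(
    link_mbps: float, rtt_s: float, mss_bytes: int = 1460
) -> float:
    """Time for idealised Reno congestion avoidance to regrow from W/2 to W."""
    # Full window in packets: bandwidth-delay product divided by segment size.
    window_pkts = (link_mbps * 1e6 / 8) * rtt_s / mss_bytes
    # One extra packet per RTT, starting from half the window.
    return (window_pkts / 2) * rtt_s


for rtt_ms in (1, 25, 100):
    secs = reno_recovery_seconds(300, rtt_ms / 1000)
    print(f"{rtt_ms:>3} ms RTT -> {secs:.2f} s to recover")
```

Recovery time in this model grows with the square of the RTT, since both the window size and the time per growth step scale with it.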
So, wait, how does this apply to me?
Alarms
[Chart: alarm count, 0 to 70k]
70k!
!Capacity
Health
Errors
Interface Issues
Alarms Now
[Chart: alarm count, 0 to 70]
<100 (-99.99%)
Automation - One Month
3.37b  0.99%
750k  99.6%
Why?
Sleep
Why?
750,000 * 2 = 1,500,000
1,500,000 / 60 = 25,000
25,000 / 160 = 150+
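The arithmetic on this slide can be replayed step by step. The slide does not label its units, so the reading below (2 minutes of manual work per event, 60 minutes per hour, 160 working hours per person per month) is an assumption, as are the constant names.

```python
EVENTS_PER_MONTH = 750_000     # figure from the slide
MINUTES_PER_EVENT = 2          # assumed meaning of the "* 2"
MINUTES_PER_HOUR = 60
WORK_HOURS_PER_MONTH = 160     # assumed meaning of the "/ 160"

minutes = EVENTS_PER_MONTH * MINUTES_PER_EVENT
hours = minutes / MINUTES_PER_HOUR
engineers = hours / WORK_HOURS_PER_MONTH

print(f"{minutes:,} minutes = {hours:,.0f} hours = {engineers:.2f} engineer-months")
```

Under that reading, the "150+" on the slide is the number of full-time engineers it would take to handle the same events by hand.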
Lessons Learned & Take Aways
Use what you’ve got
Prototype early, iterate often
Duct tape keeps things running
Spend the time to root cause issues
The sooner the robots take over the better
Why?
?