How to Increase Availability Using ExaBGP
SwiNOG #30 – November 4th, 2016
Gurtenpark, Berne
Your Speakers
André Keller
● System Engineer at VSHN AG
● AtrilA GmbH (2010 - 2014)
● Network Design GmbH (2005 - 2012)
@andrekeller_ch
https://github.com/andrekeller
Manuel Schweizer
● Network Engineer at cloudscale.ch AG
● Board Member at SwissIX Internet Exchange
@geitguet
About VSHN AG
● Owner-operated Swiss company since 2014
● Seventeen employees at the head office near Zurich main station
● Service provider for DevOps, software delivery automation and configuration management
● Partner for web operations/hosting of web applications
● Further specialty fields: Consulting, System Engineering, Continuous Delivery, Monitoring, Backup, 24/7 Support
About cloudscale.ch AG
● Swiss cloud service provider (IaaS, Linux)
● Infrastructure operated exclusively in Swiss data centers
● Based on OpenStack & Ceph
● Five employees in Zurich-Oerlikon
● Focus on self-service and simplicity
Agenda
1. Availability
2. Service IPs
3. ExaBGP Examples
4. Pitfalls
5. Outlook
Availability
aka: “How many 9's?”
Availability            Downtime per year   Downtime per month   Downtime per day
One nine (90%)          36.5 days           72 hours             2.4 hours
Two nines (99%)         3.65 days           7.2 hours            14.4 minutes
Three nines (99.9%)     8.76 hours          43.8 minutes         1.44 minutes
Four nines (99.99%)     52.56 minutes       4.38 minutes         8.64 seconds
Five nines (99.999%)    5.26 minutes        25.9 seconds         864 ms
● What is your SLA?
● How much revenue do you lose per …?
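The figures in the table come from simple arithmetic: the downtime budget is the length of the period multiplied by the unavailable fraction. The following is a minimal sketch only; the function name and the period lengths (365-day year, 30-day month) are assumptions made for illustration and are not taken from the slides.

```python
# Downtime budget for a given availability target.
# Assumption: a year has 365 days and a month has 30 days. Other conventions
# (for example an average month of 730 hours) give slightly different
# monthly values.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "day": 24 * 3600,
}


def allowed_downtime_seconds(availability_percent, period):
    """Return the downtime budget in seconds for one period."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return PERIOD_SECONDS[period] * unavailable_fraction


if __name__ == "__main__":
    for target in (90.0, 99.0, 99.9, 99.99, 99.999):
        budget = {p: allowed_downtime_seconds(target, p) for p in PERIOD_SECONDS}
        print("%7.3f%%  year: %10.1f s  month: %8.1f s  day: %7.3f s"
              % (target, budget["year"], budget["month"], budget["day"]))
```

Working in seconds keeps the comparison with timers and health-check intervals straightforward.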
Definition of Availability
Provider View
● Scheduled Maintenance
● Emergency Maintenance
● Force Majeure (DDoS?)
● Availability Zones
● “We are not to blame”
Customer View
● “Can I reach the service?”
Today's Goal
Provide a service that is available (almost) 100% of the time.
Not:
100% uptime of your server, VM, application etc.
Agenda
1. Availability
2. Service IPs
3. ExaBGP Examples
4. Pitfalls
5. Outlook
Service IPs Explained
Service IP: A.B.C.D/32
Completely different subnet!
Suggestion: “Anycast” subnet with /32 only
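The two rules on this slide (a /32 host route, taken from a subnet other than the one the hosts live in) can be checked mechanically. This is a hedged sketch using Python's standard ipaddress module; the function name is hypothetical, and the example addresses are the documentation prefixes used in the configuration examples of this talk.

```python
import ipaddress


def check_service_ip(service_ip, host_network):
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    service = ipaddress.ip_network(service_ip, strict=True)
    hosts = ipaddress.ip_network(host_network, strict=True)
    # A service IP should be a single host route (/32 for IPv4).
    if service.prefixlen != service.max_prefixlen:
        problems.append("service IP should be a host route (/32 for IPv4)")
    # It should not be taken from the subnet the hosts themselves live in.
    if service.overlaps(hosts):
        problems.append("service IP lies inside the host subnet")
    return problems


if __name__ == "__main__":
    print(check_service_ip("203.0.113.1/32", "192.0.2.0/24"))
    print(check_service_ip("192.0.2.50/32", "192.0.2.0/24"))
```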
Advantages
● High Availability
● IP Mobility
  (Diagram: before / after)
● Maintenance
● Anycast / Load-Balance
  – Local-pref
  – ECMP
Agenda
1. Availability
2. Service IPs
3. ExaBGP Examples
4. Pitfalls
5. Outlook
IP Mobility
IP Mobility - Network Config
/etc/network/interfaces.d/eth0.cfg
auto eth0
iface eth0 inet static
    address 192.0.2.100
    netmask 24
    gateway 192.0.2.1
/etc/network/interfaces.d/services.cfg
auto br0
iface br0 inet static
    address 203.0.113.1
    netmask 32
    bridge_ports none
IP Mobility - Unbound Config
/etc/unbound/unbound.conf.d/resolver.conf
server:
    interface: 127.0.0.1
    interface: ::1
    interface: 192.0.2.100
    interface: 203.0.113.1
    outgoing-interface: 192.0.2.100
    access-control: 127.0.0.0/8 allow_snoop
    access-control: ::1 allow_snoop
    access-control: 192.0.2.0/24 allow
IP Mobility - ExaBGP Config
/etc/exabgp/exabgp.conf
group default {
    router-id 192.0.2.100;
    local-as 65001;
    peer-as 65002;
    hold-time 30;
    group-updates yes;
    neighbor 192.0.2.1 {
        local-address 192.0.2.100;
        family { ipv4 unicast; }
        static {
            route 203.0.113.1/32 next-hop self;
        }
    }
    neighbor 192.0.2.2 { … };
}
Maintenance
Maintenance - Network Config
/etc/network/interfaces.d/eth0.cfg
auto eth0
iface eth0 inet static
    # address is 192.0.2.200 on dns2
    address 192.0.2.100
    netmask 24
    gateway 192.0.2.1
/etc/network/interfaces.d/services.cfg
auto br0 br1
iface br0 inet static
    address 203.0.113.1
    netmask 32
    bridge_ports none
iface br1 inet static
    address 203.0.113.2
    netmask 32
    bridge_ports none
Maintenance - Unbound Config
/etc/unbound/unbound.conf.d/resolver.conf
server:
    interface: 127.0.0.1
    interface: ::1
    interface: 192.0.2.100 # .200 on dns2
    interface: 203.0.113.1
    interface: 203.0.113.2
    outgoing-interface: 192.0.2.100
    access-control: 127.0.0.0/8 allow_snoop
    access-control: ::1 allow_snoop
    access-control: 192.0.2.0/24 allow
Maintenance - ExaBGP Config
/etc/exabgp/exabgp.conf
group default {
    router-id 192.0.2.100;
    local-as 65001;
    peer-as 65002;
    hold-time 30;
    group-updates yes;
    neighbor 192.0.2.1 {
        local-address 192.0.2.100;
        family { ipv4 unicast; }
        static {
            route 203.0.113.1/32 next-hop self;
            route 203.0.113.2/32 next-hop self;
        }
    }
    neighbor 192.0.2.2 { … };
}
Anycast
Agenda
1. Availability
2. Service IPs
3. ExaBGP Examples
4. Pitfalls
5. Outlook
Pitfalls (BGP)
● Next Hop Reachability
  – Next hop?
● eBGP Multihop
  – Peering with loopback?
● Redistribution
  – Be careful when redistributing into an IGP!
● Timer Tuning
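To make the timer-tuning point concrete: if a node dies without closing its BGP session, the router only removes the route once the hold timer expires, so the worst-case detection time equals the hold time (ignoring faster mechanisms such as link-down detection). RFC 4271 suggests a keepalive interval of one third of the hold time. The sketch below relates the hold-time of 30 seconds used in the configuration examples to a monthly downtime budget; the function names are hypothetical and the budget values are those of the availability table.

```python
def keepalive_interval(hold_time):
    """Keepalive interval suggested by RFC 4271: one third of the hold time."""
    return hold_time / 3.0


def worst_case_detection(hold_time):
    """A silently dead peer is only noticed when the hold timer expires."""
    return float(hold_time)


def failures_within_budget(hold_time, monthly_budget_seconds):
    """How many hold-timer expiries fit into a monthly downtime budget."""
    return int(monthly_budget_seconds // worst_case_detection(hold_time))


if __name__ == "__main__":
    hold_time = 30  # seconds, as in the ExaBGP examples
    print("keepalive every %.0f s" % keepalive_interval(hold_time))
    # Monthly budgets from the availability table, converted to seconds.
    for label, budget in (("four nines", 4.38 * 60), ("five nines", 25.9)):
        print("%s: %d undetected node failure(s) per month"
              % (label, failures_within_budget(hold_time, budget)))
```

Under these assumptions a single hold-timer expiry already exceeds the five-nines monthly budget, which is one reason to look at the timers.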
Pitfalls (Blackholing)
Ensure that you do not announce service IPs when the service is not running:
– Monitoring
– Conditional announce of routes in ExaBGP
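Pitfalls (Blackholing)
/usr/local/bin/tcp-healthcheck.py
#!/usr/bin/env python
"""Announce the service IP to ExaBGP only while the local service answers."""
import socket
from sys import stdout
from time import sleep

ROUTE = "route 203.0.113.1/32 next-hop self"


def is_alive():
    """Return True if a TCP connection to the local resolver succeeds."""
    s = socket.socket()
    s.settimeout(2)  # do not hang forever on an unresponsive service
    try:
        s.connect(("127.0.0.1", 53))
        return True
    except socket.error:
        return False
    finally:
        s.close()


while True:
    # ExaBGP reads API commands from this process's stdout, one per line.
    if is_alive():
        stdout.write("announce " + ROUTE + "\n")
    else:
        stdout.write("withdraw " + ROUTE + "\n")
    stdout.flush()
    sleep(5)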
Pitfalls (Blackholing)
/etc/exabgp/exabgp.conf
group default {
    router-id 192.0.2.100;
    local-as 65001;
    peer-as 65002;
    hold-time 30;
    group-updates yes;
    neighbor 192.0.2.1 {
        local-address 192.0.2.100;
        family { ipv4 unicast; }
    }
    neighbor 192.0.2.2 { … };
    process add-routes {
        run /usr/bin/python /usr/local/bin/tcp-healthcheck.py;
    }
}
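The health-check script shown earlier re-sends its command every five seconds regardless of whether anything changed. A possible variant, sketched here under assumptions, only writes to ExaBGP when the health state changes and handles several service IPs at once. The names commands_for and run, and the injectable out, sleep and rounds parameters, are hypothetical helpers introduced for illustration and testing; they are not part of ExaBGP.

```python
import sys
import time


def commands_for(previous, current, routes):
    """Return the ExaBGP API lines needed to move from one state to the next.

    previous and current are True (healthy), False (unhealthy) or
    None (unknown, used before the first check).
    """
    if current == previous:
        return []
    verb = "announce" if current else "withdraw"
    return ["%s route %s next-hop self" % (verb, route) for route in routes]


def run(check, routes, interval=5, out=sys.stdout, sleep=time.sleep, rounds=None):
    """Poll check() and emit commands on state changes only.

    rounds limits the number of iterations (None means loop forever).
    """
    previous = None  # unknown at start, so the first check always emits
    done = 0
    while rounds is None or done < rounds:
        current = bool(check())
        for line in commands_for(previous, current, routes):
            out.write(line + "\n")
        out.flush()
        previous = current
        sleep(interval)
        done += 1
```

A script using this sketch would call run(is_alive, ["203.0.113.1/32", "203.0.113.2/32"]) with a check function such as the TCP connect test from the health-check script.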
Pitfalls (Stateful Service)
● Ensure that, under normal conditions, traffic for a service IP is always routed to the same node
● Backends (database / session store) need to be reachable by all nodes
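Why both points matter can be illustrated with a toy model of ECMP. Routers typically hash fields of a flow to choose one of several equal-cost next hops, so a flow stays on one node only while the set of announcing nodes is stable. The sketch below is an illustration only: real routers use vendor-specific hash functions and fields, and the function name and the SHA-256 based hash are assumptions made here.

```python
import hashlib


def pick_node(flow, nodes):
    """Map a flow (src IP, src port, dst IP, dst port, protocol) to one node."""
    key = "|".join(str(field) for field in flow).encode("ascii")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it to an
    # index into the (sorted, hence order-independent) list of nodes.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return sorted(nodes)[index]


if __name__ == "__main__":
    flow = ("198.51.100.7", 40000, "203.0.113.1", 53, "tcp")
    both = ["192.0.2.100", "192.0.2.200"]
    print("with both nodes:", pick_node(flow, both))
    # After one node withdraws its route, every flow lands on the survivor,
    # which therefore needs access to the same backends.
    print("after withdraw: ", pick_node(flow, ["192.0.2.200"]))
```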
Agenda
1. Availability
2. Service IPs
3. ExaBGP Examples
4. Pitfalls
5. Outlook
Erco - ExaBGP Route Controller
● Erco: https://erco.xyz/
Questions
Appendix: Install ExaBGP
● Multiple possibilities:
  – Use distribution packages:
    ● apt-get / yum install exabgp
      – Outdated version
      – Not prepared to run multiple instances
    ● For Ubuntu we provide a PPA with latest version:
      – https://launchpad.net/~vshn/+archive/ubuntu/exabgp
      – Latest version
      – Installs multi-instance service (xenial-only) and a dedicated service user
  – Use PIP:
    ● pip install exabgp
      – Does not come with a start script
      – Separate update channel
Appendix: Install ExaBGP
● Multiple instance systemd service when installing via pip:
[Unit]
Description=exabgp Service, %i

[Service]
ExecStart=/usr/local/bin/exabgp /etc/exabgp/%i.conf
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=exabgp-%i
Type=simple

[Install]
WantedBy=multi-user.target