La apuesta de Telefónica por la cloud privada
Uploaded by librecon
Introduction
• Showcase deployment of Red Hat OpenStack within the Digital Service Node (dSN) infrastructure in Telefónica I+D
• History & Motivations
• How we did it
• Problems we found
What is OpenStack?
• OSS for creating private and public clouds
• Manages large pools of resources
• Compute, Storage, Networking
• User-friendly Web interface, REST APIs, CLI
• Cloud services: LBaaS, DNSaaS, FWaaS, DBaaS (Trove), BigData-aaS (Sahara), etc.
What is OpenStack?
• Reuses lots of OSS projects:
• Linux, KVM, Open vSwitch, MySQL, MongoDB, Apache, memcached, Python, etc.
• Builds missing pieces as OSS:
• Keystone, Glance, Cinder, Ceilometer, Nova, Neutron, etc.
Releases
• Releases every 6 months
• Development release (unreleased):
• Kilo
• Currently supported releases:
• Juno (stable, security fixes, 2014.2), Icehouse (security fixes only, 2014.1)
Who makes OpenStack?
• Built by a thriving community of developers
• In collaboration with users
• Lots of contributors, like:
• AT&T, Canonical, Cisco, Dell, Ericsson, HP, IBM, Intel, Rackspace, Red Hat, NEC, NetApp, Novell, VMware, Yahoo!, etc.
Who uses OpenStack?
• Worldwide:
• Cisco, CERN, PayPal, Wells Fargo, Wikimedia Foundation, SWITCH, Canonical, IBM, Intel, Mirantis, Rackspace, HP, Red Hat, SUSE, etc.
• Spain:
• Telefónica I+D, BTACTIC SCCL, Spanish National Research Council, BBVA
• Source: http://www.openstack.org/user-stories/
Digital Service Node (dSN)
• Digital services from Telefónica deployed here
• Main datacenter in Madrid
• Most digital services from Telefónica were migrated to the dSN during 2013
• New digital services developed within Telefónica I+D are deployed in the dSN
Why OpenStack
• IaaS platform on top of existing infrastructure
• Previous bets on other technologies failed:
• Joyent, CloudStack, vCloud Director, Tashi, …
• Aligned with Telefónica’s Technological Plan:
• Open & Open Source
• Strong industry support
Why OpenStack
• Well suited for DevOps:
• Automated, repeatable deployments, CI & CD
• OpenStack API allows developers, testers, integrators, deployment engineers, etc. to deploy software in a consistent manner
• Development, integration, pre-production and production
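As a hedged illustration of what "deploy software in a consistent manner" can look like, the following minimal sketch builds the JSON body of a Compute API v2 "create server" request for each of the stages listed above. The naming scheme, the service name and the image and flavor references are assumptions made for this example; only the stage names come from the slide.

```python
import json

# The four stages named on the slide.
ENVIRONMENTS = ("development", "integration", "pre-production", "production")


def build_server_request(service, environment, image_ref, flavor_ref):
    """Return the JSON body for a Compute API v2 'create server' call.

    The body would be sent as POST /v2/{tenant_id}/servers together with a
    valid X-Auth-Token header; sending it is left out of this sketch.
    """
    if environment not in ENVIRONMENTS:
        raise ValueError("unknown environment: %s" % environment)
    body = {
        "server": {
            # Hypothetical naming scheme: <service>-<environment>.
            "name": "%s-%s" % (service, environment),
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
        }
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical service, image and flavor identifiers.
    for env in ENVIRONMENTS:
        print(build_server_request("portal", env, "image-id", "flavor-id"))
```

Because every stage goes through the same function, the only thing that differs between a test deployment and a production one is the data passed in.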
Why Red Hat OpenStack
• Red Hat is a trusted partner and provider
• Red Hat Enterprise Linux is the reference OS within Telefónica’s Technological Plan for production services
• Strong commercial and technical support from Red Hat
• Helps when meeting our SLAs
Why Red Hat OpenStack
• Maturity of OpenStack was in question:
• Telefónica relied on Red Hat’s support, technical know-how and expertise in Linux, OpenStack and OSS
• Red Hat Professional Services were key to deploying OpenStack within Telefónica’s dSN:
• Strong engineering and technical background
• Key contributor to OpenStack
• Backporting of fixes from newer releases
Initial requirements
• To have 2 separate OpenStack clusters in dSN:
• Pre-production: integration tests and load tests before digital services hit production
• Production: digital services in production
• Based on Red Hat OpenStack 4.0 (Havana)
• Meets reference architecture defined by Telefónica’s Cloud division
History
• PoC in June 2012: OpenStack vs. CloudStack
• First serious deployment in May 2013 using RHOS 3.0 (Grizzly)
• First production deployment in December 2013 using RHOS 3.0 (Grizzly)
• Second production deployment in February 2014 using RHOS 4.0 (Havana)
How we did it
• A team of several engineers from Telefónica I+D
• Plus professional services from Red Hat
• Analysed the Telefónica I+D requirements
• Created a deployment plan
• And executed it
Problems we faced
• Architecture
• Manual deployment
• Migration from Quantum to Neutron
• Manual workarounds and patches
dSN peculiarities
• Multiple external networks (Internet, Management and Internodos):
• Each external network requires a dedicated Neutron L3 agent
• Keystone integration with dSN OpenLDAP
• Several layers of firewalls
• Hardware load-balancers
• NetApp storage (exposes Cinder API and NFS)
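The "one L3 agent per external network" constraint above can be sketched as one `l3_agent.ini` per agent. This is a minimal sketch, not the configuration used in the dSN: the network IDs and bridge names are placeholders, and it assumes the commonly documented pattern in which each agent sets `gateway_external_network_id` and only one agent keeps `handle_internal_only_routers` enabled.

```python
import configparser
import io


def render_l3_agent_ini(external_network_id, bridge, handle_internal_only):
    """Render an l3_agent.ini for one agent bound to one external network."""
    parser = configparser.ConfigParser()
    parser["DEFAULT"] = {
        "gateway_external_network_id": external_network_id,
        "external_network_bridge": bridge,
        "handle_internal_only_routers": str(handle_internal_only),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def render_all(networks):
    """Return {network name: ini text}, one L3 agent per external network.

    Assumption: only the first agent handles routers without a gateway.
    """
    configs = {}
    for index, (name, net_id, bridge) in enumerate(networks):
        configs[name] = render_l3_agent_ini(net_id, bridge, index == 0)
    return configs


if __name__ == "__main__":
    # Placeholder IDs and hypothetical bridge names for the three networks.
    networks = [
        ("internet", "<internet-net-uuid>", "br-ex-internet"),
        ("management", "<management-net-uuid>", "br-ex-mgmt"),
        ("internodos", "<internodos-net-uuid>", "br-ex-internodos"),
    ]
    for name, text in render_all(networks).items():
        print("# l3_agent.ini for", name)
        print(text)
```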
Architecture
• Adapted and verified by Red Hat Professional Services to meet dSN peculiarities
• Reference HA architecture has changed since first deployment:
• Active-Active components not in Pacemaker
• Neutron is a SPoF and a bottleneck
High Availability
• Infrastructure is fully redundant:
• Firewalls, load-balancers, routers, etc.
• OpenStack API working in Active-Active mode
• MySQL, MongoDB, Qpid, Neutron, Ceilometer, Heat deployed in Active-Passive mode and managed by Pacemaker
Pacemaker
• Initial HA deployment was suboptimal:
• Only Active-Passive services in Pacemaker
• Complex set of dependencies and constraints
• Non-critical services could prevent critical ones from failing over automatically (e.g. MongoDB)
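To make the last point concrete, here is a toy model (plain Python, not the Pacemaker API) of how an ordering constraint propagates a failure. The constraint set is hypothetical and was chosen only to show a critical service ending up behind a non-critical one.

```python
def blocked_by(failed, requires):
    """Return the resources that cannot start while `failed` is down.

    `requires` maps each resource to the set of resources it must start
    after. A resource is blocked if it depends, directly or transitively,
    on the failed one.
    """
    blocked = set()
    changed = True
    while changed:
        changed = False
        for resource, deps in requires.items():
            if resource == failed or resource in blocked:
                continue
            if failed in deps or blocked & set(deps):
                blocked.add(resource)
                changed = True
    return blocked


if __name__ == "__main__":
    # Hypothetical constraints: Neutron was accidentally ordered after
    # Ceilometer, which in turn needs MongoDB.
    requires = {
        "mysql": set(),
        "mongoDB": set(),
        "ceilometer": {"mongoDB"},
        "neutron": {"mysql", "ceilometer"},
    }
    print(sorted(blocked_by("mongoDB", requires)))
```

Running the same check against the real constraint list before a deployment would be one way to spot critical services that depend on non-critical ones.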
Pacemaker’s failcount
• We didn’t know about it:
• Repeated start-up failures of a resource end up disabling it (failcount set to INFINITY)
• Learned about it the hard way:
# pcs resource failcount show mongoDB
Failcounts for mongoDB
10.26.238.227: INFINITY
# crm_failcount -r mongoDB -G
scope=status name=fail-count-mongoDB value=INFINITY
# crm_failcount -r mongoDB -U 10.26.238.227 -v 0
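A check like the one above is easy to automate. The following minimal sketch parses the `crm_failcount -G` output line shown on the slide and flags a failcount that has reached INFINITY. The function names are illustrative, and actually running the command and raising an alert are left out.

```python
import re

# Matches the output format shown on the slide, e.g.
# "scope=status name=fail-count-mongoDB value=INFINITY"
_FAILCOUNT_RE = re.compile(
    r"name=fail-count-(?P<resource>\S+)\s+value=(?P<value>\S+)"
)


def parse_failcount(line):
    """Return (resource, value) from one line of crm_failcount -G output."""
    match = _FAILCOUNT_RE.search(line)
    if match is None:
        raise ValueError("unrecognised failcount line: %r" % line)
    return match.group("resource"), match.group("value")


def reached_infinity(value):
    """True if the failcount value is INFINITY."""
    return value.strip().upper() == "INFINITY"


if __name__ == "__main__":
    sample = "scope=status name=fail-count-mongoDB value=INFINITY"
    resource, value = parse_failcount(sample)
    if reached_infinity(value):
        print("failcount for %s is INFINITY; reset it after fixing the cause"
              % resource)
```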
Deployment
• Automated installation via Satellite & Foreman:
• Unattended network-based installs (PXE, Kickstart)
• Integrates ISC DHCP, TFTP, Cobbler, Puppet, Pulp, Candlepin, etc.
• HA reference architecture not supported
• Custom Puppet classes developed specifically for installing OpenStack in Telefónica I+D
Compute nodes
• Completely automated
• Custom Puppet for deploying compute nodes:
• Satellite installs base OS over the network
• Foreman finishes configuration with Puppet
Controller nodes
• Mostly manual process
• Complex set up:
• SSL/TLS
• Active-Passive services using Pacemaker
• Active-Active services using HW LBs
Controller nodes
• Manual configuration:
• Configure DNS
• Generate SSL/TLS certificates
• Configure Qpid SSL certificate store
• Configure OpenLDAP schema for Keystone
• Configure NFS
• Configure VLANs
• Configure FWs and LBs
• Satellite PXE-installs base OS
• Pacemaker configuration
• Keystone configuration
• Create basic tenants and roles
• Glance, Neutron, Nova, Cinder, Ceilometer, Heat, Horizon
Quantum to Neutron
• Quantum (Grizzly) to Neutron (Havana)
• Neutron configured in Active-Passive
• Multiple L3 agents
• Increased instability and complexity
Neutron HA
• Failovers disrupt connectivity:
• Neutron nodes have different MAC addresses
• Unicast vs. Multicast (not used)
• Have to update ARP tables in hosts and switches
• Network namespaces not properly cleaned up
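Leftover namespaces can be spotted by comparing the output of `ip netns list` with the routers and networks Neutron still knows about. This is a minimal sketch under the assumption that the namespaces follow Neutron's usual `qrouter-<router id>` and `qdhcp-<network id>` naming. How the active IDs are obtained, and the actual clean-up, are out of scope here.

```python
def stale_namespaces(netns_output, active_router_ids, active_network_ids):
    """Return Neutron namespaces that no longer match an active object.

    `netns_output` is the text printed by `ip netns list`.
    """
    stale = []
    for line in netns_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Newer iproute2 versions append "(id: N)"; keep only the name.
        name = fields[0]
        if name.startswith("qrouter-"):
            if name[len("qrouter-"):] not in active_router_ids:
                stale.append(name)
        elif name.startswith("qdhcp-"):
            if name[len("qdhcp-"):] not in active_network_ids:
                stale.append(name)
    return stale


if __name__ == "__main__":
    # Hypothetical IDs, shortened for readability.
    output = "qrouter-aaa\nqrouter-bbb\nqdhcp-ccc\n"
    print(stale_namespaces(output, {"aaa"}, {"ccc"}))
```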
Kernel oopsen
• Kernel constantly logging oops messages due to Ethernet HW checksum problems
• Floods our logging facility
• Presumed incompatibility between kernel and our Cisco NIC drivers
• The workaround is to disable hardware checksum offloading (HW CKSUM)
Nov 2 03:06:41 esjc-ostt-cc05l kernel: qg-8b0190e0-cc: hw csum failure.
Nov 2 03:06:41 esjc-ostt-cc05l kernel: Pid: 0, comm: swapper Not tainted 2.6.32-504.el6.x86_64 #1
Nov 2 03:06:41 esjc-ostt-cc05l kernel: Call Trace:
Nov 2 03:06:41 esjc-ostt-cc05l kernel: <IRQ> [<ffffffff8145cd32>] ? netdev_rx_csum_fault+0x42/0x50
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81454660>] ? __skb_checksum_complete_head+0x60/0x70
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81454681>] ? __skb_checksum_complete+0x11/0x20
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff814df23d>] ? nf_ip_checksum+0x5d/0x130
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa0365f91>] ? udp_error+0xb1/0x1e0 [nf_conntrack]
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa027b652>] ? ovs_vport_send+0x22/0x90 [openvswitch]
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa0360028>] ? nf_conntrack_in+0x138/0xa00 [nf_conntrack]
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa027b6ee>] ? ovs_vport_receive+0x2e/0x30 [openvswitch]
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa01ce721>] ? ipv4_conntrack_in+0x21/0x30 [nf_conntrack_ipv4]
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81496410>] ? ip_rcv_finish+0x0/0x440
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81496410>] ? ip_rcv_finish+0x0/0x440
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81496ab4>] ? ip_rcv+0x264/0x350
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffffa027cb83>] ? ovs_netdev_frame_hook+0xb3/0x110 [openvswitch]
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8145cbca>] ? process_backlog+0x9a/0x100
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81462083>] ? net_rx_action+0x103/0x2f0
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff810b034a>] ? tick_program_event+0x2a/0x30
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8107d765>] ? irq_exit+0x85/0x90
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81533c0a>] ? smp_apic_timer_interrupt+0x4a/0x60
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Nov 2 03:06:41 esjc-ostt-cc05l kernel: <EOI> [<ffffffff812ea5ee>] ? intel_idle+0xde/0x170
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff812ea5d1>] ? intel_idle+0xc1/0x170
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81425b97>] ? cpuidle_idle_call+0xa7/0x140
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff8151061a>] ? rest_init+0x7a/0x80
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81c2af8f>] ? start_kernel+0x424/0x430
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81c2a33a>] ? x86_64_start_reservations+0x125/0x129
Nov 2 03:06:41 esjc-ostt-cc05l kernel: [<ffffffff81c2a453>] ? x86_64_start_kernel+0x115/0x124
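To gauge how badly the logs are being flooded, the "hw csum failure" lines can be counted per interface. The sketch below does that for syslog lines in the format shown above and also builds an `ethtool -K` command line for turning checksum offload off. Which interface to apply it to, and whether both `rx` and `tx` need disabling, are assumptions; the slide only says that hardware checksumming was disabled.

```python
import re
from collections import Counter

# Matches e.g. "... kernel: qg-8b0190e0-cc: hw csum failure."
_CSUM_RE = re.compile(r"kernel: (?P<device>\S+): hw csum failure")


def count_csum_failures(lines):
    """Count 'hw csum failure' kernel messages per network device."""
    counts = Counter()
    for line in lines:
        match = _CSUM_RE.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


def disable_offload_command(device):
    """Build an ethtool command that turns RX and TX checksum offload off.

    Assumption: both directions are disabled; adjust as needed.
    """
    return ["ethtool", "-K", device, "rx", "off", "tx", "off"]


if __name__ == "__main__":
    sample = [
        "Nov 2 03:06:41 esjc-ostt-cc05l kernel: qg-8b0190e0-cc: hw csum failure.",
        "Nov 2 03:06:41 esjc-ostt-cc05l kernel: Call Trace:",
    ]
    print(dict(count_csum_failures(sample)))
    # "eth0" is a hypothetical physical NIC name.
    print(" ".join(disable_offload_command("eth0")))
```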
Keystone vs. UTF-8
• Doesn’t cope well with UTF-8 out of the box
• Had to manually patch source code:
• Not yet integrated upstream
• Patch has to be reapplied after upgrades
/usr/lib/python2.6/site-packages/keystone/__init__.py:
 import sys
+import default_encoding_utf8
 import pkg_resources

 # If there is a conflicting non egg module,
 # i.e. an older standard system module installed,
 # then replace it with this requirement
 def replace_dist(requirement):
     …
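The slide does not show the contents of the `default_encoding_utf8` module. A plausible guess, and only a guess, is the well-known Python 2 idiom of re-exposing `sys.setdefaultencoding` and switching the default from ASCII to UTF-8 at import time. The function name below is invented for this sketch, which is written to be harmless on Python 3, where UTF-8 is already the default.

```python
import sys


def force_utf8_default():
    """Make UTF-8 the interpreter-wide default string encoding.

    On Python 2, site.py removes sys.setdefaultencoding at start-up, so the
    module has to be reloaded to get it back. On Python 3 nothing needs to
    change because the default is already UTF-8.
    """
    if sys.version_info[0] == 2:
        reload(sys)  # noqa: F821 (builtin on Python 2 only)
        sys.setdefaultencoding("utf-8")
    return sys.getdefaultencoding()


# Hypothetical: applied as an import side effect, which is why a bare
# "import default_encoding_utf8" line is enough in the patched file.
force_utf8_default()
```

Changing the interpreter-wide default is a blunt workaround, which may be one reason a patch of this kind would not be accepted upstream.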
MongoDB rc.d
• Script starts MongoDB
• Checks that it accepts queries by means of “mongostat”:
• But “mongostat” currently doesn’t work when authentication is mandatory
• Had to patch the rc.d script to comment out the invocation of “mongostat”
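Commenting out the check removes the only readiness test from the script. One possible substitute, sketched below under stated assumptions, is to poll the MongoDB TCP port until it accepts connections. This is weaker than the original check because it does not prove that queries succeed. The host, timeout and interval values are placeholders; 27017 is MongoDB's default port.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, else False.

    Retries every `interval` seconds until `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port("127.0.0.1", 27017, timeout=5.0):
        print("mongod is accepting connections")
    else:
        print("mongod did not come up in time")
```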
MongoDB rc.d
• Init script returns OK even if service is not fully operational:
• https://bugzilla.redhat.com/show_bug.cgi?id=1066408
• The latest erratum from Red Hat introduces a syntax error: it uses [ ] for a comparison instead of [[ ]]:
• https://bugzilla.redhat.com/show_bug.cgi?id=1158076
Poor SSL/TLS support
• At the time of our first deployments, SSL/TLS support was poor or missing
• Manual patches and workarounds deployed:
• /usr/lib/python2.6/site-packages/ceilometer/alarm/service.py
• /usr/lib/python2.6/site-packages/ceilometer/service.py
• /usr/lib/python2.6/site-packages/ceilometer/image/glance.py
• /usr/lib/python2.6/site-packages/neutron/agent/metadata/agent.py
Poor SSL/TLS support
• Updated package openstack-keystone, which failed to start when SSL was configured
• Updated package python-django-openstack-auth to allow user authentication when keystone is using SSL
• Updated package python-eventlet to allow access to Glance via HTTPS for several services
Some numbers
• 5 OpenStack private clouds:
• 2 production private clouds, interconnected over a backbone network plus 1 testbed environment
• 1 development private cloud plus 1 testbed environment
Some numbers
• First support case was opened on 2013/02/05
• Since then, we have opened 149 support cases:
• 30 support cases in 2013
• 119 support cases in 2014
• Support case does not necessarily mean “bug”:
• Used for RFE, consulting, advice, etc.
Improvements
• Converge with Red Hat HA reference architecture
• More coverage for our Puppet classes
• Less overlap between Satellite and Puppet
• Missing automated integration test suite:
• Rally, Tempest
Better Monitoring
• More visibility into network traffic (Neutron):
• Neutron API, Open vSwitch, netns, L3 agents, DHCP agents, metadata agent, etc.
• Better tools for tracing network packets
• Better instrumentation (latency, throughput)
How to test-drive OpenStack?
• Run it on your laptop or PC:
• DevStack, Packstack
• Run it on a cluster of machines:
• CentOS / Fedora, deploy with Foreman
• Ubuntu, deploy with MAAS, Juju