HA for OpenStack, from the control plane to...

HA for OpenStack, from the control plane to instancesTheory

Adam Spiers

Senior Software Engineer

aspiers@suse.com

Charles Wang

Software Engineer

cwang@suse.com

Agenda

● HA in a Typical OpenStack Cloud Today● When do we need HA for Compute Nodes?● Architectural Challenges● Solution in SUSE OpenStack Cloud

HA in OpenStack Today

Typical HA Control Plane● Automatic restart of controller services● Increases uptime of cloud

Services Cluster

Node 1 Node 2 Node 3

Orchestration

Keystone

GlanceNova

Dashboard

Cinder

Telemetry

Neutron

Pacemaker Cluster

Control Node 1 Node

DRBDPostgreSQL

RabbitMQ

KeystoneGlanceNova

DashboardCinder

Neutron

Database Cluster

Node 1 Node 2

DRBD or shared storage

Database

Message Queue

Under the Covers● Recommended by official HA guide

Services Cluster

Node 1 Node 2 Node 3

Corosync

Pacemaker

HAProxy

SOLVED

(mostly)

If Only the Control Plane is HA...

HA Cluster

Control node

Message queue

Database

Identity

Block storage

Networking

Dashboard

Compute

Compute node

libvirt

Compute node

nova-compute

libvirt

Compute node

nova-compute

libvirt

nova-compute

When is Compute HA important?

Addressing the White Elephant in the Room

Pets in thecloud?!

Pets vs Cattle● Pets are given names like mittens.mycompany.com

● Cattle are given names like vm0213.cloud.mycompany.com

● Each one is unique, lovingly hand-raised and cared for

● They are almost identical to other cattle

● When they get ill, you nurse them back to health

● When one gets ill, you get another one

What does that mean in practice?

● VM instances often stateful, with mission-critical data

● Stateless, or ephemeral (disposable) storage

● Needs automated recovery with data protection

● Already ideal for cloud … but automated recovery still needed!

● Service downtime when a pet dies

● Service resilient to instances dying

If compute node is hosting cattle …

… to handle failures at scale, we need to automatically restart VMs somehow.

If compute node is hosting pets …

… we have to resurrect very carefully in order to avoid any zombie pets!

Do we really need compute HA in OpenStack?

Why?● Compute HA needed for cattle as well as pets● Valid reasons for running pets in OpenStack

● Manageability benefits● Want to avoid multiple virtual estates● Too expensive to cloudify legacy workloads

Architectural Challenges

Configurability

Different cloud operators will want to support different SLAs with different workflows, e.g.● Protection for pets:

– per availability zone?

– per project?

– per pet?

• If nova-compute fails, VMs are still perfectly healthy but unmanageable

– Should they be automatically killed? Depends on the workload.

Compute Plane Needs to Scale

CERN datacenter © Torkild Retvedt CC-BY-SA 2.0

Full Mesh Clusters Don't Scale

Addressing Scalability

The obvious workarounds are ugly!● Multiple compute clusters introduce unwanted artificial boundaries● Clusters inside / between guest VM instances are not OS-agnostic, and

require cloud users to modify guest images (installing & configuring cluster software)

● Cloud is supposed to make things easier not harder!

Common Architecture

pacemaker

Recovery workflowcontroller

nova-api

Control plane

Compute plane

Recoverystate

database

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker

nova-api

Control plane

Compute plane

Recoverystate

database

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

Reliability Challenges pacemaker

nova-api

Control plane

Compute plane

Recoverystate

database

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

Reliability Challenges

Compute HA in SUSE OpenStack Cloud

NovaCompute / NovaEvacuate OCF Agents

pacemaker

nova-api

Control plane

Compute plane

Recoverystate

database

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

NovaEvacuate

nova-api

Control plane

Compute plane

PacemakerCIB

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute NovaCompute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

OCF Resource Agents

fence_compute

NovaCompute / NovaEvacuate OCF Agents

Pros● Ready for production now● Commercially supported by SUSE● RAs upstream in openstack-resource-agents repo

Cons● Known limitations (known bugs):

– Only handles failure of compute node, not of VMs, or nova-compute

– Some corner cases still problematic, e.g. if nova fails during recovery

Brief Interlude: nova evacuate

nova’s Recovery API

pacemaker

nova-api

Control plane

Compute plane

Recoverystate

database

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

pacemaker_remote

VM 1 VM 2 VM 3

libvirt

nova-compute

Public Health Warning

nova evacuate does not really mean evacuation!

Think About Natural Disasters

Not too late to evacuate

Too late to evacuate

nova Terminology

nova live-migration

nova evacuate ?!

Public Health Warning

● In Vancouver, nova developers considered a rename● Has not happened yet● Due to impact, seems unlikely to happen any time soon

Whenever you see “evacuate” in a nova-related context, pretend you saw “resurrect”

Shared Storage

Where can we have Shared Storage?

Two key areas:● /var/lib/glance/images on controller nodes● /var/lib/nova/instances on compute nodes

When do we need Shared Storage?

● If /var/lib/nova/instances is shared:● VM's ephemeral disk will be preserved during recovery

● Otherwise:● VM disk will be lost● recovery will need to rebuild VM from image

● Either way, /var/lib/glance/images should be shared across all controllers (unless using Swift / Ceph)● otherwise nova might fail to retrieve image from glance

Questions?

HA for OpenStack, from the control plane to...

Documents

Transcript of HA for OpenStack, from the control plane to...

Red Hat Enterprise Linux OpenStack Platform 7 - VM Instance HA Architecture

Automated Deployment of an HA OpenStack Cloud · Automated Deployment of an HA OpenStack Cloud with SUSE ® Cloud HO7695 Adam Spiers Senior Software Engineer aspiers@suse.com Vincent

Virtual CPE Framework-1 - Altencalsoft€¦ · Control Plane Orchestration † Virtual Machines (VMs) managed by OpenStack † Linux Containers (LXCs) managed by Docker, Kubernetes

Openstack Ha Guide Trunk

Chef cookbooks for OpenStack HA

Comparison of control plane deployment architectures in ......OpenStack and non OpenStack management functions share the same HW resources Compressed Data Plane Hyperconverged Compute

HA in OpenStack service - meetup #9

MGT2609BU VMware Integrated OpenStack 4.0: or …€™s New in VMware Integrated OpenStack 4.0 17 VM Resize, OMS LVM, vCenter HA, vRA integration ... Persistent storage, LBaaS and

BARBICAN NOVA CHEF OPENSTACK OPENSTACK CHARMS CINDER OPENSTACK-ANSIBLE · 2019-02-26 · openstack charms openstack-ansible oslo packaging-deb packaging-rpm puppet openstack quality

Clouds using CephFS Distributed File Storage in Multi-Tenant · Public OpenStack Service API (External) network - control plane Private Storage network (data plane) Controller 0 MariaDB

Understanding Red Hat OpenStack · you will need to interact with services using both the HA framework and the systemctl command. Red Hat OpenStack ... Understanding Red Hat OpenStack

Network in OpenStack: changes since Cactus and CloudPipe HA

Build a HA IaaS Platform Based on OpenStack using …...Build a HA IaaS Platform Based on OpenStack using OpenDaylight Hideyuki Tai - OpenDaylight contributor ... OpenFlow plugin project

Access Transformation(e.g. Openstack, OpenNebula, Dockers, Kubernetes) VOLTHA SDN Control (e.g. ONOS) Data-plane Network functions Compute plane Network functions OCP OLT blades Spine

HA for OpenStack: Connecting the dots - Meetupfiles.meetup.com/2979972/HAforOpenStackDC_20140123.pdf · Rags • Solutions Architect at Rackspace for OpenStack-based Rackspace Private

SOSCON2017 From Kubernetes to OpenStack 171026 · 5 Demo System deploy node k1-master01 k1-master02 k1-master03 k1-node01 k1-node02 k1-node03 k1-node04 Label : openstack-control-plane=enabled

OpenStack Lab Environment · Packstack •Deploy of OpenStack •Puppet •Requires pre-installed machines with SSH access •No HA support •PoC . Generate answers file

Experience of developing Cloud service for Video ...2015.secrus.org/2015/files/122_konovalov.pdf · Presentation Plane . Architecture ... Ceph • Object storage: Ceph, OpenStack

Reimagining OpenStack* NA 2016.pdfNova/Cinder/Glance API subset Users & Front end Control plane Compute resources Storage Networking Horizon WebUI OpenStack CLI ciao WebUI ciao CLI

OpenStack Deployment and Operations Guide …netapp.github.io/openstack-deploy-ops-guide/ocata/openstack... · OpenStack Block Storage (project name: Cinder) OpenStack Block Storage