DUDE, THIS ISN’T WHERE I PARKED MY INSTANCE?

Moving instances around your OpenStack cloud for fun and profit.

Stephen Gordon (@xsgordon)
Sr. Technical Product Manager, Red Hat

October 29th, 2015

AGENDA

● What are we moving? *
● Why are we moving instances?
● How are we moving instances?
● What new enhancements do we get in:
  ○ Liberty?
  ○ Mitaka?

* #spoileralert: instances

WHAT ARE WE MOVING?

WHAT ARE WE MOVING?
What is an instance (“server”)?

GUEST CONFIGURATION
● Guest configuration including vCPUs, memory, devices, etc.

GUEST STORAGE
● Initial image or volume.

GUEST STATE
● In-memory state.
● On-disk state.

All paths for moving instances involve moving some subset of these elements.

WHY ARE WE MOVING INSTANCES?

WHY ARE WE MOVING INSTANCES?
Moving instances is an operational tool for use...

WHEN PERFORMING NODE MAINTENANCE
● Adding hardware
● Updating software
● Response to imminent failure

IN REACTION TO NODE FAILURE
● Host lost power
● Host lost connectivity
● Host otherwise went down (e.g. DC fire)

FOR CAPACITY MANAGEMENT
● Consolidate or spread instances to save power or avoid resource contention issues respectively.

HOW ARE WE MOVING INSTANCES?

MECHANISMS FOR MOVING INSTANCES
Let me google that for you!

$ nova help | grep -E '(migrat|evacuat)'
    evacuate              Evacuate server from failed host.
    live-migration        Migrate running server to a new machine.
    migrate               Migrate a server. The new host will be..
    migration-list        Print a list of migrations.
    host-servers-migrate  Migrate all instances of the specified host to...
    host-evacuate         Evacuate all instances from failed host.
    host-evacuate-live    Live migrate all instances of the specified host to...

MECHANISMS FOR MOVING INSTANCES

EVACUATE
Rebuild, on a different compute node, an instance whose current compute node is down.

MIGRATE
Rebuild*, on a different compute node**, an instance whose current compute node is up.

LIVE-MIGRATION
Move an instance to a different compute node without downtime.

* By rebuild we really mean resize.
** This behavior changes if you turn on resizing to the same host (off by default).
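
The three definitions above boil down to two questions: is the source compute node up, and can the guest tolerate downtime? A minimal sketch of that decision follows. The function name and its two boolean parameters are hypothetical; only the returned strings are the nova subcommand names shown on the slides.

```python
def choose_move_mechanism(source_host_up: bool, downtime_acceptable: bool) -> str:
    """Suggest a nova subcommand from the slide's three definitions.

    Hypothetical helper for illustration; it is not part of Nova.
    """
    if not source_host_up:
        # A down source node leaves only evacuation (rebuild elsewhere).
        return "evacuate"
    if downtime_acceptable:
        # Cold migration shuts the instance down and uses the resize path.
        return "migrate"
    # Source is up and the guest should keep running during the move.
    return "live-migration"


if __name__ == "__main__":
    for up, ok in [(False, True), (True, True), (True, False)]:
        print(up, ok, "->", choose_move_mechanism(up, ok))
```

Treating "downtime acceptable" as the switch between migrate and live-migration is a simplification; in practice storage layout and CPU compatibility also matter.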

HELPERS FOR MOVING INSTANCES

HOST-EVACUATE
Rebuild, on another compute node, all instances whose current compute node is down.

HOST-SERVERS-MIGRATE
Rebuild*, on another compute node**, all instances whose current compute node is up.

HOST-EVACUATE-LIVE
Move all instances on a compute node to another compute node without downtime.

* By rebuild we really mean resize.
** This behavior changes if you turn on resizing to the same host (off by default).

EVACUATION

EVACUATION
nova evacuate [--password <password>] [--on-shared-storage] <server> [<host>]

● Works when the compute node hosting the instance fails due to a hardware failure or other issue.
● Rebuilds the instance on a new compute node, selected either by the scheduler or optionally by the user initiating the evacuation.
  ○ Benefit over and above starting afresh is keeping the same UUID, IP etc.
● Requires that Nova recognizes the source compute node is down.
● Requires shared storage to maintain user data on disk (not mandatory).
● Allows injecting a new admin password (if shared storage is not being used).

EVACUATION
nova evacuate [--password <password>] [--on-shared-storage] <server> [<host>]

$ nova evacuate instance-001
+-----------+--------------+
| Property  | Value        |
+-----------+--------------+
| adminPass | pjaDV46p94Nz |
+-----------+--------------+
$

COLD MIGRATION

COLD MIGRATION
nova migrate [--poll] <server>

● Works when the compute node hosting the instance is up (at least to start with…).
● Rebuilds the instance on a new host selected by the scheduler.
  ○ Actually uses the resize path in the code base.
  ○ Shuts down the instance.
  ○ Copies the disk to the new compute node.
  ○ Starts the instance there and removes it from the source hypervisor.
● Instance’s current host must be operational.
● Like resize, requires a manual confirmation step.
● Unlike evacuation and live migration, doesn’t allow specification of a target host to override the scheduler.

COLD MIGRATION
nova migrate [--poll] <server>

$ nova migrate instance-001 --poll
Server migrating... 100% complete
Finished
$ nova list
+--------------+--------------+---------------+------------+-------------+ ...
| ID           | Name         | Status        | Task State | Power State | ...
+--------------+--------------+---------------+------------+-------------+ ...
| 5819a2e0-... | instance-001 | VERIFY_RESIZE | -          | Running     | ...
+--------------+--------------+---------------+------------+-------------+ ...
$ nova resize-confirm instance-001

LIVE MIGRATION

LIVE MIGRATION
$ nova live-migration [--block-migrate] [--disk-over-commit] <server> [<host>]

● Moves a powered-on virtual machine to a new compute node without any (noticeable) downtime.
● Two approaches to live migration:
  ○ Using shared storage (including volume-based).
    ■ Requires either /var/lib/nova/instances/ to be on shared storage (e.g. NFS, GlusterFS, Ceph, etc.) across all compute nodes in the migration domain; or
    ■ Volume-backed instances
    ■ Still requires memory state transfer/sync
  ○ Using block migration.
    ■ Direct transfer/sync of not just memory state but also disks from source compute node to destination

LIVE MIGRATION - HOW IT WORKS

1. Scheduler selects the destination host, unless the user specified one.
2. Check migration source and destination (disk, RAM, CPU model, mapped volumes).
3. Iterative pre-copy, copying memory pages from the active virtual machine on the source to a new paused instance on the destination.
4. Source instance is paused while the remaining memory pages and CPU state are copied.
5. Destination instance is started, source is cleaned up.
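
The pre-copy and pause steps above can be pictured with a toy model. This is a sketch only, not Nova or QEMU code: the function name, the constant dirty rate, and the MB and MB/s units are all assumptions made for illustration. Each round re-sends the memory dirtied during the previous round, and the source is paused once the remainder fits inside the allowed downtime.

```python
def simulate_precopy(ram_mb, dirty_mb_per_s, link_mb_per_s,
                     max_downtime_ms, max_rounds=50):
    """Toy model of iterative pre-copy (hypothetical, illustration only)."""
    remaining_mb = float(ram_mb)  # the first pass copies all guest memory
    elapsed_s = 0.0
    for round_no in range(1, max_rounds + 1):
        copy_time_s = remaining_mb / link_mb_per_s
        if copy_time_s * 1000.0 <= max_downtime_ms:
            # Remainder is small enough: pause the source and copy the rest.
            return {"converged": True, "rounds": round_no,
                    "elapsed_s": elapsed_s + copy_time_s,
                    "downtime_ms": copy_time_s * 1000.0}
        elapsed_s += copy_time_s
        # Pages dirtied while this round was copying must be sent again.
        remaining_mb = min(float(ram_mb), dirty_mb_per_s * copy_time_s)
    return {"converged": False, "rounds": max_rounds,
            "elapsed_s": elapsed_s, "downtime_ms": None}


if __name__ == "__main__":
    print(simulate_precopy(3072, 100, 1000, 400))   # guest dirties slowly
    print(simulate_precopy(3072, 1000, 1000, 400))  # dirties as fast as the link
```

In this model a guest that dirties memory as fast as the link can carry it never converges, which is the long-running migration case.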

LIVE MIGRATION - HOW IT DOESN’T WORK
CPU mode/model compatibility

● Maximum performance is obtained by exposing as many host CPU features to the guest as possible.
● Live migration will fail if the destination host is not able to expose the same CPU features to guests as the source host.
● Performance versus flexibility trade-off.
● Nova provides configuration keys, including libvirt_cpu_mode, for deployers to make the performance versus flexibility trade-off for their environment:
  ○ host-passthrough
  ○ host-model
  ○ custom
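
The compatibility rule above amounts to a subset test: every CPU feature the guest sees on the source must also be available on the destination. A rough sketch, assuming feature flags are available as plain sets of strings (the helper names and the example flag sets are hypothetical, and the real check is performed by the hypervisor, not by code like this):

```python
def missing_cpu_features(guest_features, destination_features):
    """Return guest-visible features the destination cannot provide."""
    return sorted(set(guest_features) - set(destination_features))


def cpu_compatible(guest_features, destination_features):
    """True when the destination exposes every feature the guest uses."""
    return not missing_cpu_features(guest_features, destination_features)


if __name__ == "__main__":
    # Made-up flag sets for two hosts, for illustration only.
    guest = {"sse4.2", "aes", "avx"}
    older_host = {"sse4.2", "aes"}
    print(cpu_compatible(guest, older_host))        # False
    print(missing_cpu_features(guest, older_host))  # ['avx']
```

Exposing fewer features to the guest (a custom model) shrinks the left-hand set, which is why it widens the pool of valid destinations at some cost in performance.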

LIVE MIGRATION - HOW IT DOESN’T WORK
CPU mode/model compatibility

$ virsh cpu-models x86_64
...
SandyBridge
Westmere
Nehalem
...
$ grep 'libvirt_cpu_mode' /etc/nova/nova.conf
libvirt_cpu_mode = custom
libvirt_cpu_model = SandyBridge

Can also use qemu-kvm -cpu help

LIVE MIGRATION - OTHER WAYS TO FAIL

● Incompatible QEMU machine types
● Inconsistent networking configuration
  ○ Source hypervisor must be able to hit destination’s live_migration_uri and vice versa (live_migration_uri = qemu+tcp://%s/system)
● Inconsistent clocks
  ○ Synchronize clocks using ntp or chronyd
● Incompatible VNC listening addresses
● Incompatible or no SSH tunnelling configuration

LIVE MIGRATION - OTHER OPERATOR ISSUES

● Migrations take too long or fail to complete.
● Many common user operations are not supported during migration (e.g. pause).
● Need to use virsh, bypassing Nova, to:
  ○ Control a running migration (e.g. throttle or cancel)
  ○ Monitor a running migration
  ○ Tune migration max downtime
● Certain instance configurations cannot be migrated, namely instances that:
  ○ Use a config drive (e.g. config_drive_format=iso9660) or mix local/remote storage
  ○ Have passed-through devices associated with them (SR-IOV, GPU, etc.)
● Live migration doesn’t correctly account for overcommit when checking destination host validity.
● The admin initiating the migration needs to know whether shared storage is available or block migration is required.

LIBERTY

LONG RUNNING LIVE MIGRATIONS
I’m gonna let you finish...but...

● Primary factors in determining how long it will take to migrate a guest:
  ○ Amount of guest RAM
  ○ Speed with which guest RAM is being dirtied
  ○ Speed of the migration network
● Previously live migrations in OpenStack ran with fixed maximum downtime as determined by QEMU.
● As of Liberty:
  ○ The downtime allowable is scaled up exponentially (to a limit) to allow a better chance for completion.
  ○ The number of concurrent outbound live migrations is limited
  ○ The number of concurrent inbound build requests is limited
● QEMU endeavors to estimate when the number of dirty pages is low enough to finalize

LONG RUNNING LIVE MIGRATIONS
New configuration keys to control this behavior...

● Scaling downtime to finalize migration:
  ○ live_migration_downtime - Maximum permitted guest downtime for switchover (minimum 100ms)
  ○ live_migration_downtime_steps - Number of incremental steps to reach max downtime value (minimum 3)
  ○ live_migration_downtime_delay - Time to wait, in seconds, between each step in increase of max downtime (minimum 10s)
● Timeouts:
  ○ live_migration_completion_timeout - Time to wait (in seconds) for migration to complete (default 800 seconds, 0 means no timeout) - is scaled by GB of guest RAM
  ○ live_migration_progress_timeout - Time to wait (in seconds) for migration to make forward progress (default 150 seconds).
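
Because the completion timeout is scaled by guest RAM, the effective limit grows with instance size. A small sketch of that arithmetic, assuming the scaling is a plain multiplication by whole gigabytes rounded up, with a floor of 1 GB (the rounding, the floor, and the function name are assumptions, not taken from the slides):

```python
import math


def effective_completion_timeout(timeout_s, guest_ram_gb):
    """Scale the configured completion timeout by guest RAM in GB.

    Returns None when the timeout is 0, which the slide describes as
    "no timeout". Rounding up and the 1 GB floor are assumptions.
    """
    if timeout_s == 0:
        return None
    data_gb = max(1, math.ceil(guest_ram_gb))
    return timeout_s * data_gb


if __name__ == "__main__":
    print(effective_completion_timeout(800, 3))  # default timeout, 3 GB guest
    print(effective_completion_timeout(0, 3))    # timeout disabled
```

Under these assumptions the default of 800 seconds becomes 2400 seconds for a 3 GB guest.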

LONG RUNNING LIVE MIGRATIONS
New configuration keys to control this behavior...

● Concurrent operations:
  ○ max_concurrent_live_migrations - Maximum outbound live migrations to run concurrently, defaults to 1. Do not change unless absolutely sure.
  ○ max_concurrent_builds - Maximum inbound instance builds to run concurrently, defaults to 10.

LONG RUNNING LIVE MIGRATIONS EXAMPLE
400 millisecond max, 10 steps, 30 second delay, 3 GB guest

● Delay between steps is set to 30 * 3 (seconds of delay * GB of RAM).
  ○ 0 seconds -> set downtime to 37ms
  ○ 90 seconds -> set downtime to 38ms
  ○ 180 seconds -> set downtime to 39ms
  ○ 270 seconds -> set downtime to 42ms
  ○ 360 seconds -> set downtime to 46ms
  ○ 450 seconds -> set downtime to 55ms
  ○ 540 seconds -> set downtime to 70ms
  ○ 630 seconds -> set downtime to 98ms
  ○ 720 seconds -> set downtime to 148ms
  ○ 810 seconds -> set downtime to 238ms
  ○ 900 seconds -> set downtime to 400ms
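
One exponential schedule that is consistent with the numbers in this example can be sketched as follows. The formula is an assumption reconstructed to match the slide, not something the slides state, and the function name is hypothetical. The minimums (100 ms, 3 steps, 10 s) are the ones listed for the three live_migration_downtime* keys.

```python
def downtime_schedule(max_downtime_ms, steps, delay_s, guest_ram_gb):
    """Return (seconds_elapsed, allowed_downtime_ms) pairs.

    Sketch of an exponential ramp towards max_downtime_ms; the exact
    formula is assumed for illustration.
    """
    # Apply the minimums quoted for the configuration keys.
    max_downtime_ms = max(max_downtime_ms, 100)
    steps = max(steps, 3)
    delay_s = max(delay_s, 10)

    # The delay between steps is scaled by guest RAM in GB.
    step_delay_s = int(delay_s * guest_ram_gb)

    # Start just above a fixed offset and grow geometrically so that the
    # final step lands on the configured maximum.
    offset = max_downtime_ms / float(steps + 1)
    base = (max_downtime_ms - offset) ** (1 / float(steps))

    return [(int(step_delay_s * i), int(offset + base ** i))
            for i in range(steps + 1)]


if __name__ == "__main__":
    for elapsed, downtime in downtime_schedule(400, 10, 30, 3):
        print(f"{elapsed} seconds -> set downtime to {downtime}ms")
```

With the slide's settings (400 ms, 10 steps, 30 s, 3 GB) the schedule has eleven entries, 90 seconds apart, beginning (0, 37), (90, 38), (180, 39).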

MARK HOST DOWN API CALL

● Liberty provides a mechanism for external tools to report into Nova when a node has failed (“mark host down”/“force down” API call).
● As soon as the host has been explicitly marked down, evacuation can commence, triggered by the external tool.
● Used to provide “instance high availability” using e.g. Pacemaker.
  ○ http://redhatstackblog.redhat.com/2015/09/24/highly-available-virtual-machines-in-rhel-openstack-platform-7/

MITAKA AND BEYOND

CURRENTLY UNDER DISCUSSION

Short Term

● CI coverage
● Improve API documentation
● Support for migrating instances with mixed storage
● Support for pausing (and perhaps cancelling) migrations
● Better resource tracking
● Use Libvirt storage pools instead of SSH for migrate/resize.
  ○ Enabler for other work including migrating suspended instances.
● Correct memory overcommit handling for live migration.

Mid to Long Term

● TLS encryption (work underway in QEMU)
● Auto-convergence - adjusting instance activity to help complete migration
● Post copy migration - start instance at destination and then copy memory over on demand

Q & A

FAQ

● Where can I find the slides?
  ○ http://www.slideshare.net/sgordon2
● Where can I submit anonymised feedback?
  ○ Session Feedback Survey in the official OpenStack Summit App
● Where can I contact you?
  ○ Twitter: @xsgordon
  ○ Email: [email protected]
  ○ IRC: sgordon on irc.freenode.net
● How can I get involved?
  ○ https://etherpad.openstack.org/p/mitaka-live-migration

THANK YOU

plus.google.com/+RedHat

linkedin.com/company/red-hat

youtube.com/user/RedHatVideos

facebook.com/redhatinc

twitter.com/RedHatNews

@xsgordon - Stephen Gordon

RECOMMENDED READING, VIEWING, AND REFERENCES

● Outstanding work items:
  ○ Etherpad: https://etherpad.openstack.org/p/mitaka-live-migration
  ○ Bug list: https://docs.google.com/spreadsheets/d/19MFatOpjePS4JtkVHXCh6Qa8XUf6T2t0Igy1PucZ3Zk/edit#gid=2127877307
● Past presentations:
  ○ Live Migration at HP Public Cloud:
    ■ https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/live-migration-at-hp-public-cloud
  ○ Intel Dive into VM Live Migration:
    ■ https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/dive-into-vm-live-migration