Using OpenStack Swift for Extreme Data Durability



Using OpenStack Swift for extreme data durability

Florent Flament, Cloudwatt

Christian Schwede, eNovance

OpenStack Summit Paris, November 2014

Intro - Cloudwatt

● Florent Flament

● Dev & Fireman @ Cloudwatt

● Fixing & tuning of OpenStack (Cinder, Keystone, Nova, Swift)

● Email: [email protected]

● IRC: florentflament on #openstack-dev (Freenode)

● Twitter: @florentflament_

● Blogs: http://dev.cloudwatt.com / http://www.florentflament.com

Intro - eNovance

● Christian Schwede

● Developer @ eNovance / Red Hat

● Mostly working on Swift, testing, automation and developer tools

● Swift Core

● IRC: cschwede in #openstack-swift

● Email: [email protected] / [email protected]

● Twitter: @cschwede_de

Architecture

[Diagram: two proxy nodes connected through the network to storage nodes holding many disks]

[Diagram: the same cluster with its disks grouped into Zone 0, Zone 1 and Zone 2]

[Diagram: the cluster split into Region 0 (⅔ of the data) and Region 1 (⅓ of the data), each region with its own proxy nodes, network, zones and disks]

The Ring

Ring: the Map of data

● One Ring file per type of data (account, container, object). Ring files map each copy of a piece of data to a physical device through partitions.

● An object's partition number is computed from the hash of the object's name (see the lookup sketch below).

● A Ring file is: a (replica, partition) to device ID table, a devices table and a number of hash bits.

● Visualize a Ring: https://github.com/victorlin/swiftsense
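As an aside, on a node with Swift installed and the ring files deployed, this mapping can be queried directly with Swift's Ring class. A minimal sketch, assuming the usual /etc/swift location and placeholder account/container/object names:

```python
# Sketch: look up the partition and the devices holding an object's replicas.
from swift.common.ring import Ring

ring = Ring('/etc/swift', ring_name='object')
part, nodes = ring.get_nodes('AUTH_demo', 'photos', 'cat.jpg')

print('partition:', part)
for node in nodes:
    print(node['ip'], node['port'], node['device'])
```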

Concrete example of Ring

Replica & Partition to Device ID table (rows = replica number, columns = partition number):

              Partition:  0  1  2  3  4  5  6  7
  Replica 0:              0  1  2  3  0  1  2  3
  Replica 1:              1  2  3  0  1  2  3  0
  Replica 2:              2  3  0  1  2  3  0  1

Devices table:

  ID  Host          Port  Device
  0   192.168.0.10  6000  sdb1
  1   192.168.0.10  6000  sdc1
  2   192.168.0.11  6000  sdb1
  3   192.168.0.11  6000  sdc1

Bit count (partition power) = 3 → 2³ = 8 partitions
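To make the lookup concrete, here is a small sketch that reproduces the two tables above and maps an object name to its three devices. It is simplified: real Swift also mixes cluster-wide hash path prefix/suffix values into the md5, and the account/container/object names are placeholders.

```python
# Sketch: hash the object's path, take the top PART_POWER bits as the
# partition, then pick one device per replica from the table.
from hashlib import md5
import struct

PART_POWER = 3  # 2**3 = 8 partitions

REPLICA2PART2DEV = [            # partitions 0..7
    [0, 1, 2, 3, 0, 1, 2, 3],   # replica 0
    [1, 2, 3, 0, 1, 2, 3, 0],   # replica 1
    [2, 3, 0, 1, 2, 3, 0, 1],   # replica 2
]

DEVICES = [
    {'id': 0, 'host': '192.168.0.10', 'port': 6000, 'device': 'sdb1'},
    {'id': 1, 'host': '192.168.0.10', 'port': 6000, 'device': 'sdc1'},
    {'id': 2, 'host': '192.168.0.11', 'port': 6000, 'device': 'sdb1'},
    {'id': 3, 'host': '192.168.0.11', 'port': 6000, 'device': 'sdc1'},
]

def get_nodes(account, container, obj):
    path = '/%s/%s/%s' % (account, container, obj)
    key = md5(path.encode('utf-8')).digest()
    part = struct.unpack_from('>I', key)[0] >> (32 - PART_POWER)
    return part, [DEVICES[row[part]] for row in REPLICA2PART2DEV]

part, nodes = get_nodes('AUTH_demo', 'photos', 'cat.jpg')
print(part, [(d['host'], d['device']) for d in nodes])
```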

Storage policies

● Included in the Juno release (Swift > 2.0.0)

● Applied on a per-container basis (container creation is sketched after this list)

● Flexibility to use multiple rings, for example:

○ Basic: 2 replicas on spinning disks, single datacenter

○ Strong: 3 replicas in three different datacenters around the globe

○ Fast: 3 replicas on SSDs and much more powerful proxies
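A minimal sketch of how a container gets tied to a policy, assuming the cluster has a policy named "strong" defined in swift.conf; the storage URL, token and container name are placeholders:

```python
# Sketch: choose a storage policy at container creation time via the
# X-Storage-Policy header. Objects stored in the container then follow
# that policy's ring.
import requests

storage_url = 'https://swift.example.com/v1/AUTH_demo'   # from your auth response
token = 'AUTH_tk_placeholder'

resp = requests.put(
    storage_url + '/important-backups',
    headers={
        'X-Auth-Token': token,
        'X-Storage-Policy': 'strong',   # must match a policy name in swift.conf
    },
)
resp.raise_for_status()   # 201 Created, or 202 Accepted if it already exists
```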

Availability & Durability

Object durability

● Disk failures: p_d ~ 2-5% per year

● Unrecoverable bit read errors: p_b = 10⁻¹⁵ ⋅ 8 ⋅ object_size (in bytes)

[Diagram: state chain 3 replicas → 2 replicas → 1 replica → data loss; failures move right, replication moves back left]

● Durability in the range of 10-11 nines with 3 replicas (99.99999999%); a simplified version of this model is sketched below

● http://enovance.github.io/swift-durability-calculator/
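As a very rough, hedged illustration of the reasoning only (the linked calculator models this far more carefully, with the number of disks, servers, partitions and rebuild speed), a back-of-envelope sketch with assumed failure and recovery figures:

```python
# Back-of-envelope sketch: an object is lost if, after one replica's disk
# fails, the other two replicas' disks also fail before re-replication
# completes. Failure rate and recovery window are assumptions.
import math

annual_disk_failure_rate = 0.04      # p_d ~ 2-5% per year (assume 4%)
recovery_time_days = 1.0             # assumed time to re-replicate a failed disk

p_fail_in_window = annual_disk_failure_rate * recovery_time_days / 365.0
p_object_loss_per_year = annual_disk_failure_rate * p_fail_in_window ** 2

# unrecoverable bit read errors: p_b = 10^-15 * 8 * object size in bytes
p_bit_error = 1e-15 * 8 * 1024 ** 2  # for a 1 MiB object

print('p(object loss) per year ~ %.1e' % p_object_loss_per_year)
print('~%d nines of durability' % int(-math.log10(p_object_loss_per_year)))
print('p(unrecoverable read)  ~ %.1e' % p_bit_error)
```

This crude model ignores how partitions are spread across many disks and how fast parallel replication really is; the full calculator, which accounts for such factors, is what the 10-11 nines figure above is based on.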

Recover from a disk failure

Set failed device weight to 0, rebalance, push new ring

[Diagram: the cluster with one disk marked as failed]
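A minimal sketch of that recovery step, assuming the failed disk is device id 3 in the object builder (look the id up with `swift-ring-builder object.builder`); the builder file name and device id are placeholders:

```python
# Sketch: drop the failed device's weight to 0, rebalance, then push the
# resulting object.ring.gz to every proxy and storage node.
import subprocess

def rb(*args):
    subprocess.run(['swift-ring-builder', 'object.builder', *args], check=True)

rb('set_weight', 'd3', '0')   # stop assigning partitions to the failed device
rb('rebalance')               # its partitions are reassigned to healthy devices
# finally, distribute the new ring file to all nodes
```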

Object availability & durability

[Diagram: the two-region, multi-zone cluster from the Architecture section]

Maintenance

Maintainability by Simplicity

● Standalone `swift-ring-builder` tool to manipulate the Ring (example below)

○ Uses `builder` files to keep architectural information about the cluster

○ Smartly assigns partitions to devices

○ Generates Ring files that are easy to check

● Processes on Swift nodes focus on ensuring that files are stored uncorrupted at the appropriate location
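A minimal sketch of the ring-building workflow this tool provides, driven from Python for convenience; the builder file name, part power, replica count, IPs, ports, device names and weights are illustrative assumptions, not values from the talk:

```python
# Sketch: build a small object ring with the standalone swift-ring-builder
# tool. All names, addresses and weights here are placeholders.
import subprocess

def rb(*args):
    subprocess.run(['swift-ring-builder', 'object.builder', *args], check=True)

# create <part_power> <replicas> <min_part_hours>
rb('create', '10', '3', '1')

# add devices: r<region>z<zone>-<ip>:<port>/<device> <weight>
for ip in ('192.168.0.10', '192.168.0.11', '192.168.0.12'):
    for dev in ('sdb1', 'sdc1'):
        rb('add', 'r1z1-%s:6000/%s' % (ip, dev), '100')

# assign partitions to devices and write object.ring.gz
rb('rebalance')
```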

Splitting a running Swift Cluster

● Ensuring no data is lost

○ Move only 1 replica at a time

○ Small steps to limit the impact

○ Check for data corruption

○ Check data location (see the check sketch after this list)

○ Rollback in case of failure

● Limiting the impact on performance

○ Availability of cluster resources

○ Load incurred by cluster being split

○ Small steps to limit the impact

○ Control nodes accessed by users
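The "check" bullets above can be covered with tools that ship with Swift. A hedged sketch (swift-dispersion-report assumes dispersion.conf is configured and swift-dispersion-populate has been run beforehand):

```python
# Sketch: verify ring consistency across nodes and check that replicas are
# where the ring expects them. Assumes recon and dispersion tooling is set up.
import subprocess

subprocess.run(['swift-recon', '--md5'], check=True)       # compare ring/swift.conf md5sums across nodes
subprocess.run(['swift-dispersion-report'], check=True)    # % of replicas found at their expected location
```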

Natively available in Swift


Small steps

New in Swift 2.2!

Example of process (sketched in code below):

1. Add devices to new region with a very low weight

2. Increase devices’ weights to store 5% of data in the new region

3. Progressively increase the amount of data in the new region in steps of 5%

More details: http://www.florentflament.com/blog/splitting-swift-cluster.html
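A hedged sketch of that process, again via swift-ring-builder; the IP range, device names, step weights and final weight are illustrative assumptions that depend on the existing cluster's total weight:

```python
# Sketch: add the new region's devices with a tiny weight, then raise their
# weights in small steps, rebalancing (and pushing the ring, then waiting for
# replication to settle) between steps.
import subprocess

def rb(*args):
    subprocess.run(['swift-ring-builder', 'object.builder', *args], check=True)

new_devices = ['r2z0-10.1.0.%d:6000/sdb1' % i for i in range(10, 16)]

# 1. add the new region's devices with a very low weight
for dev in new_devices:
    rb('add', dev, '1')
rb('rebalance')

# 2. / 3. progressively raise the weights towards their final value
for weight in ('25', '50', '75', '100'):
    for dev in new_devices:
        rb('set_weight', dev, weight)   # the device string doubles as a search value
    rb('rebalance')
    # push object.ring.gz to all nodes and let replication finish before
    # moving on to the next step
```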

Adding a new region

Add a new region smoothly by limiting the amount of data moved:

● Really possible since Swift 2.2

● Final weight in the new region should be at least ⅓ of the total cluster weight

Outlook & Summary

Erasure coding

● Coming real soon now

● Instead of N copies of each object:

○ apply EC to the object, split it into multiple fragments, for example 14 (sketched after this list)

○ store them on different disks/nodes

○ objects can be rebuilt from any 10 fragments

■ Tolerates loss of 4 fragments

● higher durability

■ Only ~ 40% overhead (compared to 200%)

● much cheaper
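As a rough illustration of the 10 + 4 scheme above, a sketch using PyECLib, the library Swift's erasure coding support builds on; it assumes pyeclib and the liberasurecode backend are installed, and the backend name is just one common choice:

```python
# Sketch: 10 data + 4 parity fragments; any 10 of the 14 rebuild the object.
from pyeclib.ec_iface import ECDriver

driver = ECDriver(k=10, m=4, ec_type='liberasurecode_rs_vand')

data = b'x' * (1024 * 1024)        # a 1 MiB example object
fragments = driver.encode(data)    # 14 fragments to spread over disks/nodes

# lose any 4 fragments: the object can still be rebuilt from the other 10
surviving = fragments[4:]
assert driver.decode(surviving) == data

overhead = len(fragments) / 10.0 - 1
print('storage overhead ~ %.0f%%' % (overhead * 100))   # ~40% vs. 200% for 3 replicas
```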

Durability calculation

● More detailed calculation

○ Number of disks, servers, partitions

● Add erasure coding

● Include in Swift documentation?

● Community effort

○ Discussion started last Swift hackathon

■ NTT, Swiftstack, IBM, Seagate, Red Hat / eNovance

○ Ad-Hoc session on Thursday/Friday - join us!

Summary

● High availability, even if large parts of the cluster are not accessible

● Automatic failure correction ensures high durability and, depending on your cluster configuration, exceeds known industry standards

● Swift 2.2 (Juno release)

○ Even smoother and more predictable cluster upgrades

○ Storage Policies allow fine-grained data placement control

● Erasure Coding will increase durability even more while lowering costs