SF Ceph Users Jan. 2014

SF Bay Area Ceph Users Group Inaugural Meetup (Thursday, January 16, 2014)

Description

As a distributed storage system, Ceph relies heavily on the network for resiliency and performance. It is also crucial that the network topology beneath a Ceph cluster be designed to allow easy scaling without service disruption. After an introduction to Ceph itself, this talk dives into the design of Ceph client and cluster network topologies.

Transcript of SF Ceph Users Jan. 2014

Page 1: SF Ceph Users Jan. 2014

SF BAY AREA CEPH USERS GROUP

INAUGURAL MEETUP

Thursday, January 16, 14

Page 2: SF Ceph Users Jan. 2014

AGENDA

Intro to Ceph

Ceph Networking

Public Topologies

Cluster Topologies

Network Hardware

Page 3: SF Ceph Users Jan. 2014

THE FORECAST

By 2020, over 39 ZB of data will be stored. 1.5 ZB are stored today.

Page 4: SF Ceph Users Jan. 2014

THE PROBLEM

Existing systems don’t scale

Increasing cost and complexity

Need to invest in new platforms ahead of time

[Chart: growth of data outpacing the IT storage budget, 2010-2020]

Page 5: SF Ceph Users Jan. 2014

THE SOLUTION

PAST: SCALE UP

FUTURE: SCALE OUT


Page 6: SF Ceph Users Jan. 2014

CEPH

Page 7: SF Ceph Users Jan. 2014

INTRO TO CEPH


Distributed storage system

Horizontally scalable

No single point of failure

Self healing and self managing

Runs on commodity hardware

LGPL v2.1 license

Page 8: SF Ceph Users Jan. 2014

ARCHITECTURE


Page 9: SF Ceph Users Jan. 2014

SERVICE COMPONENTS (PART 1)

MONITOR

PAXOS for consensus

Maintain cluster state

Typically 3-5 nodes

NOT in write path

OSD

Object storage interface

Gossips with peers

Data lives here
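
As a rough illustration of the monitor's role, a minimal sketch using the python-rados bindings (assuming a reachable cluster, a local /etc/ceph/ceph.conf, and usable credentials); it asks the monitor quorum for cluster state, the kind of traffic monitors do handle, while object reads and writes go straight to the OSDs:

    import json
    import rados

    # Connect using the local ceph.conf and default credentials (assumed).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for cluster status; data I/O never passes through
    # the monitors, only cluster-state requests like this one.
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'status', 'format': 'json'}), b'')
    status = json.loads(outbuf)
    print(sorted(status.keys()))   # e.g. health, monmap, osdmap, pgmap, ...

    cluster.shutdown()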

Page 10: SF Ceph Users Jan. 2014

SERVICE COMPONENTS (PART 2)

METADATA (MDS)

Dynamic subtree partitioning

RADOS GATEWAY

Provides S3/Swift compatibility

Scale out

Page 11: SF Ceph Users Jan. 2014

CRUSH

Ceph uses CRUSH for data placement

Aware of cluster topology

Statistically even distribution across pool

Supports asymmetric nodes and devices

Hierarchical weighting
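
The important property is that placement is a deterministic function of the object name, the cluster map, and the rule, so any client can compute it locally. The toy sketch below is weighted rendezvous hashing, not the real CRUSH algorithm; it only illustrates the same idea of deterministic, weight-aware selection (OSD names and weights are hypothetical):

    import hashlib

    osds = {'osd.0': 1.0, 'osd.1': 1.0, 'osd.2': 0.5}   # hypothetical weights

    def draw(obj, osd, weight):
        # Deterministic pseudo-random value in (0, 1], skewed by weight.
        h = int(hashlib.md5((obj + osd).encode()).hexdigest(), 16)
        u = (h + 1) / float(2 ** 128)
        return u ** (1.0 / weight)

    def place(obj, replicas=2):
        # Highest draws win; the same inputs always give the same placement.
        ranked = sorted(osds, key=lambda o: draw(obj, o, osds[o]), reverse=True)
        return ranked[:replicas]

    print(place('rbd_data.1234'))   # stable, roughly weight-proportional mapping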

Page 12: SF Ceph Users Jan. 2014

DATA PLACEMENT


Page 13: SF Ceph Users Jan. 2014

POOLS


Groupings of OSDs

Both physical and logical

Volumes / Images

Hot SSD pool

Cold SATA pool

DMCrypt pool

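For example, a hot SSD pool and a cold SATA pool are just two pool names whose CRUSH rules (defined separately) map to different devices; clients simply target the pool they want. A minimal python-rados sketch, with hypothetical pool names and assuming admin capabilities:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Create the two logical groupings if they do not exist yet.
    for pool in ('hot-ssd', 'cold-sata'):
        if not cluster.pool_exists(pool):
            cluster.create_pool(pool)

    # I/O is directed at a pool by opening an I/O context on it.
    ioctx = cluster.open_ioctx('hot-ssd')
    ioctx.write_full('greeting', b'stored on the hot tier')
    ioctx.close()
    cluster.shutdown()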

Page 14: SF Ceph Users Jan. 2014

REPLICATION


Original data durability mechanism

Ceph creates N replicas of each RADOS object

Uses CRUSH to determine replica placement

Required for mutable objects (RBD, CephFS)

More reasonable for smaller installations

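A sketch of the ceph.conf defaults behind replicated pools (values shown are illustrative, not recommendations):

    [global]
        ; Keep N = 3 copies of every RADOS object by default.
        osd pool default size = 3
        ; Keep serving I/O as long as at least 2 copies are available.
        osd pool default min size = 2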

Page 15: SF Ceph Users Jan. 2014

ERASURE CODING


An (8, 4) MDS (maximum distance separable) code in this example

1.5x overhead

8 units of client data to write

4 parity units generated using FEC

All 12 units placed with CRUSH

Any 8 of the 12 units can satisfy a read

Firefly Release
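
Worked numbers for the (8, 4) code above, as a small Python sketch:

    k, m = 8, 4                      # 8 data units, 4 parity units
    overhead = (k + m) / float(k)    # 12 units stored per 8 units of data
    print(overhead)                  # 1.5x, versus 3.0x for 3-way replication
    print(k + m)                     # 12 placements chosen by CRUSH per object
    print(k)                         # any 8 of the 12 units reconstruct a read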

Page 16: SF Ceph Users Jan. 2014

CLIENT COMPONENTS

Native API

Mutable object store

Many language bindings

Object classes

CephFS

Linux Kernel CephFS client since 2.6.34

FUSE client

Hadoop JNI bindings
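
A minimal native API sketch through one of those language bindings, python-rados, assuming default admin credentials and an existing pool named 'data' (the pool name is hypothetical):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')
    ioctx.write_full('hello-object', b'Hello, RADOS')   # objects are mutable
    print(ioctx.read('hello-object'))                   # -> b'Hello, RADOS'
    ioctx.close()
    cluster.shutdown()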

Page 17: SF Ceph Users Jan. 2014

CLIENT COMPONENTS

Block Storage

Linux Kernel RBD client since 2.6.37

KVM/QEMU integration

Xen integration

S3/SWIFT

RESTful interfaces (HTTP)

CRUD operations

Usage accounting for billing

Page 18: SF Ceph Users Jan. 2014

Ceph Networking

Page 19: SF Ceph Users Jan. 2014

INFINIBAND

Currently only supported via IPoIB

Accelio (libxio) integration in Ceph is in early stages

Accelio supports multiple transports: RDMA, TCP, and shared memory

Accelio supports multiple RDMA transports (IB, RoCE, iWARP)


Page 20: SF Ceph Users Jan. 2014

ETHERNET

Tried and true

Proven at scale

Economical

Many suitable vendors


Page 21: SF Ceph Users Jan. 2014

10GbE or 1GbE

Cost of 10GbE trending downward

White-box switches are turning up the heat on vendors

Twinax relatively inexpensive and low power

SFP+ is versatile with respect to distance

Single 10GbE for object storage

Dual 10GbE for block storage (public/cluster networks; see the config sketch below)

Bonding many 1GbE links adds lots of complexity
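
The dual-10GbE layout maps onto two ceph.conf options, sketched here with placeholder subnets: client traffic uses the public network, while replication, recovery, and backfill between OSDs use the cluster network.

    [global]
        ; Clients and monitors reach OSDs on this network.
        public network = 10.1.0.0/16
        ; OSD-to-OSD replication and recovery traffic stays on this one.
        cluster network = 10.2.0.0/16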

Page 22: SF Ceph Users Jan. 2014

IPv4 or IPv6 Native

It’s 2014, is this really a question?

Ceph fully supports both modes of operation

Hierarchical allocation models allow "roll up" of routes

Optimal efficiency in the RIB

Some tools believe the earth is flat
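
For native IPv6, the messengers can be told to bind to IPv6 addresses; a ceph.conf sketch using documentation-range prefixes as placeholders, where per-rack prefixes can roll up into one aggregate:

    [global]
        ; Bind Ceph daemons to IPv6 addresses.
        ms bind ipv6 = true
        public network = 2001:db8:1::/48
        cluster network = 2001:db8:2::/48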

Page 23: SF Ceph Users Jan. 2014

LAYER 2

Spanning tree

Switch table size

Broadcast domains (ARP)

MAC frame checksum

Storage protocols (FCoE, ATAoE)

TRILL, MLAG

Layer 2 DCI is crazy pants

Layer 2 tunneled over internet is super crazy pants


Page 24: SF Ceph Users Jan. 2014

LAYER 3

Address and subnet planning

Proven scale at big web shops

Error detection only via IP/TCP checksums

Equal cost multi-path (ECMP)

Reasonable for inter-site connectivity


Page 25: SF Ceph Users Jan. 2014

Public Topologies

Page 26: SF Ceph Users Jan. 2014

CLIENT TOPOLOGIES


Path diversity for resiliency

Minimize network diameter

Consistent hop count to minimize network long-tail latency

Ease of scaling

Tolerate adversarial traffic patterns (fan-in/fan-out)


Page 27: SF Ceph Users Jan. 2014

FOLDED CLOS


Sometimes called Fat Tree or Spine and Leaf

Minimum 4 fixed switches, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: folded Clos fabric with spine switches (S) interconnecting leaf switches (S), each leaf serving nodes 1..N]
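
A back-of-the-envelope sizing sketch for a non-blocking two-tier folded Clos built from one fixed switch model with p ports: each leaf splits its ports half down to nodes and half up to the spines, and every spine port feeds a distinct leaf, so capacity grows as p^2/2 node ports.

    def clos_size(p):
        """Non-blocking two-tier folded Clos from p-port fixed switches."""
        spines = p // 2              # one leaf uplink per spine
        leaves = p                   # each spine port feeds a distinct leaf
        node_ports = leaves * (p // 2)
        return spines, leaves, node_ports

    # With 32-port switches: up to 16 spines + 32 leaves = 48 switches and
    # 512 non-blocking node ports; smaller fabrics simply use fewer leaves.
    print(clos_size(32))             # -> (16, 32, 512)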

Page 28: SF Ceph Users Jan. 2014

Cluster Topologies

Page 29: SF Ceph Users Jan. 2014

REPLICA TOPOLOGIES


Replica and erasure fan-out

Recovery and remap impact on cluster bandwidth

OSD peering

Backfill served from primary

Tune backfills to avoid large fan-in (see the config sketch below)
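
A sketch of the ceph.conf throttles commonly tuned to keep recovery and backfill from saturating the cluster network during remaps (values are illustrative only):

    [osd]
        ; At most this many concurrent backfills per OSD.
        osd max backfills = 1
        ; Limit concurrent recovery operations per OSD.
        osd recovery max active = 1
        ; Deprioritize recovery ops relative to client I/O.
        osd recovery op priority = 1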

Page 30: SF Ceph Users Jan. 2014

FOLDED CLOS


Sometimes called Fat Tree or Spine and Leaf

Minimum 4 fixed switches, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: folded Clos fabric with spine switches (S) interconnecting leaf switches (S), each leaf serving nodes 1..N]

Page 31: SF Ceph Users Jan. 2014

N-WAY PARTIAL MESH


Page 32: SF Ceph Users Jan. 2014

EVALUATE


Replication

Erasure coding

Special purpose vs general purpose

Extra port cost


Page 33: SF Ceph Users Jan. 2014

Network Hardware

Page 34: SF Ceph Users Jan. 2014

FEATURES

Buffer sizes

Cut through vs store and forward

Oversubscribed vs non-blocking

Automation and monitoring


Page 35: SF Ceph Users Jan. 2014

FIXED


Large clusters can easily be built from fixed switches

Easier to source

Smaller failure domains

Fixed designs have many control planes

Virtual chassis... L3 split-brain hilarity?

Page 36: SF Ceph Users Jan. 2014

FEWER SKUs

Utilize as few vendor SKUs as possible

If permitted, use same fixed switch for spine and leaf

More affordable to keep spares on site, or to keep more of them

Quicker MTTR when replacement gear is ready to go

Page 37: SF Ceph Users Jan. 2014

Thanks to our host!


Page 38: SF Ceph Users Jan. 2014

Kyle Bader, Sr. Solutions Architect

[email protected]
