SF Ceph Users Jan. 2014

Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed to facilitate easy scaling without service disruption. After an introduction to Ceph itself, this talk will dive into the design of Ceph client and cluster network topologies.

Transcript of SF Ceph Users Jan. 2014

SF BAY AREA CEPH USERS GROUP

INAUGURAL MEETUP

Thursday, January 16, 2014

AGENDA

Intro to Ceph

Ceph Networking

Public Topologies

Cluster Topologies

Network Hardware

THE FORECAST

By 2020, over 39 ZB of data will be stored. 1.5 ZB are stored today.

THE PROBLEM

Existing systems don’t scale

Increasing cost and complexity

Need to invest in new platforms ahead of time

[Chart: growth of data versus IT storage budget, 2010-2020]

THE SOLUTION

PAST: SCALE UP

FUTURE: SCALE OUT

CEPH

INTRO TO CEPH

Distributed storage system

Horizontally scalable

No single point of failure

Self healing and self managing

Runs on commodity hardware

LGPLv2 license

ARCHITECTURE

SERVICE COMPONENTS (PART 1)

MONITOR

Paxos for consensus

Maintains cluster state

Typically 3-5 nodes

NOT in the write path

OSD

Object storage interface

Gossips with peers

Data lives here
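
The split between monitors and OSDs is visible from the client side: a librados client first contacts the monitors to learn the cluster maps, and only then talks directly to OSDs for data. A minimal sketch using the Python rados bindings (the conffile path is an assumption):

    import rados

    # Connecting contacts the monitors named in ceph.conf and pulls down
    # the cluster maps; no object data is read or written yet.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Cluster-wide usage statistics; no object data is transferred.
    stats = cluster.get_cluster_stats()
    print("kB used:", stats['kb_used'], "objects:", stats['num_objects'])

    cluster.shutdown()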

SERVICE COMPONENTS (PART 2)

RADOS GATEWAY

Provides S3/Swift compatibility

Scale out

METADATA (MDS)

Dynamic subtree partitioning

CRUSH

Ceph uses CRUSH for data placement

Aware of the cluster topology

Statistically even distribution across a pool

Supports asymmetric nodes and devices

Hierarchical weighting
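
As a rough illustration of the idea (this is not the real CRUSH algorithm), placement can be thought of as hashing an object name to a placement group and then deterministically picking distinct OSDs, biased by their weights. A toy Python sketch with made-up OSD names and weights:

    import hashlib
    import random

    # Hypothetical flat view of a CRUSH hierarchy: OSD name -> weight.
    OSDS = {'osd.0': 1.0, 'osd.1': 1.0, 'osd.2': 2.0, 'osd.3': 2.0}
    PG_NUM = 64
    REPLICAS = 2

    def place(obj_name):
        # Hash the object name to a placement group (stand-in hash;
        # Ceph uses its own hash function).
        pg = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % PG_NUM
        # Deterministic, weight-biased choice of distinct OSDs for the PG;
        # real CRUSH walks the bucket hierarchy instead.
        rng = random.Random(pg)
        remaining = dict(OSDS)
        chosen = []
        for _ in range(REPLICAS):
            names = list(remaining)
            pick = rng.choices(names, weights=[remaining[n] for n in names])[0]
            chosen.append(pick)
            del remaining[pick]
        return pg, chosen

    print(place('rbd_data.1234'))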

DATA PLACEMENT

POOLS

Groupings of OSDs

Both physical and logical

Volumes / Images

Hot SSD pool

Cold SATA pool

DMCrypt pool

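Pools can also be managed programmatically; a small sketch with the Python rados bindings (the pool names below are only illustrations, and CRUSH rules decide which OSDs actually back each pool):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # assumed path
    cluster.connect()

    # Hypothetical pools echoing the examples above.
    for pool in ('rbd-volumes', 'hot-ssd', 'cold-sata'):
        if not cluster.pool_exists(pool):
            cluster.create_pool(pool)

    print(cluster.list_pools())
    cluster.shutdown()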

REPLICATION

Original data durability mechanism

Ceph creates N replicas of each RADOS object

Uses CRUSH to determine replica placement

Required for mutable objects (RBD, CephFS)

More reasonable for smaller installations

ERASURE CODING

(8:4) MDS (maximum distance separable) code in this example

1.5x storage overhead

8 units of client data to write

4 parity units generated using forward error correction (FEC)

All 12 units placed with CRUSH

Any 8 of the 12 units can satisfy a read

Landing in the Firefly release
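
The overhead arithmetic generalizes: a k:m code stores k + m units for every k units of data, so overhead is (k + m) / k, and any k surviving units can reconstruct the data. A quick sketch:

    # Storage overhead for a k-data / m-parity erasure code.
    def ec_overhead(k, m):
        return (k + m) / k

    print(ec_overhead(8, 4))  # the 8:4 example above -> 1.5
    print(ec_overhead(1, 2))  # 3-way replication viewed the same way -> 3.0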

CLIENT COMPONENTS

Native API

Mutable object store

Many language bindings

Object classes

CephFS

Linux kernel CephFS client since 2.6.34

FUSE client

Hadoop JNI bindings
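
The native API is librados; a minimal Python sketch of writing and reading a mutable object (the pool name is an assumption, and the pool must already exist):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # assumed path
    cluster.connect()

    # I/O context bound to a (hypothetical) pool.
    ioctx = cluster.open_ioctx('rbd-volumes')

    # RADOS objects are mutable: write, overwrite, and read in place.
    ioctx.write_full('greeting', b'hello ceph')
    print(ioctx.read('greeting'))

    ioctx.close()
    cluster.shutdown()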

CLIENT COMPONENTS

S3/Swift

RESTful interfaces (HTTP)

CRUD operations

Usage accounting for billing

Block Storage

Linux kernel RBD client since 2.6.37+

KVM/QEMU integration

Xen integration
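
Because the RADOS Gateway speaks the S3 API, ordinary S3 client libraries work against it; a sketch using boto (the endpoint host and credentials are placeholders):

    import boto
    import boto.s3.connection

    # Hypothetical RGW endpoint and credentials.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com', port=80, is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')
    print([k.name for k in bucket.list()])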

Ceph Networking

INFINIBAND

Currently only supported via IPoIB

Accelio (libxio) integration in Ceph is in early stages

Accelio supports multiple transports: RDMA, TCP, and shared memory

Accelio supports multiple RDMA transports (IB, RoCE, iWARP)

ETHERNET

Tried and true

Proven at scale

Economical

Many suitable vendors

10GbE or 1GbE

Cost of 10GbE trending downward

White-box switches are turning up the heat on vendors

Twinax relatively inexpensive and low power

SFP+ is versatile with respect to distance

Single 10GbE for object storage

Dual 10GbE for block storage (public/cluster)

Bonding many 1GbE links adds lots of complexity

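The public/cluster split for block storage maps onto two ceph.conf options, "public network" and "cluster network". A sketch that generates such a fragment with Python's configparser (the subnets are placeholders):

    import configparser

    # ceph.conf fragment separating client-facing and replication traffic.
    conf = configparser.ConfigParser()
    conf['global'] = {
        'public network': '10.1.0.0/24',   # client and monitor traffic (example subnet)
        'cluster network': '10.2.0.0/24',  # replication and backfill traffic (example subnet)
    }

    with open('ceph.conf.sample', 'w') as f:
        conf.write(f)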

IPv4 or IPv6 Native

It’s 2014, is this really a question?

Ceph fully supports both modes of operation

Hierarchical allocation models allow "roll up" of routes

Optimal efficiency in the RIB

Some tools believe the earth is flat

LAYER 2

Spanning tree

Switch table size

Broadcast domains (ARP)

MAC frame checksum

Storage protocols (FCoE, ATAoE)

TRILL, MLAG

Layer 2 DCI is crazy pants

Layer 2 tunneled over internet is super crazy pants

LAYER 3

Address and subnet planning

Proven scale at big web shops

End-to-end error detection relies only on the TCP checksum

Equal cost multi-path (ECMP)

Reasonable for inter-site connectivity

Public Topologies

CLIENT TOPOLOGIES

Path diversity for resiliency

Minimize network diameter

Consistent hop count to minimize long-tail latency

Ease of scaling

Tolerate adversarial traffic patterns (fan-in/fan-out)

FOLDED CLOS

Sometimes called Fat Tree or Spine and Leaf

Minimum 4 fixed switches, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: two-tier folded Clos fabric with spine switches (S) over leaf switches (S), each leaf serving hosts 1..N]
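
For a sense of scale, a non-blocking two-tier folded Clos built from identical fixed switches can be sized with simple arithmetic; a sketch (the switch radixes are just examples):

    # Non-blocking two-tier leaf/spine sizing from identical fixed switches.
    def clos_hosts(ports):
        spines = ports // 2          # each leaf uses half its ports as uplinks, one per spine
        leaves = ports               # each spine port connects to one leaf
        hosts_per_leaf = ports // 2  # remaining leaf ports face hosts
        return spines, leaves, leaves * hosts_per_leaf

    for p in (48, 64):
        spines, leaves, hosts = clos_hosts(p)
        print(f"{p}-port switches: {spines} spines, {leaves} leaves, {hosts} hosts")

    # Oversubscribing leaf uplinks, or adding a third tier, pushes the
    # fabric toward the 10k+ node scale mentioned above.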

Cluster Topologies

REPLICA TOPOLOGIES

Replica and erasure fan-out

Recovery and remap impact on cluster bandwidth

OSD peering

Backfill served from primary

Tune backfills to avoid large fan-in

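Backfill fan-in can be bounded with OSD configuration options; a sketch that generates a ceph.conf fragment with Python's configparser ("osd max backfills" and "osd recovery max active" are real Ceph settings, the values here are only illustrative):

    import configparser

    conf = configparser.ConfigParser()
    conf['osd'] = {
        # Limit concurrent backfills per OSD to cap recovery fan-in.
        'osd max backfills': '1',
        # Throttle recovery operations competing with client I/O.
        'osd recovery max active': '1',
    }

    with open('ceph.conf.sample', 'w') as f:
        conf.write(f)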

FOLDED CLOS

Sometimes called Fat Tree or Spine and Leaf

Minimum 4, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: two-tier folded Clos fabric, as above]

N-WAY PARTIAL MESH

EVALUATE

Replication

Erasure coding

Special purpose vs general purpose

Extra port cost

Network Hardware

FEATURES

Buffer sizes

Cut through vs store and forward

Oversubscribed vs non-blocking

Automation and monitoring

FIXED

Fixed switches can easily build large clusters

Easier to source

Smaller failure domains

Fixed designs have many control planes

Virtual chassis... L3 split-brain hilarity?

FEWER SKUs

Utilize as few vendor SKUs as possible

If permitted, use the same fixed switch model for spine and leaf

More affordable to keep spares on site, or to keep more of them

Shorter MTTR when replacement gear is ready to go

Thanks to our host!

Kyle Bader, Sr. Solutions Architect

kyle@inktank.com
