SF Ceph Users Jan. 2014

SF Bay Area Ceph Users Group Inaugural Meetup (Thursday, January 16, 2014)

Description

As a distributed storage system, Ceph relies heavily on the network for resiliency and performance. It is also crucial that the network topology beneath a Ceph cluster be designed to allow easy scaling without service disruption. After an introduction to Ceph itself, this talk dives into the design of Ceph client and cluster network topologies.

Transcript of SF Ceph Users Jan. 2014

Page 1: SF Ceph Users Jan. 2014

SF BAY AREA CEPH USERS GROUP

INAUGURAL MEETUP

Thursday, January 16, 14

Page 2: SF Ceph Users Jan. 2014

AGENDA

Intro to Ceph

Ceph Networking

Public Topologies

Cluster Topologies

Network Hardware

Page 3: SF Ceph Users Jan. 2014

THE FORECAST

By 2020, over 39 ZB of data will be stored. 1.5 ZB are stored today.

Page 4: SF Ceph Users Jan. 2014

THE PROBLEM

Existing systems don’t scale

Increasing cost and complexity

Need to invest in new platforms ahead of time

[Chart: growth of data outpacing the IT storage budget, 2010-2020]

Page 5: SF Ceph Users Jan. 2014

THE SOLUTION

PAST: SCALE UP

FUTURE: SCALE OUT


Page 6: SF Ceph Users Jan. 2014

CEPH

Page 7: SF Ceph Users Jan. 2014

INTRO TO CEPH


Distributed storage system

Horizontally scalable

No single point of failure

Self healing and self managing

Runs on commodity hardware

LGPL v2.1 license

Page 8: SF Ceph Users Jan. 2014

ARCHITECTURE


Page 9: SF Ceph Users Jan. 2014

SERVICE COMPONENTS (PART 1)

MONITOR

PAXOS for consensus

Maintain cluster state

Typically 3-5 nodes

NOT in write path

OSD

Object storage interface

Gossips with peers

Data lives here
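
As a rough illustration of the monitor's role, a minimal sketch using the python-rados bindings (assuming a reachable cluster, a local /etc/ceph/ceph.conf, and usable credentials); it asks the monitor quorum for cluster state, the kind of traffic monitors do handle, while object reads and writes go straight to the OSDs:

    import json
    import rados

    # Connect using the local ceph.conf and default credentials (assumed).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for cluster status; data I/O never passes through
    # the monitors, only cluster-state requests like this one.
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'status', 'format': 'json'}), b'')
    status = json.loads(outbuf)
    print(sorted(status.keys()))   # e.g. health, monmap, osdmap, pgmap, ...

    cluster.shutdown()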

Page 10: SF Ceph Users Jan. 2014

SERVICE COMPONENTS (PART 2)

METADATA (MDS)

Dynamic subtree partitioning

RADOS GATEWAY

Provides S3/Swift compatibility

Scale out

Page 11: SF Ceph Users Jan. 2014

CRUSH

Ceph uses CRUSH for data placement

Aware of cluster topology

Statistically even distribution across pool

Supports asymmetric nodes and devices

Hierarchical weighting
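
The important property is that placement is a deterministic function of the object name, the cluster map, and the rule, so any client can compute it locally. The toy sketch below is weighted rendezvous hashing, not the real CRUSH algorithm; it only illustrates the same idea of deterministic, weight-aware selection (OSD names and weights are hypothetical):

    import hashlib

    osds = {'osd.0': 1.0, 'osd.1': 1.0, 'osd.2': 0.5}   # hypothetical weights

    def draw(obj, osd, weight):
        # Deterministic pseudo-random value in (0, 1], skewed by weight.
        h = int(hashlib.md5((obj + osd).encode()).hexdigest(), 16)
        u = (h + 1) / float(2 ** 128)
        return u ** (1.0 / weight)

    def place(obj, replicas=2):
        # Highest draws win; the same inputs always give the same placement.
        ranked = sorted(osds, key=lambda o: draw(obj, o, osds[o]), reverse=True)
        return ranked[:replicas]

    print(place('rbd_data.1234'))   # stable, roughly weight-proportional mapping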

Page 12: SF Ceph Users Jan. 2014

DATA PLACEMENT


Page 13: SF Ceph Users Jan. 2014

POOLS


Groupings of OSDs

Both physical and logical

Volumes / Images

Hot SSD pool

Cold SATA pool

DMCrypt pool

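For example, a hot SSD pool and a cold SATA pool are just two pool names whose CRUSH rules (defined separately) map to different devices; clients simply target the pool they want. A minimal python-rados sketch, with hypothetical pool names and assuming admin capabilities:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Create the two logical groupings if they do not exist yet.
    for pool in ('hot-ssd', 'cold-sata'):
        if not cluster.pool_exists(pool):
            cluster.create_pool(pool)

    # I/O is directed at a pool by opening an I/O context on it.
    ioctx = cluster.open_ioctx('hot-ssd')
    ioctx.write_full('greeting', b'stored on the hot tier')
    ioctx.close()
    cluster.shutdown()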

Page 14: SF Ceph Users Jan. 2014

REPLICATION


Original data durability mechanism

Ceph creates N replicas of each RADOS object

Uses CRUSH to determine replica placement

Required for mutable objects (RBD, CephFS)

More reasonable for smaller installations

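A sketch of the ceph.conf defaults behind replicated pools (values shown are illustrative, not recommendations):

    [global]
        ; Keep N = 3 copies of every RADOS object by default.
        osd pool default size = 3
        ; Keep serving I/O as long as at least 2 copies are available.
        osd pool default min size = 2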

Page 15: SF Ceph Users Jan. 2014

ERASURE CODING


An (8, 4) MDS (maximum distance separable) code in this example

1.5x overhead

8 units of client data to write

4 parity units generated using FEC

All 12 units placed with CRUSH

Any 8 of the 12 units can satisfy a read

Firefly Release
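
Worked numbers for the (8, 4) code above, as a small Python sketch:

    k, m = 8, 4                      # 8 data units, 4 parity units
    overhead = (k + m) / float(k)    # 12 units stored per 8 units of data
    print(overhead)                  # 1.5x, versus 3.0x for 3-way replication
    print(k + m)                     # 12 placements chosen by CRUSH per object
    print(k)                         # any 8 of the 12 units reconstruct a read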

Page 16: SF Ceph Users Jan. 2014

CLIENT COMPONENTS

Native API

Mutable object store

Many language bindings

Object classes

CephFS

Linux Kernel CephFS client since 2.6.34

FUSE client

Hadoop JNI bindings
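
A minimal native API sketch through one of those language bindings, python-rados, assuming default admin credentials and an existing pool named 'data' (the pool name is hypothetical):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')
    ioctx.write_full('hello-object', b'Hello, RADOS')   # objects are mutable
    print(ioctx.read('hello-object'))                   # -> b'Hello, RADOS'
    ioctx.close()
    cluster.shutdown()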

Page 17: SF Ceph Users Jan. 2014

CLIENT COMPONENTS

Block Storage

Linux Kernel RBD client since 2.6.37

KVM/QEMU integration

Xen integration

S3/SWIFT

RESTful interfaces (HTTP)

CRUD operations

Usage accounting for billing

Page 18: SF Ceph Users Jan. 2014

Ceph Networking

Page 19: SF Ceph Users Jan. 2014

INFINIBAND

Currently only supported via IPoIB

Accelio (libxio) integration in Ceph is in early stages

Accelio supports multiple transports: RDMA, TCP, and shared memory

Accelio supports multiple RDMA transports (IB, RoCE, iWARP)


Page 20: SF Ceph Users Jan. 2014

ETHERNET

Tried and true

Proven at scale

Economical

Many suitable vendors


Page 21: SF Ceph Users Jan. 2014

10GbE or 1GbE

Cost of 10GbE trending downward

White-box switches are turning up the heat on vendors

Twinax relatively inexpensive and low power

SFP+ is versatile with respect to distance

Single 10GbE for object storage

Dual 10GbE for block storage (public/cluster networks; see the config sketch below)

Bonding many 1GbE links adds lots of complexity
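
The dual-10GbE layout maps onto two ceph.conf options, sketched here with placeholder subnets: client traffic uses the public network, while replication, recovery, and backfill between OSDs use the cluster network.

    [global]
        ; Clients and monitors reach OSDs on this network.
        public network = 10.1.0.0/16
        ; OSD-to-OSD replication and recovery traffic stays on this one.
        cluster network = 10.2.0.0/16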

Page 22: SF Ceph Users Jan. 2014

IPv4 or IPv6 Native

It’s 2014, is this really a question?

Ceph fully supports both modes of operation

Hierarchical allocation models allow "roll up" of routes

Optimal efficiency in the RIB

Some tools believe the earth is flat
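
For native IPv6, the messengers can be told to bind to IPv6 addresses; a ceph.conf sketch using documentation-range prefixes as placeholders, where per-rack prefixes can roll up into one aggregate:

    [global]
        ; Bind Ceph daemons to IPv6 addresses.
        ms bind ipv6 = true
        public network = 2001:db8:1::/48
        cluster network = 2001:db8:2::/48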

Page 23: SF Ceph Users Jan. 2014

LAYER 2

Spanning tree

Switch table size

Broadcast domains (ARP)

MAC frame checksum

Storage protocols (FCoE, ATAoE)

TRILL, MLAG

Layer 2 DCI is crazy pants

Layer 2 tunneled over internet is super crazy pants


Page 24: SF Ceph Users Jan. 2014

LAYER 3

Address and subnet planning

Proven scale at big web shops

Error detection only via IP/TCP checksums

Equal cost multi-path (ECMP)

Reasonable for inter-site connectivity


Page 25: SF Ceph Users Jan. 2014

Public Topologies

Page 26: SF Ceph Users Jan. 2014

CLIENT TOPOLOGIES


Path diversity for resiliency

Minimize network diameter

Consistent hop count to minimize network long-tail latency

Ease of scaling

Tolerate adversarial traffic patterns (fan-in/fan-out)


Page 27: SF Ceph Users Jan. 2014

FOLDED CLOS


Sometimes called Fat Tree or Spine and Leaf

Minimum 4 fixed switches, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: folded Clos fabric with spine switches (S) interconnecting leaf switches (S), each leaf serving nodes 1..N]
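
A back-of-the-envelope sizing sketch for a non-blocking two-tier folded Clos built from one fixed switch model with p ports: each leaf splits its ports half down to nodes and half up to the spines, and every spine port feeds a distinct leaf, so capacity grows as p^2/2 node ports.

    def clos_size(p):
        """Non-blocking two-tier folded Clos from p-port fixed switches."""
        spines = p // 2              # one leaf uplink per spine
        leaves = p                   # each spine port feeds a distinct leaf
        node_ports = leaves * (p // 2)
        return spines, leaves, node_ports

    # With 32-port switches: up to 16 spines + 32 leaves = 48 switches and
    # 512 non-blocking node ports; smaller fabrics simply use fewer leaves.
    print(clos_size(32))             # -> (16, 32, 512)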

Page 28: SF Ceph Users Jan. 2014

Cluster Topologies

Page 29: SF Ceph Users Jan. 2014

REPLICA TOPOLOGIES


Replica and erasure fan-out

Recovery and remap impact on cluster bandwidth

OSD peering

Backfill served from primary

Tune backfills to avoid large fan-in (see the config sketch below)
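
A sketch of the ceph.conf throttles commonly tuned to keep recovery and backfill from saturating the cluster network during remaps (values are illustrative only):

    [osd]
        ; At most this many concurrent backfills per OSD.
        osd max backfills = 1
        ; Limit concurrent recovery operations per OSD.
        osd recovery max active = 1
        ; Deprioritize recovery ops relative to client I/O.
        osd recovery op priority = 1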

Page 30: SF Ceph Users Jan. 2014

FOLDED CLOS


Sometimes called Fat Tree or Spine and Leaf

Minimum 4 fixed switches, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: folded Clos fabric with spine switches (S) interconnecting leaf switches (S), each leaf serving nodes 1..N]

Page 31: SF Ceph Users Jan. 2014

N-WAY PARTIAL MESH


Page 32: SF Ceph Users Jan. 2014

EVALUATE


Replication

Erasure coding

Special purpose vs general purpose

Extra port cost


Page 33: SF Ceph Users Jan. 2014

Network Hardware

Page 34: SF Ceph Users Jan. 2014

FEATURES

Buffer sizes

Cut through vs store and forward

Oversubscribed vs non-blocking

Automation and monitoring


Page 35: SF Ceph Users Jan. 2014

FIXED


Large clusters can easily be built from fixed switches

Easier to source

Smaller failure domains

Fixed designs have many control planes

Virtual chassis... L3 split-brain hilarity?

Page 36: SF Ceph Users Jan. 2014

FEWER SKUs

Utilize as few vendor SKUs as possible

If permitted, use same fixed switch for spine and leaf

More affordable to keep spares on site, or to keep more of them

Quicker MTTR when replacement gear is ready to go

Page 37: SF Ceph Users Jan. 2014

Thanks to our host!


Page 38: SF Ceph Users Jan. 2014

Kyle Bader, Sr. Solutions Architect

[email protected]
