SF Ceph Users Jan. 2014

Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed to facilitate easy scaling without service disruption. After an introduction to Ceph itself, this talk will dive into the design of Ceph client and cluster network topologies.

Transcript of SF Ceph Users Jan. 2014

SF BAY AREA CEPH USERS GROUP

INAUGURAL MEETUP

Thursday, January 16, 2014

AGENDA

Intro to Ceph

Ceph Networking

Public Topologies

Cluster Topologies

Network Hardware

THE FORECAST

By 2020, over 39 ZB of data will be stored. 1.5 ZB are stored today.

THE PROBLEM

Existing systems don’t scale

Increasing cost and complexity

Need to invest in new platforms ahead of time

[Chart: growth of data versus IT storage budget, 2010-2020]

THE SOLUTION

PAST: SCALE UP

FUTURE: SCALE OUT

CEPH

INTRO TO CEPH

Distributed storage system

Horizontally scalable

No single point of failure

Self healing and self managing

Runs on commodity hardware

LGPLv2 license

ARCHITECTURE

SERVICE COMPONENTS (PART 1)

MONITOR

Paxos for consensus

Maintains cluster state

Typically 3-5 nodes

NOT in the write path

OSD

Object storage interface

Gossips with peers

Data lives here
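
The split between monitors and OSDs is visible from the client side: a librados client first contacts the monitors to learn the cluster maps, and only then talks directly to OSDs for data. A minimal sketch using the Python rados bindings (the conffile path is an assumption):

    import rados

    # Connecting contacts the monitors named in ceph.conf and pulls down
    # the cluster maps; no object data is read or written yet.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Cluster-wide usage statistics; no object data is transferred.
    stats = cluster.get_cluster_stats()
    print("kB used:", stats['kb_used'], "objects:", stats['num_objects'])

    cluster.shutdown()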

SERVICE COMPONENTS (PART 2)

RADOS GATEWAY

Provides S3/Swift compatibility

Scale out

METADATA (MDS)

Dynamic subtree partitioning

CRUSH

Ceph uses CRUSH for data placement

Aware of the cluster topology

Statistically even distribution across a pool

Supports asymmetric nodes and devices

Hierarchical weighting
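
As a rough illustration of the idea (this is not the real CRUSH algorithm), placement can be thought of as hashing an object name to a placement group and then deterministically picking distinct OSDs, biased by their weights. A toy Python sketch with made-up OSD names and weights:

    import hashlib
    import random

    # Hypothetical flat view of a CRUSH hierarchy: OSD name -> weight.
    OSDS = {'osd.0': 1.0, 'osd.1': 1.0, 'osd.2': 2.0, 'osd.3': 2.0}
    PG_NUM = 64
    REPLICAS = 2

    def place(obj_name):
        # Hash the object name to a placement group (stand-in hash;
        # Ceph uses its own hash function).
        pg = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % PG_NUM
        # Deterministic, weight-biased choice of distinct OSDs for the PG;
        # real CRUSH walks the bucket hierarchy instead.
        rng = random.Random(pg)
        remaining = dict(OSDS)
        chosen = []
        for _ in range(REPLICAS):
            names = list(remaining)
            pick = rng.choices(names, weights=[remaining[n] for n in names])[0]
            chosen.append(pick)
            del remaining[pick]
        return pg, chosen

    print(place('rbd_data.1234'))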

DATA PLACEMENT

POOLS

Groupings of OSDs

Both physical and logical

Volumes / Images

Hot SSD pool

Cold SATA pool

DMCrypt pool

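Pools can also be managed programmatically; a small sketch with the Python rados bindings (the pool names below are only illustrations, and CRUSH rules decide which OSDs actually back each pool):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # assumed path
    cluster.connect()

    # Hypothetical pools echoing the examples above.
    for pool in ('rbd-volumes', 'hot-ssd', 'cold-sata'):
        if not cluster.pool_exists(pool):
            cluster.create_pool(pool)

    print(cluster.list_pools())
    cluster.shutdown()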

REPLICATION

Original data durability mechanism

Ceph creates N replicas of each RADOS object

Uses CRUSH to determine replica placement

Required for mutable objects (RBD, CephFS)

More reasonable for smaller installations

ERASURE CODING

(8:4) MDS (maximum distance separable) code in this example

1.5x storage overhead

8 units of client data to write

4 parity units generated using forward error correction (FEC)

All 12 units placed with CRUSH

Any 8 of the 12 units can satisfy a read

Landing in the Firefly release
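
The overhead arithmetic generalizes: a k:m code stores k + m units for every k units of data, so overhead is (k + m) / k, and any k surviving units can reconstruct the data. A quick sketch:

    # Storage overhead for a k-data / m-parity erasure code.
    def ec_overhead(k, m):
        return (k + m) / k

    print(ec_overhead(8, 4))  # the 8:4 example above -> 1.5
    print(ec_overhead(1, 2))  # 3-way replication viewed the same way -> 3.0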

CLIENT COMPONENTS

Native API

Mutable object store

Many language bindings

Object classes

CephFS

Linux kernel CephFS client since 2.6.34

FUSE client

Hadoop JNI bindings
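
The native API is librados; a minimal Python sketch of writing and reading a mutable object (the pool name is an assumption, and the pool must already exist):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # assumed path
    cluster.connect()

    # I/O context bound to a (hypothetical) pool.
    ioctx = cluster.open_ioctx('rbd-volumes')

    # RADOS objects are mutable: write, overwrite, and read in place.
    ioctx.write_full('greeting', b'hello ceph')
    print(ioctx.read('greeting'))

    ioctx.close()
    cluster.shutdown()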

CLIENT COMPONENTS

S3/Swift

RESTful interfaces (HTTP)

CRUD operations

Usage accounting for billing

Block Storage

Linux kernel RBD client since 2.6.37+

KVM/QEMU integration

Xen integration
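
Because the RADOS Gateway speaks the S3 API, ordinary S3 client libraries work against it; a sketch using boto (the endpoint host and credentials are placeholders):

    import boto
    import boto.s3.connection

    # Hypothetical RGW endpoint and credentials.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com', port=80, is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')
    print([k.name for k in bucket.list()])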

Ceph Networking

INFINIBAND

Currently only supported via IPoIB

Accelio (libxio) integration in Ceph is in early stages

Accelio supports multiple transports: RDMA, TCP, and shared memory

Accelio supports multiple RDMA transports (IB, RoCE, iWARP)

ETHERNET

Tried and true

Proven at scale

Economical

Many suitable vendors

10GbE or 1GbE

Cost of 10GbE trending downward

White-box switches are turning up the heat on vendors

Twinax relatively inexpensive and low power

SFP+ is versatile with respect to distance

Single 10GbE for object storage

Dual 10GbE for block storage (public/cluster)

Bonding many 1GbE links adds lots of complexity

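The public/cluster split for block storage maps onto two ceph.conf options, "public network" and "cluster network". A sketch that generates such a fragment with Python's configparser (the subnets are placeholders):

    import configparser

    # ceph.conf fragment separating client-facing and replication traffic.
    conf = configparser.ConfigParser()
    conf['global'] = {
        'public network': '10.1.0.0/24',   # client and monitor traffic (example subnet)
        'cluster network': '10.2.0.0/24',  # replication and backfill traffic (example subnet)
    }

    with open('ceph.conf.sample', 'w') as f:
        conf.write(f)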

IPv4 or IPv6 Native

It’s 2014, is this really a question?

Ceph fully supports both modes of operation

Hierarchical allocation models allow "roll up" of routes

Optimal efficiency in the RIB

Some tools believe the earth is flat

LAYER 2

Spanning tree

Switch table size

Broadcast domains (ARP)

MAC frame checksum

Storage protocols (FCoE, ATAoE)

TRILL, MLAG

Layer 2 DCI is crazy pants

Layer 2 tunneled over internet is super crazy pants

LAYER 3

Address and subnet planning

Proven scale at big web shops

End-to-end error detection relies only on the TCP checksum

Equal cost multi-path (ECMP)

Reasonable for inter-site connectivity

Public Topologies

CLIENT TOPOLOGIES

Path diversity for resiliency

Minimize network diameter

Consistent hop count to minimize long-tail latency

Ease of scaling

Tolerate adversarial traffic patterns (fan-in/fan-out)

FOLDED CLOS

Sometimes called Fat Tree or Spine and Leaf

Minimum 4 fixed switches, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: two-tier folded Clos fabric with spine switches (S) over leaf switches (S), each leaf serving hosts 1..N]
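
For a sense of scale, a non-blocking two-tier folded Clos built from identical fixed switches can be sized with simple arithmetic; a sketch (the switch radixes are just examples):

    # Non-blocking two-tier leaf/spine sizing from identical fixed switches.
    def clos_hosts(ports):
        spines = ports // 2          # each leaf uses half its ports as uplinks, one per spine
        leaves = ports               # each spine port connects to one leaf
        hosts_per_leaf = ports // 2  # remaining leaf ports face hosts
        return spines, leaves, leaves * hosts_per_leaf

    for p in (48, 64):
        spines, leaves, hosts = clos_hosts(p)
        print(f"{p}-port switches: {spines} spines, {leaves} leaves, {hosts} hosts")

    # Oversubscribing leaf uplinks, or adding a third tier, pushes the
    # fabric toward the 10k+ node scale mentioned above.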

Cluster Topologies

REPLICA TOPOLOGIES

Replica and erasure fan-out

Recovery and remap impact on cluster bandwidth

OSD peering

Backfill served from primary

Tune backfills to avoid large fan-in

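Backfill fan-in can be bounded with OSD configuration options; a sketch that generates a ceph.conf fragment with Python's configparser ("osd max backfills" and "osd recovery max active" are real Ceph settings, the values here are only illustrative):

    import configparser

    conf = configparser.ConfigParser()
    conf['osd'] = {
        # Limit concurrent backfills per OSD to cap recovery fan-in.
        'osd max backfills': '1',
        # Throttle recovery operations competing with client I/O.
        'osd recovery max active': '1',
    }

    with open('ceph.conf.sample', 'w') as f:
        conf.write(f)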

FOLDED CLOS

Sometimes called Fat Tree or Spine and Leaf

Minimum 4, grows to 10k+ node fabrics

Rack or cluster oversubscription possible

Non-blocking also possible

Path diversity

[Diagram: two-tier folded Clos fabric, as above]

N-WAY PARTIAL MESH

EVALUATE

Replication

Erasure coding

Special purpose vs general purpose

Extra port cost

Network Hardware

FEATURES

Buffer sizes

Cut through vs store and forward

Oversubscribed vs non-blocking

Automation and monitoring

FIXED

Fixed switches can easily build large clusters

Easier to source

Smaller failure domains

Fixed designs have many control planes

Virtual chassis... L3 split-brain hilarity?

FEWER SKUs

Utilize as few vendor SKUs as possible

If permitted, use the same fixed switch model for spine and leaf

More affordable to keep spares on site, or to keep more of them

Shorter MTTR when replacement gear is ready to go

Thanks to our host!

Kyle Bader, Sr. Solutions Architect

kyle@inktank.com
