SF BAY AREA CEPH USERS GROUP
INAUGURAL MEETUP
Thursday, January 16, 14
AGENDA
Intro to Ceph
Ceph Networking
Public Topologies
Cluster Topologies
Network Hardware
THE FORECAST
By 2020, over 39 ZB of data will be stored; 1.5 ZB are stored today.
THE PROBLEM
Existing systems don’t scale
Increasing cost and complexity
Need to invest in new platforms ahead of time
(Chart: growth of data outpacing the IT storage budget, 2010-2020)
THE SOLUTION
PAST: SCALE UP
FUTURE: SCALE OUT
CEPH
INTRO TO CEPH
Distributed storage system
Horizontally scalable
No single point of failure
Self healing and self managing
Runs on commodity hardware
LGPLv2 license
ARCHITECTURE
SERVICE COMPONENTS (PART 1)

MONITOR
PAXOS for consensus
Maintains cluster state
Typically 3-5 nodes
NOT in the write path

OSD
Object storage interface
Gossips with peers
Data lives here
SERVICE COMPONENTS (PART 2)

RADOS GATEWAY
Provides S3/Swift compatibility
Scale out

METADATA (MDS)
Dynamic subtree partitioning
CRUSH
Ceph uses CRUSH for data placement
Aware of cluster topology
Statistically even distribution across a pool
Supports asymmetric nodes and devices
Hierarchical weighting
DATA PLACEMENT
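The placement pipeline can be sketched in miniature: Ceph hashes an object's name into a placement group (PG), and CRUSH then maps that PG to a set of OSDs. The sketch below uses a plain hash and a round-robin stand-in for CRUSH, both deliberate simplifications (real Ceph uses rjenkins hashing and a hierarchy-aware pseudo-random walk); the object name and pool settings are hypothetical.

```python
import hashlib

def pg_for_object(name: str, pg_num: int) -> int:
    """Hash an object name into a placement group id.
    Simplified: real Ceph uses rjenkins hashing and 'stable mod'."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % pg_num

def osds_for_pg(pg: int, osds: list, size: int) -> list:
    """Toy stand-in for CRUSH: deterministically pick `size` distinct
    OSDs from the pg id. Real CRUSH walks the cluster hierarchy and
    respects failure domains and weights."""
    chosen = []
    i = pg
    while len(chosen) < size:
        candidate = osds[i % len(osds)]
        if candidate not in chosen:
            chosen.append(candidate)
        i += 1
    return chosen

# Hypothetical pool: 128 PGs, 12 OSDs, 3 copies.
pg = pg_for_object("rbd_data.1000.0000", pg_num=128)
acting_set = osds_for_pg(pg, osds=list(range(12)), size=3)
```

The useful property to notice: both steps are pure functions of the inputs, so any client can compute placement locally without consulting a central lookup table.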
POOLS
Groupings of OSDs
Both physical and logical
Volumes / Images
Hot SSD pool
Cold SATA pool
DMCrypt pool
REPLICATION
Original data durability mechanism
Ceph creates N replicas of each RADOS object
Uses CRUSH to determine replica placement
Required for mutable objects (RBD, CephFS)
More reasonable for smaller installations
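The replica placement rule above (N replicas, placed by CRUSH) is easiest to see with a toy failure-domain example. The OSD-to-host map below is hypothetical and the selection walk is a simplification, not CRUSH itself; the point it illustrates is that no two replicas land on the same host.

```python
# Hypothetical map of OSD id -> host (two OSDs per host).
OSD_HOST = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}

def place_replicas(start_osd: int, n: int) -> list:
    """Pick n OSDs such that each lives on a distinct host,
    mimicking CRUSH's host-level failure domain separation."""
    placed, hosts = [], set()
    osd_ids = sorted(OSD_HOST)
    for step in range(len(osd_ids)):
        osd = osd_ids[(start_osd + step) % len(osd_ids)]
        if OSD_HOST[osd] not in hosts:
            placed.append(osd)
            hosts.add(OSD_HOST[osd])
        if len(placed) == n:
            break
    return placed

replicas = place_replicas(0, 3)  # three OSDs, one per host
```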
ERASURE CODING
(8, 4) MDS (maximum distance separable) code in this example
1.5x storage overhead
8 units of client data to write
4 parity units generated using FEC
All 12 units placed with CRUSH
Any 8 of the 12 units satisfy a read
Firefly release
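The 1.5x figure above follows directly from the code parameters: a (k, m) erasure code stores k + m units per k units of client data. A minimal arithmetic check:

```python
def ec_overhead(k: int, m: int) -> float:
    """Storage overhead of a (k, m) erasure code:
    total units stored divided by client data units."""
    return (k + m) / k

# The slide's (8, 4) example: 12 units stored per 8 units of data.
assert ec_overhead(8, 4) == 1.5
# For comparison, 3x replication is 1 data unit plus 2 copies:
assert ec_overhead(1, 2) == 3.0
```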
CLIENT COMPONENTS

NATIVE API
Mutable object store
Many language bindings
Object classes

CEPHFS
Linux kernel CephFS client since 2.6.34
FUSE client
Hadoop JNI bindings
CLIENT COMPONENTS

BLOCK STORAGE (RBD)
Linux kernel RBD client since 2.6.37
KVM/QEMU integration
Xen integration

S3/SWIFT
RESTful interfaces (HTTP)
CRUD operations
Usage accounting for billing
Ceph Networking
INFINIBAND
Currently only supported via IPoIB
Accelio (libxio) integration in Ceph is in early stages
Accelio supports multiple transports: RDMA, TCP, and shared memory
Accelio supports multiple RDMA transports (IB, RoCE, iWARP)
ETHERNET
Tried and true
Proven at scale
Economical
Many suitable vendors
10GbE or 1GbE
Cost of 10GbE trending downward
White box switches turning up heat on vendors
Twinax relatively inexpensive and low power
SFP+ is versatile with respect to distance
Single 10GbE for object
Dual 10GbE for block storage (public/cluster)
Bonding many 1GbE links adds lots of complexity
IPv4 or IPv6 Native
It’s 2014, is this really a question?
Ceph fully supports both modes of operation
Hierarchical allocation models allow "roll up" of routes
Optimal efficiency in RIB
Some tools believe the earth is flat
LAYER 2
Spanning tree
Switch table size
Broadcast domains (ARP)
MAC frame checksum
Storage protocols (FCoE, ATAoE)
TRILL, MLAG
Layer 2 DCI is crazy pants
Layer 2 tunneled over internet is super crazy pants
LAYER 3
Address and subnet planning
Proven scale at big web shops
IP checksum covers only the header; payload error detection falls to TCP
Equal cost multi-path (ECMP)
Reasonable for inter-site connectivity
Public Topologies
CLIENT TOPOLOGIES
Path diversity for resiliency
Minimize network diameter
Consistent hop count to minimize long-tail latency
Ease of scaling
Tolerate adversarial traffic patterns (fan-in/fan-out)
FOLDED CLOS
Sometimes called Fat Tree or Spine and Leaf
Minimum 4 fixed switches, grows to 10k+ node fabrics
Rack or cluster oversubscription possible
Non-blocking also possible
Path diversity
(Diagram: two-tier folded Clos, with spine switches uplinking a row of leaf switches, each leaf serving hosts 1..N)
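The fabric sizes claimed above ("minimum 4 fixed switches, grows to 10k+ node fabrics") follow from simple port arithmetic. A rough sizing sketch, assuming a single fixed switch model and a plain two-tier leaf-spine design (ignores chassis spines and three-tier fabrics):

```python
def clos_hosts(ports: int, oversub: float = 1.0) -> int:
    """Maximum hosts in a two-tier folded Clos built from one fixed
    switch model with `ports` ports each. `oversub` is the leaf
    downlink:uplink ratio (1.0 = non-blocking)."""
    uplinks = int(ports / (1 + oversub))  # uplink ports per leaf
    downlinks = ports - uplinks           # host-facing ports per leaf
    # One uplink per spine, and each spine port feeds one leaf,
    # so the fabric supports up to `ports` leaves.
    leaves = ports
    return leaves * downlinks

# Non-blocking fabric from 64-port switches: 64 leaves x 32 hosts.
assert clos_hosts(64) == 2048
# 3:1 oversubscribed at the leaf: 64 leaves x 48 hosts.
assert clos_hosts(64, oversub=3.0) == 3072
```

The smallest useful instance is the slide's "minimum 4": two spines and two leaves, grown incrementally by adding leaves until spine ports run out.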
Cluster Topologies
REPLICA TOPOLOGIES
Replica and erasure fan-out
Recovery and remap impact on cluster bandwidth
OSD peering
Backfill served from primary
Tune backfills to avoid large fan-in
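The fan-out and bandwidth bullets above have a simple arithmetic consequence: the primary OSD forwards every client write to the remaining replicas over the cluster network, so sustained client writes are multiplied on the back end. A back-of-the-envelope sketch with hypothetical numbers (recovery and backfill traffic come on top of this):

```python
def cluster_write_bw(client_bw_gbps: float, size: int) -> float:
    """Cluster-network bandwidth generated by replication fan-out:
    the primary sends each write to the other (size - 1) replicas."""
    return client_bw_gbps * (size - 1)

# 10 Gb/s of client writes with 3x replication puts roughly
# 20 Gb/s on the cluster network before any recovery traffic.
assert cluster_write_bw(10, 3) == 20
```

This multiplier is one reason the deck recommends a dedicated cluster network (dual 10GbE) for block storage.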
FOLDED CLOS
(Same spine-and-leaf fabric as described under Public Topologies: minimum 4 fixed switches growing to 10k+ node fabrics, rack or cluster oversubscription or fully non-blocking, with path diversity.)
N-WAY PARTIAL MESH
EVALUATE
Replication
Erasure coding
Special purpose vs general purpose
Extra port cost
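Weighing replication against erasure coding, as the evaluation bullets suggest, often starts with raw-capacity arithmetic. A small sketch, with the target capacity purely hypothetical:

```python
def raw_needed(usable_tb: float, scheme: tuple) -> float:
    """Raw capacity required for a target usable capacity.
    scheme = (k, m): k data units plus m redundancy units."""
    k, m = scheme
    return usable_tb * (k + m) / k

# For a hypothetical 1 PB (1000 TB) usable target:
assert raw_needed(1000, (1, 2)) == 3000.0  # 3x replication
assert raw_needed(1000, (8, 4)) == 1500.0  # (8, 4) erasure code
```

The capacity savings are only half the evaluation: erasure-coded reads and recovery touch more nodes, which feeds back into the fan-in and extra-port-cost considerations above.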
Network Hardware
FEATURES
Buffer sizes
Cut through vs store and forward
Oversubscribed vs non-blocking
Automation and monitoring
FIXED
Fixed switches can easily build large clusters
Easier to source
Smaller failure domains
Fixed designs have many control planes
Virtual chassis... L3 split-brain hilarity?
LESS SKU
Utilize as few vendor SKUs as possible
If permitted, use same fixed switch for spine and leaf
More affordable to keep spares, or more of them, on site
Quicker MTTR when gear is ready to go
Thanks to our host!
Kyle Bader, Sr. Solutions Architect
kyle@inktank.com