Storage tiering and erasure coding in Ceph (SCaLE13x)


Transcript of Storage tiering and erasure coding in Ceph (SCaLE13x)

Page 1: Storage tiering and erasure coding in Ceph (SCaLE13x)

ERASURE CODING AND CACHE TIERING

SAGE WEIL – SCALE13X - 2015.02.22

Page 2: Storage tiering and erasure coding in Ceph (SCaLE13x)


AGENDA

● Ceph architectural overview

● RADOS background

● Cache tiering

● Erasure coding

● Project status, roadmap

Page 3: Storage tiering and erasure coding in Ceph (SCaLE13x)

ARCHITECTURE

Page 4: Storage tiering and erasure coding in Ceph (SCaLE13x)


CEPH MOTIVATING PRINCIPLES

● All components must scale horizontally

● There can be no single point of failure

● The solution must be hardware agnostic

● Should use commodity hardware

● Self-manage whenever possible

● Open source (LGPL)

● Move beyond legacy approaches

– Client/cluster instead of client/server

– Ad hoc HA

Page 5: Storage tiering and erasure coding in Ceph (SCaLE13x)

CEPH COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: A reliable, fully-distributed block device with cloud platform integration

CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram labels: APP, HOST/VM, CLIENT)

Page 6: Storage tiering and erasure coding in Ceph (SCaLE13x)

ROBUST SERVICES BUILT ON RADOS

Page 7: Storage tiering and erasure coding in Ceph (SCaLE13x)

ARCHITECTURAL COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: A reliable, fully-distributed block device with cloud platform integration

CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram labels: APP, HOST/VM, CLIENT)

Page 8: Storage tiering and erasure coding in Ceph (SCaLE13x)

THE RADOS GATEWAY

(Diagram: applications speak REST to RADOSGW instances; each RADOSGW is built on LIBRADOS and talks over a socket to the RADOS cluster and its monitors.)

Page 9: Storage tiering and erasure coding in Ceph (SCaLE13x)

MULTI-SITE OBJECT STORAGE

(Diagram: two sites, US-EAST and EU-WEST; at each site a web application and app server talk to a Ceph Object Gateway (RGW) backed by that site's Ceph storage cluster.)

Page 10: Storage tiering and erasure coding in Ceph (SCaLE13x)

RADOSGW MAKES RADOS WEBBY

RADOSGW:

● REST-based object storage proxy

● Uses RADOS to store objects

– Stripes large RESTful objects across many RADOS objects

● API supports buckets, accounts

● Usage accounting for billing

● Compatible with S3 and Swift applications (see the sketch below)
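A minimal sketch of that S3 compatibility using the Python boto library against an RGW endpoint; the host, port, credentials, and bucket name are placeholders, not values from this talk.

```python
import boto
import boto.s3.connection

# Hypothetical RGW endpoint and credentials; substitute your own.
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com', port=7480,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('demo-bucket')   # bucket-level S3 API
key = bucket.new_key('hello.txt')
key.set_contents_from_string('stored via RGW, striped across RADOS objects')
print(key.get_contents_as_string())
```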

Page 11: Storage tiering and erasure coding in Ceph (SCaLE13x)

ARCHITECTURAL COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: A reliable, fully-distributed block device with cloud platform integration

CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram labels: APP, HOST/VM, CLIENT)

Page 12: Storage tiering and erasure coding in Ceph (SCaLE13x)

STORING VIRTUAL DISKS

(Diagram: a VM runs on a hypervisor that uses LIBRBD to store the virtual disk in the RADOS cluster.)

Page 13: Storage tiering and erasure coding in Ceph (SCaLE13x)

KERNEL MODULE

(Diagram: a Linux host uses the KRBD kernel module to access RBD images in the RADOS cluster directly.)

Page 14: Storage tiering and erasure coding in Ceph (SCaLE13x)


RBD FEATURES

● Stripe images across entire cluster (pool)

● Read-only snapshots

● Copy-on-write clones

● Broad integration

– Qemu

– Linux kernel

– iSCSI (STGT, LIO)

– OpenStack, CloudStack, Nebula, Ganeti, Proxmox

● Incremental backup (relative to snapshots)
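A minimal sketch of the image and snapshot features above using the python-rados and python-rbd bindings; the pool name 'rbd', the image name, and the snapshot name are assumptions.

```python
import rados
import rbd

# Connect to the cluster and open the (assumed) 'rbd' pool.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Create a 1 GiB image striped across the pool, write to it, snapshot it.
rbd.RBD().create(ioctx, 'vm-disk-1', 1 * 1024**3)
with rbd.Image(ioctx, 'vm-disk-1') as image:
    image.write(b'hello block device', 0)
    image.create_snap('before-upgrade')   # read-only snapshot

ioctx.close()
cluster.shutdown()
```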

Page 15: Storage tiering and erasure coding in Ceph (SCaLE13x)

ARCHITECTURAL COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: A reliable, fully-distributed block device with cloud platform integration

CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram labels: APP, HOST/VM, CLIENT)

Page 16: Storage tiering and erasure coding in Ceph (SCaLE13x)

SEPARATE METADATA SERVER

(Diagram: a Linux host with the CephFS kernel module; metadata and file data take separate paths into the RADOS cluster.)

Page 17: Storage tiering and erasure coding in Ceph (SCaLE13x)

SCALABLE METADATA SERVERS

METADATA SERVER

● Manages metadata for a POSIX-compliant shared filesystem

– Directory hierarchy

– File metadata (owner, timestamps, mode, etc.)

– Snapshots on any directory

● Clients stripe file data in RADOS

– MDS not in data path

● MDS stores metadata in RADOS

● Dynamic MDS cluster scales to 10s or 100s

● Only required for shared filesystem

Page 18: Storage tiering and erasure coding in Ceph (SCaLE13x)

RADOS

Page 19: Storage tiering and erasure coding in Ceph (SCaLE13x)

ARCHITECTURAL COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: A reliable, fully-distributed block device with cloud platform integration

CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram labels: APP, HOST/VM, CLIENT)

Page 20: Storage tiering and erasure coding in Ceph (SCaLE13x)


RADOS

● Flat object namespace within each pool

● Rich object API (librados)

– Bytes, attributes, key/value data

– Partial overwrite of existing data (mutable objects)

– Single-object compound operations

– RADOS classes (stored procedures)

● Strong consistency (CP system)

● Infrastructure aware, dynamic topology

● Hash-based placement (CRUSH)

● Direct client to server data path
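A minimal librados sketch of the bytes/attributes/partial-overwrite API listed above, using the Python binding; the pool name 'mypool' is a placeholder and a reachable cluster with a default ceph.conf is assumed.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')        # hypothetical pool name

# Bytes, partial overwrite, and attributes on the same object.
ioctx.write_full('greeting', b'hello world')
ioctx.write('greeting', b'HELLO', 0)        # mutate the first five bytes in place
ioctx.set_xattr('greeting', 'lang', b'en')

print(ioctx.read('greeting'))               # b'HELLO world'
print(ioctx.get_xattr('greeting', 'lang'))  # b'en'

ioctx.close()
cluster.shutdown()
```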

Page 21: Storage tiering and erasure coding in Ceph (SCaLE13x)

RADOS CLUSTER

(Diagram: an application talks directly to a RADOS cluster made up of many OSDs plus a small number of monitors.)

Page 22: Storage tiering and erasure coding in Ceph (SCaLE13x)

RADOS COMPONENTS

OSDs:

● 10s to 1000s in a cluster

● One per disk (or one per SSD, RAID group…)

● Serve stored objects to clients

● Intelligently peer for replication & recovery

Monitors:

● Maintain cluster membership and state

● Provide consensus for distributed decision-making

● Small, odd number (e.g., 5)

● Not part of data path

Page 23: Storage tiering and erasure coding in Ceph (SCaLE13x)

OBJECT STORAGE DAEMONS

(Diagram: each OSD daemon sits on a local filesystem (xfs, btrfs, or ext4) on its own disk; monitors (M) run alongside the OSDs.)

Page 24: Storage tiering and erasure coding in Ceph (SCaLE13x)

DATA PLACEMENT

Page 25: Storage tiering and erasure coding in Ceph (SCaLE13x)

WHERE DO OBJECTS LIVE?

(Diagram: an application holds an object; which OSD in the cluster should store it?)

Page 26: Storage tiering and erasure coding in Ceph (SCaLE13x)

A METADATA SERVER?

(Diagram: one option: the application first asks a metadata service where the object lives (1), then reads or writes it at that location (2).)

Page 27: Storage tiering and erasure coding in Ceph (SCaLE13x)

CALCULATED PLACEMENT

(Diagram: the alternative: the application computes placement with a function F, e.g., mapping names into ranges such as A-G, H-N, O-T, and U-Z.)

Page 28: Storage tiering and erasure coding in Ceph (SCaLE13x)

CRUSH

(Diagram: objects are hashed into PLACEMENT GROUPS (PGs), and CRUSH maps each PG onto devices in the CLUSTER.)

Page 29: Storage tiering and erasure coding in Ceph (SCaLE13x)

CRUSH IS A QUICK CALCULATION

(Diagram: the location of an object in the RADOS cluster is computed on the client with a quick calculation, not looked up.)

Page 30: Storage tiering and erasure coding in Ceph (SCaLE13x)

CRUSH AVOIDS FAILED DEVICES

(Diagram: when an OSD fails, CRUSH maps the object's PG onto a different, healthy OSD.)

Page 31: Storage tiering and erasure coding in Ceph (SCaLE13x)

CRUSH: DECLUSTERED PLACEMENT

● Each PG independently maps to a pseudorandom set of OSDs

● PGs that map to the same OSD generally have replicas that do not

● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD

– Highly parallel recovery

– Avoids the single-disk recovery bottleneck

Page 32: Storage tiering and erasure coding in Ceph (SCaLE13x)

CRUSH: DYNAMIC DATA PLACEMENT

CRUSH: pseudo-random placement algorithm

● Fast calculation, no lookup

● Repeatable, deterministic

● Statistically uniform distribution

● Stable mapping

– Limited data migration on change

● Rule-based configuration

– Infrastructure topology aware

– Adjustable replication

– Weighted devices (different sizes)
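To make the "fast, repeatable, no lookup" idea concrete, here is a toy stand-in for CRUSH in Python. It is purely illustrative: real CRUSH also honors the cluster hierarchy, device weights, and placement rules listed above, and it limits data migration when the cluster changes, which this toy does not.

```python
import hashlib
import random

def object_to_pg(obj_name: str, pg_num: int) -> int:
    """Stable hash of the object name into one of pg_num placement groups."""
    digest = hashlib.md5(obj_name.encode()).hexdigest()
    return int(digest, 16) % pg_num

def pg_to_osds(pg_id: int, osds: list, replicas: int = 3) -> list:
    """Toy placement: a pseudo-random but repeatable choice seeded by the PG id."""
    rng = random.Random(pg_id)          # every client computes the same answer
    return rng.sample(osds, replicas)

osds = list(range(12))                  # pretend cluster of 12 OSDs
pg = object_to_pg('rbd_data.1234.0000000000000001', pg_num=128)
print(pg, pg_to_osds(pg, osds))         # same output on every host, no lookup table
```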

Page 33: Storage tiering and erasure coding in Ceph (SCaLE13x)

DATA IS ORGANIZED INTO POOLS

(Diagram: objects are hashed into PGs within each of several POOLS (POOL A, B, C, D); each pool has its own set of PGs, which CRUSH maps onto the CLUSTER.)

Page 34: Storage tiering and erasure coding in Ceph (SCaLE13x)

TIERED STORAGE

Page 35: Storage tiering and erasure coding in Ceph (SCaLE13x)

TWO WAYS TO CACHE

● Within each OSD

– Combine SSD and HDD under each OSD

– Make localized promote/demote decisions

– Leverage existing tools

● dm-cache, bcache, FlashCache

● Variety of caching controllers

– We can help with hints

● Cache on separate devices/nodes

– Different hardware for different tiers

● Slow nodes for cold data

● High performance nodes for hot data

– Add, remove, scale each tier independently

● Unlikely to choose right ratios at procurement time

(Diagram: an OSD whose filesystem sits on a block device that combines an SSD cache with an HDD.)

Page 36: Storage tiering and erasure coding in Ceph (SCaLE13x)

TIERED STORAGE

(Diagram: an APPLICATION talks to a replicated CACHE POOL layered over an erasure coded BACKING POOL, both inside one CEPH STORAGE CLUSTER.)

Page 37: Storage tiering and erasure coding in Ceph (SCaLE13x)


RADOS TIERING PRINCIPLES

● Each tier is a RADOS pool

– May be replicated or erasure coded

● Tiers are durable

– e.g., replicate across SSDs in multiple hosts

● Each tier has its own CRUSH policy

– e.g., map cache pool to SSD devices/hosts only

● librados adapts to tiering topology

– Transparently direct requests accordingly

● e.g., to cache

– No changes to RBD, RGW, CephFS, etc.
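As a sketch of how these principles are wired up in practice, the commands below (wrapped in a small Python helper so the examples in this transcript stay in one language) layer a replicated cache pool over an erasure coded base pool. The pool names 'hot-cache' and 'cold-ec' are hypothetical and both pools are assumed to already exist.

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command; assumes ceph.conf and a client keyring are in place."""
    subprocess.run(('ceph',) + args, check=True)

# Hypothetical pools: 'cold-ec' (erasure coded base) and 'hot-cache' (replicated, on SSDs).
ceph('osd', 'tier', 'add', 'cold-ec', 'hot-cache')          # attach the cache tier
ceph('osd', 'tier', 'cache-mode', 'hot-cache', 'writeback') # absorb reads and writes in the cache
ceph('osd', 'tier', 'set-overlay', 'cold-ec', 'hot-cache')  # redirect client I/O for the base pool to the cache
```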

Page 38: Storage tiering and erasure coding in Ceph (SCaLE13x)

READ (CACHE HIT)

(Diagram: the client's READ is served by the writeback CACHE POOL (SSD), which returns the READ REPLY; the BACKING POOL (HDD) is not touched.)

Page 39: Storage tiering and erasure coding in Ceph (SCaLE13x)

READ (CACHE MISS)

(Diagram: on a miss, the cache pool issues a PROXY READ to the backing pool and returns the READ REPLY to the client without promoting the object.)

Page 40: Storage tiering and erasure coding in Ceph (SCaLE13x)

READ (CACHE MISS)

(Diagram: alternatively, the miss triggers a PROMOTE: the object is copied from the backing pool into the cache pool, which then returns the READ REPLY.)

Page 41: Storage tiering and erasure coding in Ceph (SCaLE13x)

WRITE (HIT)

(Diagram: the client's WRITE is applied in the writeback cache pool and ACKed from there.)

Page 42: Storage tiering and erasure coding in Ceph (SCaLE13x)

WRITE (MISS)

(Diagram: on a write miss, the object is PROMOTEd from the backing pool into the cache pool, where the WRITE is applied and ACKed.)

Page 43: Storage tiering and erasure coding in Ceph (SCaLE13x)

WRITE (MISS) (COMING SOON)

(Diagram: a planned alternative: the cache pool PROXY WRITEs the update to the backing pool and ACKs, without promoting the object.)

Page 44: Storage tiering and erasure coding in Ceph (SCaLE13x)


ESTIMATING TEMPERATURE

● Each PG constructs in-memory bloom filters

– Insert records on both read and write

– Each filter covers configurable period (e.g., 1 hour)

– Tunable false positive probability (e.g., 5%)

– Store most recent N periods on disk (e.g., last 24 hours)

● Estimate temperature

– Has object been accessed in any of the last N periods?

– ...in how many of them?

– Informs the flush/evict decision

● Estimate “recency”

– How many periods has it been since the object was last accessed?

– Informs read miss behavior: proxy vs promote
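The hit sets described above are per-pool settings; here is a sketch matching this slide's example numbers (1-hour periods, 24 periods kept), reusing the hypothetical 'hot-cache' pool from the earlier tiering sketch. The recency option on the last line is the knob that steers the proxy-vs-promote decision.

```python
import subprocess

def ceph(*args):
    subprocess.run(('ceph',) + args, check=True)

# 'hot-cache' is the hypothetical cache pool from the earlier sketch.
ceph('osd', 'pool', 'set', 'hot-cache', 'hit_set_type', 'bloom')    # in-memory bloom filters
ceph('osd', 'pool', 'set', 'hot-cache', 'hit_set_period', '3600')   # each filter covers 1 hour
ceph('osd', 'pool', 'set', 'hot-cache', 'hit_set_count', '24')      # keep the last 24 periods
ceph('osd', 'pool', 'set', 'hot-cache', 'min_read_recency_for_promote', '2')  # promote only if seen recently
```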

Page 45: Storage tiering and erasure coding in Ceph (SCaLE13x)

FLUSH AND/OR EVICT COLD DATA

(Diagram: the cache pool FLUSHes dirty objects down to the backing pool, which ACKs, and then EVICTs the clean copies from the cache.)

Page 46: Storage tiering and erasure coding in Ceph (SCaLE13x)


TIERING AGENT

● Each PG has an internal tiering agent

– Manages PG based on administrator defined policy

● Flush dirty objects

– When pool reaches target dirty ratio

– Tries to select cold objects

– Marks objects clean when they have been written back to the base pool

● Evict (delete) clean objects

– Greater “effort” as cache pool approaches target size
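A sketch of the administrator-defined policy the agent acts on, again for the hypothetical 'hot-cache' pool; the specific size and ratios here are illustrative, not recommendations from the talk.

```python
import subprocess

def ceph(*args):
    subprocess.run(('ceph',) + args, check=True)

ceph('osd', 'pool', 'set', 'hot-cache', 'target_max_bytes', str(100 * 1024**3))  # size the cache at ~100 GiB
ceph('osd', 'pool', 'set', 'hot-cache', 'cache_target_dirty_ratio', '0.4')       # start flushing at 40% dirty
ceph('osd', 'pool', 'set', 'hot-cache', 'cache_target_full_ratio', '0.8')        # evict harder beyond 80% full
```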

Page 47: Storage tiering and erasure coding in Ceph (SCaLE13x)

CACHE TIER USAGE

● Cache tier should be faster than the base tier

● Cache tier should be replicated (not erasure coded)

● Promote and flush are expensive

– Best results when object temperatures are skewed

● Most I/O goes to a small number of hot objects

– Cache should be big enough to capture most of the working set

● Challenging to benchmark

– Need a realistic workload (e.g., not 'dd') to determine how it will perform in practice

– Takes a long time to “warm up” the cache

Page 48: Storage tiering and erasure coding in Ceph (SCaLE13x)

ERASURE CODING

Page 49: Storage tiering and erasure coding in Ceph (SCaLE13x)

ERASURE CODING

(Diagram: the same OBJECT stored in a REPLICATED POOL as full copies, and in an ERASURE CODED POOL as data shards 1-4 plus coding shards X and Y.)

REPLICATED POOL:

● Full copies of stored objects

● Very high durability

● 3x (200% overhead)

● Quicker recovery

ERASURE CODED POOL:

● One copy plus parity

● Cost-effective durability

● 1.5x (50% overhead)

● Expensive recovery
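The overhead figures on this slide fall out of a one-line calculation: a k+m erasure code stores (k+m)/k bytes of raw data per byte of user data, while n-way replication stores n. A quick check of the numbers quoted above:

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a k+m erasure code."""
    return (k + m) / k

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data for n-way replication."""
    return float(copies)

print(replication_overhead(3))  # 3.0x -> 200% overhead
print(ec_overhead(2, 1))        # 1.5x -> 50% overhead, as on the slide
print(ec_overhead(4, 2))        # also 1.5x, but tolerates any two failures
```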

Page 50: Storage tiering and erasure coding in Ceph (SCaLE13x)

ERASURE CODING SHARDS

(Diagram: an OBJECT in the erasure coded pool is split across OSDs holding data shards 1-4 and coding shards X and Y.)

Page 51: Storage tiering and erasure coding in Ceph (SCaLE13x)

ERASURE CODING SHARDS

(Diagram: the object's data is striped across the data shards: OSD 1 holds chunks 0, 4, 8, 12, 16; OSD 2 holds 1, 5, 9, 13, 17; OSD 3 holds 2, 6, 10, 14, 18; OSD 4 holds 3, 7, 11, 15, 19; OSDs X and Y hold the corresponding coding chunks A-E and A'-E'.)

● Variable stripe size (e.g., 4 KB)

● Zero-fill shards (logically) in partial tail stripe

Page 52: Storage tiering and erasure coding in Ceph (SCaLE13x)

PRIMARY COORDINATES

(Diagram: one OSD in the erasure coded pool acts as the PG's primary and coordinates I/O across the data and coding shards.)

Page 53: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC READ

(Diagram: the CEPH CLIENT sends a READ for an object in the erasure coded pool to the PG's primary OSD.)

Page 54: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC READ

(Diagram: the primary issues READS to enough shard OSDs to reconstruct the requested data.)

Page 55: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC READ

(Diagram: the primary reassembles the data and returns the READ REPLY to the client.)

Page 56: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE

(Diagram: the CEPH CLIENT sends a WRITE to the primary OSD of the erasure coded PG.)

Page 57: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE

(Diagram: the primary encodes the update and sends WRITES to all of the shard OSDs.)

Page 58: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE

(Diagram: once the shards are written, the primary returns the WRITE ACK to the client.)

Page 59: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE: DEGRADED

(Diagram: with one shard OSD down, the primary still sends WRITES to the remaining shards and the write completes degraded.)

Page 60: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE: PARTIAL FAILURE

(Diagram: a failure mid-write means the WRITES reach only some of the shard OSDs.)

Page 61: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE: PARTIAL FAILURE

(Diagram: the result: some shards hold the new version B while the others still hold the old version A.)

Page 62: Storage tiering and erasure coding in Ceph (SCaLE13x)


EC RESTRICTIONS

● Overwrite in place will not work in general

● Log and 2PC would increase complexity, latency

● We chose to restrict allowed operations

– create

– append (on stripe boundary)

– remove (keep previous generation of object for some time)

● These operations can all easily be rolled back locally

– create → delete

– append → truncate

– remove → roll back to previous generation

● Object attrs preserved in existing PG logs (they are small)

● Key/value data is not allowed on EC pools

Page 63: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE: PARTIAL FAILURE

(Diagram: returning to the partial write: the shards still disagree, with B on some OSDs and A on the others.)

Page 64: Storage tiering and erasure coding in Ceph (SCaLE13x)

EC WRITE: PARTIAL FAILURE

(Diagram: recovery rolls the partially-applied write back locally, so every shard ends up at the previous version A.)

Page 65: Storage tiering and erasure coding in Ceph (SCaLE13x)


EC RESTRICTIONS

● This is a small subset of allowed librados operations

– Notably cannot (over)write any extent

● Coincidentally, unsupported operations are also inefficient for erasure codes

– Generally require read/modify/write of affected stripe(s)

● Some can consume EC directly

– RGW (no object data update in place)

● Others can combine EC with a cache tier (RBD, CephFS)

– Replication for warm/hot data

– Erasure coding for cold data

– Tiering agent skips objects with key/value data

Page 66: Storage tiering and erasure coding in Ceph (SCaLE13x)


WHICH ERASURE CODE?

● The EC algorithm and implementation are pluggable

– jerasure/gf-complete (free, open, and very fast)

– ISA-L (Intel library; optimized for modern Intel procs)

– LRC (local recovery code – layers over existing plugins)

– SHEC (trades extra storage for recovery efficiency – new from Fujitsu)

● Parameterized

– Pick “k” and “m”, stripe size

● OSD handles data path, placement, rollback, etc.

● Erasure plugin handles

– Encode and decode math

– Given these available shards, which ones should I fetch to satisfy a read?

– Given these available shards and these missing shards, which ones should I fetch to recover?
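A sketch of how the plugin and the k/m parameters are chosen when creating an erasure coded pool, reusing the same hypothetical command wrapper; the profile name, pool name, PG count, and failure domain are assumptions, and the exact profile keys have shifted slightly between Ceph releases.

```python
import subprocess

def ceph(*args):
    subprocess.run(('ceph',) + args, check=True)

# Hypothetical 4+2 jerasure profile: any 4 of the 6 shards can reconstruct the object.
ceph('osd', 'erasure-code-profile', 'set', 'ec42',
     'plugin=jerasure', 'technique=reed_sol_van', 'k=4', 'm=2',
     'ruleset-failure-domain=host')
ceph('osd', 'pool', 'create', 'cold-ec', '128', '128', 'erasure', 'ec42')
```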

Page 67: Storage tiering and erasure coding in Ceph (SCaLE13x)

COST OF RECOVERY

(Diagram: a 1 TB OSD.)

Page 68: Storage tiering and erasure coding in Ceph (SCaLE13x)

COST OF RECOVERY

(Diagram: the 1 TB OSD fails; its data must be recovered.)

Page 69: Storage tiering and erasure coding in Ceph (SCaLE13x)

COST OF RECOVERY (REPLICATION)

(Diagram: with replication, about 1 TB must be copied to restore the failed 1 TB OSD.)

Page 70: Storage tiering and erasure coding in Ceph (SCaLE13x)

COST OF RECOVERY (REPLICATION)

(Diagram: with declustered placement, that 1 TB is recovered as many ~.01 TB pieces handled by many OSDs in parallel.)

Page 71: Storage tiering and erasure coding in Ceph (SCaLE13x)

COST OF RECOVERY (REPLICATION)

(Diagram: in total, roughly 1 TB is read to re-replicate the lost 1 TB.)

Page 72: Storage tiering and erasure coding in Ceph (SCaLE13x)

COST OF RECOVERY (EC)

(Diagram: with erasure coding, rebuilding the 1 TB of lost shards requires reading about 1 TB from each of the k surviving shards, here 4 x 1 TB.)
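The build-up above boils down to simple arithmetic: re-replicating lost data reads roughly as many bytes as were lost, while rebuilding erasure coded shards reads about k times as much. A rough sketch that ignores declustering and parallelism:

```python
def replication_recovery_read_tb(lost_tb: float) -> float:
    """Bytes read to rebuild lost replicated data: one surviving copy is enough."""
    return lost_tb

def ec_recovery_read_tb(lost_tb: float, k: int) -> float:
    """Bytes read to rebuild lost EC shards: k surviving shards per stripe."""
    return lost_tb * k

print(replication_recovery_read_tb(1.0))   # ~1 TB read to re-replicate a 1 TB OSD
print(ec_recovery_read_tb(1.0, k=4))       # ~4 TB read to rebuild the same 1 TB of EC shards
```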

Page 73: Storage tiering and erasure coding in Ceph (SCaLE13x)

LOCAL RECOVERY CODE (LRC)

(Diagram: alongside the data shards 1-4 and coding shards X and Y, the LRC layer stores additional local recovery shards A, B, and C, so a single lost shard can be rebuilt from a small local group rather than by reading k shards.)

Page 74: Storage tiering and erasure coding in Ceph (SCaLE13x)


BIG THANKS TO

● Ceph

– Loic Dachary (CloudWatt, FSF France, Red Hat)

– Andreas Peters (CERN)

– Sam Just (Inktank / Red Hat)

– David Zafman (Inktank / Red Hat)

● jerasure / gf-complete

– Jim Plank (University of Tennessee)

– Kevin Greenan (Box.com)

● Intel (ISA-L plugin)

● Fujitsu (SHEC plugin)

Page 75: Storage tiering and erasure coding in Ceph (SCaLE13x)

ROADMAP

Page 76: Storage tiering and erasure coding in Ceph (SCaLE13x)


WHAT'S NEXT

● Erasure coding

– Allow (optimistic) client reads directly from shards

– ARM optimizations for jerasure

● Cache pools

– Better agent decisions (when to flush or evict)

– Supporting different performance profiles

● e.g., slow / “cheap” flash can read just as fast

– Complex topologies

● Multiple readonly cache tiers in multiple sites

● Tiering

– Support “redirects” to (very) cold tier below base pool

– Enable dynamic spin-down, dedup, and other features

Page 77: Storage tiering and erasure coding in Ceph (SCaLE13x)


OTHER ONGOING WORK

● Performance optimization (SanDisk, Intel, Mellanox)

● Alternative OSD backends

– New backend: hybrid key/value and file system

– leveldb, rocksdb, LMDB

● Messenger (network layer) improvements

– RDMA support (libxio – Mellanox)

– Event-driven TCP implementation (UnitedStack)

● CephFS

– Online consistency checking and repair tools

– Performance, robustness

● Multi-datacenter RBD, RADOS replication

Page 78: Storage tiering and erasure coding in Ceph (SCaLE13x)


FOR MORE INFORMATION

● http://ceph.com

● http://github.com/ceph

● http://tracker.ceph.com

● Mailing lists

[email protected]

[email protected]

● irc.oftc.net

– #ceph

– #ceph-devel

● Twitter

– @ceph

Page 79: Storage tiering and erasure coding in Ceph (SCaLE13x)

THANK YOU!

Sage Weil
CEPH PRINCIPAL ARCHITECT

[email protected]

@liewegas