
GlusterFS Internals and Directions

Jeff Darcy, Principal Engineer, Red Hat
13 June, 2013

GlusterFS is not a filesystem

Wait . . . what?

● GlusterFS is a scalable general purpose storage platform

● We handle common storage tasks
    ● cluster management and configuration
    ● data distribution and replication
    ● common control and data structures

● That platform can be used many different ways

Interface Possibilities

[Diagram labels]

● Interfaces: qemu, NFS, SMB, Hadoop, FUSE, Cinder, Swift (UFO)

● Files, Blocks, Objects, Whatever

● libgfapi

● Transports: IP, RDMA

● Back ends: files, BD, DB

OpenStack and GlusterFS – Current Integration

[Diagram: Logical View and Physical View. Labels: Glance Images, Nova Nodes, Swift Objects, Cinder Data, Glance Data, Swift Data, Swift API, Storage Servers, KVM]

● Separate Compute and Storage Pools

● GlusterFS directly provides Swift object service

● Integration with Keystone

● GeoReplication for multi-site support

● Swift data also available via other protocols

● Supports non-OpenStack use in addition to OpenStack use

OpenStack and GlusterFS - Future Direction

[Diagram: Nova Compute Nodes. Each Host runs a Gluster Guest alongside a Hadoop Guest, Other Guest, ...]

OpenStack and GlusterFS - Future Direction

● POC based on the proposed OpenStack FaaS (File as a Service)

● Cinder-like virtual NAS service

● Tenant-specific file shares

● Hypervisor mediated for security

● Avoid exposing servers to Quantum tenant network

● Optional multi-site or multi-zone GeoReplication

● FaaS data optionally available to non-OpenStack nodes

● Initial focus on Linux guest

● Windows (SMB) and NFS shares also under consideration

Making Hard Stuff Easier

● Distributed filesystems are notoriously hard to set up
    ● multiple experts for multiple weeks is “normal”

● How about four CLI commands?
    ● probe peer, create volume, start volume, mount

● We handle cluster membership, process management, port mapping, dynamic configuration changes, etc.

● add/remove nodes on the fly

● add/remove features on the fly

● rolling upgrade
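
To make the "four CLI commands" concrete, here is a minimal sketch that only assembles the command lines and prints them; nothing is executed. The host names, volume name, brick paths, and mount point are hypothetical, and the exact `gluster` syntax should be checked against the documentation for the installed version.

```python
import shlex


def setup_commands(peer, volume, bricks, mount_server, mount_point):
    """Return the four setup steps as argv lists (nothing is executed here)."""
    return [
        # 1. probe peer: add another server to the trusted pool
        ["gluster", "peer", "probe", peer],
        # 2. create volume: a plain distributed volume across the given bricks
        ["gluster", "volume", "create", volume, *bricks],
        # 3. start volume
        ["gluster", "volume", "start", volume],
        # 4. mount: native (FUSE) client mount
        ["mount", "-t", "glusterfs", f"{mount_server}:/{volume}", mount_point],
    ]


# All names below are made up for illustration.
commands = setup_commands(
    peer="server2",
    volume="demo",
    bricks=["server1:/bricks/demo", "server2:/bricks/demo"],
    mount_server="server1",
    mount_point="/mnt/demo",
)

for argv in commands:
    print(shlex.join(argv))
```

Each argv list could be handed to `subprocess.run` on a real cluster; keeping them as data makes the sequence easy to review first.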

Q: How Do We Do It?

[Diagram: the stack is assembled from three groups of modules]

● one of... FUSE, ..., libgfapi

● ...plus all of... Distribution, Replication, ..., RPC Client

● ...plus all of... RPC Server, ..., Local Storage

● LocalFS

A: Modularity!

Deep Dive: Distribution

[Diagram: same module stack as in “Q: How Do We Do It?”]

Elastic Hashing

[Diagram: File X and File Y mapped onto Server A, Server B, Server C]

● Deterministic mapping: object hash → server
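
A minimal sketch of a deterministic hash-to-server mapping, assuming the hash space is split into equal contiguous ranges with one range per server. The hash function here (truncated MD5) is a stand-in and not the one GlusterFS uses, and the server and file names are hypothetical.

```python
import bisect
import hashlib

HASH_SPACE = 2**32


def name_hash(name):
    """Stand-in 32-bit hash of a file name (an assumption, not GlusterFS's hash)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")


def build_layout(servers):
    """Split the hash space into equal contiguous ranges, one per server.

    Returns a list of (range_start, server) pairs sorted by range_start.
    """
    step = HASH_SPACE // len(servers)
    return [(i * step, server) for i, server in enumerate(servers)]


def locate(layout, name):
    """Find the server whose range contains the hash of the name."""
    starts = [start for start, _ in layout]
    # Last range whose start is <= the hash value.
    index = bisect.bisect_right(starts, name_hash(name)) - 1
    return layout[index][1]


servers = ["server-a", "server-b", "server-c"]
layout = build_layout(servers)

for name in ["file-x", "file-y"]:
    print(name, "->", locate(layout, name))
```

Because the mapping depends only on the name and the layout, any client holding the same layout finds the same server without asking a central metadata service.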

Adding a Node

[Diagram: Server D added alongside Server A, Server B, Server C; File X and File Y]

● Minimize reassignment when server set changes
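
To illustrate why the placement scheme matters when a server is added, the sketch below compares naive "hash modulo server count" placement with a classic consistent-hashing ring. The ring is a textbook stand-in chosen for illustration, not a description of GlusterFS's own layout scheme; the hash function and all names are assumptions.

```python
import bisect
import hashlib


def h32(key):
    """Stand-in 32-bit hash (not the hash GlusterFS itself uses)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")


def modulo_owner(name, servers):
    """Naive placement: hash modulo the number of servers."""
    return servers[h32(name) % len(servers)]


def make_ring(servers, points_per_server=100):
    """Place several hash points per server on a ring, sorted by position."""
    points = sorted(
        (h32(f"{server}#{i}"), server)
        for server in servers
        for i in range(points_per_server)
    )
    return [pos for pos, _ in points], [server for _, server in points]


def ring_owner(name, ring):
    """A file belongs to the first server point at or after its hash, wrapping around."""
    positions, owners = ring
    index = bisect.bisect_left(positions, h32(name)) % len(positions)
    return owners[index]


names = [f"file-{i}" for i in range(5000)]
old, new = ["A", "B", "C"], ["A", "B", "C", "D"]
old_ring, new_ring = make_ring(old), make_ring(new)

moved_modulo = [n for n in names if modulo_owner(n, old) != modulo_owner(n, new)]
moved_ring = [n for n in names if ring_owner(n, old_ring) != ring_owner(n, new_ring)]

print(f"modulo: {len(moved_modulo) / len(names):.0%} of files move")
print(f"ring:   {len(moved_ring) / len(names):.0%} of files move")
```

With the ring, the only files that change owner are the ones that land on the new server, which is the "minimize reassignment" property the slide asks for.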

Rebalancing

● Goal: optimal layout with minimal data movement

● Greatly improved algorithms in 3.4

Future: Tiering and Topology Awareness

● General deterministic matching function: file attributes to storage attributes

● Currently both attributes are hashes, but...
    ● file attribute could be account ID, age, ...
    ● storage attribute could be disk type (SSD), replication level, ...
    ● either could be an arbitrary tag

● Rebalance etc. “just work” regardless

● Algorithms can be stacked on top of one another

Tiering Example

[Diagram labels: Volume (select by path); Development (random); Production (select by age); SSDs (random); Replicated (random)]
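
One possible reading of the tiering example, sketched as stacked selector functions. The nesting is an assumption inferred from the diagram labels: the volume selects Development or Production by path, and Production selects SSDs or Replicated by age. The path prefixes, the 30-day threshold, the choice to send recent files to SSDs, and the brick names are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    age_days: int


def hashed_choice(bricks):
    """'random' placement: a deterministic hash of the path picks one brick."""
    def select(info):
        digest = hashlib.md5(info.path.encode()).digest()
        return bricks[int.from_bytes(digest[:4], "big") % len(bricks)]
    return select


def by_path(prefix_to_selector, default):
    """Select a lower-level selector by path prefix."""
    def select(info):
        for prefix, selector in prefix_to_selector.items():
            if info.path.startswith(prefix):
                return selector(info)
        return default(info)
    return select


def by_age(max_age_days, young, old):
    """Select a lower-level selector by file age."""
    def select(info):
        return young(info) if info.age_days <= max_age_days else old(info)
    return select


# Selectors stack: each level only knows about the level directly below it.
ssds = hashed_choice(["ssd-1", "ssd-2"])
replicated = hashed_choice(["repl-1", "repl-2"])
development = hashed_choice(["dev-1", "dev-2"])
production = by_age(30, young=ssds, old=replicated)
volume = by_path(
    {"/development/": development, "/production/": production},
    default=production,
)

print(volume(FileInfo("/production/report.csv", age_days=2)))
print(volume(FileInfo("/production/archive.csv", age_days=400)))
print(volume(FileInfo("/development/scratch.txt", age_days=400)))
```

Every level is a deterministic function of file attributes, so a lookup never needs a central table, whichever attributes a level matches on.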

Deep Dive: Replication

[Diagram: same module stack as in “Q: How Do We Do It?”]

Replicated Writes

[Diagram: Client writes to Server A and Server B; sequence: lock, xattr+, write, xattr-, unlock]

● Many optimizations avoid the lock/xattr ops
    ● especially for sequential writes

● Still synchronous
    ● don't try this on a high-latency network
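
The lock, xattr+, write, xattr-, unlock sequence can be sketched with in-memory replicas. This is a simplified model under stated assumptions, not the actual replication code: the `pending` dictionary stands in for the extended attributes, and the class and function names are invented for the example.

```python
import threading


class Replica:
    """In-memory stand-in for one replica of a file."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.lock = threading.Lock()
        self.data = bytearray()
        # Stand-in for xattrs: replica name -> writes that may not have landed there.
        self.pending = {}


def replicated_write(replicas, offset, payload):
    """Write synchronously to every reachable replica; return the names written."""
    reachable = [r for r in replicas if r.online]
    all_names = [r.name for r in replicas]
    for r in reachable:                      # lock
        r.lock.acquire()
    try:
        for r in reachable:                  # xattr+: mark every replica as pending
            for name in all_names:
                r.pending[name] = r.pending.get(name, 0) + 1
        end = offset + len(payload)
        for r in reachable:                  # write
            if len(r.data) < end:
                r.data.extend(bytes(end - len(r.data)))
            r.data[offset:end] = payload
        written = [r.name for r in reachable]
        for r in reachable:                  # xattr-: clear marks where the write landed
            for name in written:
                r.pending[name] -= 1
    finally:
        for r in reachable:                  # unlock
            r.lock.release()
    return written


a, b = Replica("A"), Replica("B")
replicated_write([a, b], 0, b"hello")
b.online = False                             # B becomes unreachable
replicated_write([a, b], 0, b"HELLO")
print(a.pending)                             # A still records a pending write for B
```

The counter left behind on the surviving replica is what later tells a repair process which copy is stale, and the round trips in this sequence are why latency hurts.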

Self Heal

● Generation 1: on demand

● Generation 2: full manual scan

● Generation 3: parallel, automatic repair
    ● index based
    ● GlusterFS 3.3, RHS 2.0

● Future: journal based
    ● even more precise (i.e. faster)
    ● lower overhead

Split Brain

[Diagram: a network partition separates Server A and Server B; Client 1 writes “foo” and Client 2 writes “bar”]

Split Brain (continued)

● In 3.3: basic quorum enforcement
    ● client side, replica-set level
    ● poor approach for N=2

● In 3.4: advanced quorum enforcement
    ● server side, cluster level

● In 3.5: hyper-advanced (?) quorum enforcement
    ● volume level
    ● arbiters (best approach for N=2)
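
A small worked example of why plain quorum is a poor fit for N=2 and why an arbiter helps, assuming a strict-majority rule. That rule is an assumption made for illustration; it is not necessarily the exact rule each GlusterFS release applies.

```python
def has_quorum(reachable_votes, total_votes):
    """Strict majority: more than half of all votes must be reachable."""
    return reachable_votes > total_votes / 2


# Two replicas, partitioned: each side sees only its own vote.
# Neither side may write, so a single failure blocks all writes.
two_replicas_one_side = has_quorum(1, 2)

# Two replicas plus an arbiter (a third vote that holds no file data):
# the side that can still reach the arbiter keeps writing, the other stops.
arbiter_majority_side = has_quorum(2, 3)
arbiter_minority_side = has_quorum(1, 3)

print(two_replicas_one_side, arbiter_majority_side, arbiter_minority_side)
# prints: False True False
```

At most one side of a partition can hold a strict majority, so the conflicting “foo” and “bar” writes from the split-brain scenario cannot both be accepted.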

Access Methods (past)

[Diagram labels: FUSE, NFS, Samba, Swift, Hadoop; Distribution, Replication, ..., RPC Client]

Access Methods (present)

[Diagram labels: FUSE, NFS, Samba, Swift, Hadoop, qemu, libgfapi; Distribution, Replication, ..., RPC Client]

Access Methods (future)

[Diagram labels: FUSE, NFS, Samba, Swift, Hadoop, qemu, libgfapi, Your API; Distribution, Replication, ..., RPC Client]

What is libgfapi?

● User-space library for accessing data in GlusterFS

● Filesystem-like API

● Runs in application process
    ● no FUSE, no copies, no context switches
    ● ...but same volfiles, translators, etc.

● Could be used for Apache/nginx modules, MPI I/O (maybe), Ganesha, etc. ad infinitum

● BTW it's usable from Python too :)

Translator API

● If libgfapi isn't enough, you can write your own translators (including glupy for Python)

● Most of what we already do is in translators

● It's a public (though not well documented) API
    ● “Translator 101” series, forge.gluster.org

● Translators are right in the I/O path

● Current examples: encryption, erasure coding

● Other possibilities: dedup/compression, format translation, indexing
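
The idea of translators sitting "right in the I/O path" can be sketched as a stack of objects that each handle an operation and pass it to the layer below. This is a generic illustration, not the real translator API and not glupy; the class names, method names, and the ROT13 "encryption" are invented for the example.

```python
import codecs


class Translator:
    """Base layer: by default, pass every operation straight to the child."""

    def __init__(self, child=None):
        self.child = child

    def write(self, path, text):
        return self.child.write(path, text)

    def read(self, path):
        return self.child.read(path)


class MemoryStore(Translator):
    """Bottom of the stack: keeps file contents in a dictionary."""

    def __init__(self):
        super().__init__()
        self.files = {}

    def write(self, path, text):
        self.files[path] = text

    def read(self, path):
        return self.files[path]


class Rot13(Translator):
    """Toy 'encryption' layer: transforms data on the way down and back up."""

    def write(self, path, text):
        return self.child.write(path, codecs.encode(text, "rot13"))

    def read(self, path):
        return codecs.decode(self.child.read(path), "rot13")


class OpCounter(Translator):
    """Observing layer: counts operations without changing them."""

    def __init__(self, child):
        super().__init__(child)
        self.ops = 0

    def write(self, path, text):
        self.ops += 1
        return super().write(path, text)

    def read(self, path):
        self.ops += 1
        return super().read(path)


store = MemoryStore()
stack = OpCounter(Rot13(store))   # counter on top, ROT13 in the middle, storage below

stack.write("/greeting", "hello")
print(store.files["/greeting"])   # what the bottom layer actually stored
print(stack.read("/greeting"))    # what the caller sees
print(stack.ops)
```

Because every layer exposes the same operations, layers can be added, removed, or reordered without the caller or the storage layer changing.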

http://www.gluster.org

● Modularity makes it all possible

● Expect:
    ● OpenStack Hadoop OpenStack Hadoop ...
        ● marketing made me say that
    ● more front-end protocols
    ● more back-end storage options
    ● more functionality within the I/O path
    ● more performance enhancements

● Make the storage system you want