Use Distributed Filesystem as a Storage Tier

Fabrizio Manfred Furuholmen Use Distributed File system as a Storage Tier

description

Storage is one of the most important parts of a data center, and the complexity of designing, building and delivering a "24/forever" availability service increases every year. One of the best solutions to these problems is a distributed filesystem (DFS). This talk describes the basic architectures of DFS and compares different free software solutions in order to show what makes a DFS suitable for large-scale distributed environments. We explain how to use and deploy each solution, along with its advantages and disadvantages, performance and layout. We also introduce some case studies of implementations based on OpenAFS, GlusterFS and Hadoop, aimed at building your own cloud storage.

Transcript of Use Distributed Filesystem as a Storage Tier

Page 1: Use Distributed Filesystem as a Storage Tier

Fabrizio Manfred Furuholmen

Use Distributed File system as a Storage Tier

Page 2: Use Distributed Filesystem as a Storage Tier

14/04/2023

Agenda

Introduction: Next Generation Data Center, Distributed File system

Distributed File system: OpenAFS, GlusterFS, HDFS, Ceph

Case Studies

Conclusion

Page 3: Use Distributed Filesystem as a Storage Tier

Class Exam

What do you know about DFS ?

How can you create a Petabyte storage ?

How can you make a centralized system log ?

How can you allocate space for your users or systems when you have thousands of users/systems ?

How can you retrieve data from everywhere ?

Page 4: Use Distributed Filesystem as a Storage Tier

Introduction

Next Generation Data Center: the “FABRIC”

Key categories:

Continuous data protection and disaster recovery

File and block data migration across heterogeneous environments

Server and storage virtualization

Encryption for data in-flight and at-rest

In other words: Cloud data center

Page 5: Use Distributed Filesystem as a Storage Tier

Introduction

Storage Tier in the “FABRIC”: High Performance, Scalability, Simplified Management, Security, High Availability

Solutions: Storage Area Network, Network Attached Storage, Distributed file system

Page 6: Use Distributed Filesystem as a Storage Tier

Introduction

What is a Distributed File system ?

“A distributed file system takes advantage of the interconnected nature of the network by storing files on more than one computer in the network and making them accessible to all of them.”

Page 7: Use Distributed Filesystem as a Storage Tier

Introduction

Page 8: Use Distributed Filesystem as a Storage Tier

Part II

Implementations

How many DFS do you know ?

Page 9: Use Distributed Filesystem as a Storage Tier

OpenAFS: introduction

Key ideas: Make clients do work whenever possible.

Cache whenever possible.

Exploit file usage properties. Understand them. One-third of Unix files are temporary.

Minimize system-wide knowledge and change. Do not hardwire locations.

Trust the fewest possible entities. Do not trust workstations.

Batch if possible to group operations.

OpenAFS is the open source implementation of the Andrew File System (AFS) from IBM.
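The "cache whenever possible" and "make clients do work" ideas can be illustrated with a toy sketch. This is not OpenAFS code: the classes `FileServer` and `CachingClient` are hypothetical, and the invalidation call is only a simplified stand-in for the way an AFS-style server tells clients that a cached copy has gone stale.

```python
class FileServer:
    """Toy server: stores files and remembers which clients cached them."""

    def __init__(self):
        self.files = {}
        self.watchers = {}  # path -> set of clients holding a valid copy
        self.reads = 0      # fetches actually served by the server

    def fetch(self, path, client):
        self.reads += 1
        self.watchers.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, path, data, writer=None):
        self.files[path] = data
        # Tell every client except the writer that its copy is stale.
        for client in self.watchers.get(path, set()) - {writer}:
            client.invalidate(path)
        # Only the writer (if any) still holds a valid copy.
        self.watchers[path] = {writer} if writer is not None else set()


class CachingClient:
    """Toy client: serves repeated reads from its local cache."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.fetch(path, self)
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data
        self.server.store(path, data, writer=self)

    def invalidate(self, path):
        self.cache.pop(path, None)
```

In this sketch a second read of the same path never reaches the server, which is the point of pushing work to the clients: the server is contacted only on a cache miss or a write.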

Page 10: Use Distributed Filesystem as a Storage Tier

OpenAFS: design

Page 11: Use Distributed Filesystem as a Storage Tier

OpenAFS: components

Server A

Server A+B

Server C

Page 12: Use Distributed Filesystem as a Storage Tier

OpenAFS: performances

[Figure: write and read performance surface charts; axes labeled kb and block; series labeled OpenAFS, OpenAFS OSD 2 Servers]

Page 13: Use Distributed Filesystem as a Storage Tier

OpenAFS: features

Uniform name space: same path on all workstations

Security: based on krb4/krb5, extended ACLs, traffic encryption

Reliability: read-only replication, HA database, read/write replica in OSD version

Availability: maintenance tasks without stopping the service

Scalability: server aggregation

Administration: administration delegation

Performance: client-side, disk-based persistent cache; high ratio of clients per server

Page 14: Use Distributed Filesystem as a Storage Tier

OpenAFS: who uses it ?

Page 15: Use Distributed Filesystem as a Storage Tier

OpenAFS: good for ...

Page 16: Use Distributed Filesystem as a Storage Tier

GlusterFS

“Gluster can manage data in a single global namespace on commodity hardware..”

Keys:

Lower Storage Cost: Open source software runs on commodity hardware

Scalability: Linearly scales to hundreds of Petabytes

Performance: No metadata server means no bottlenecks

High Availability: Data mirroring and real time self-healing

Virtual Storage for Virtual Servers: Simplifies storage and keeps VMs always-on

Simplicity: Complete web based management suite
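The "no metadata server" point can be made concrete with a small sketch. It only illustrates the general idea of locating a file by hashing its path, so that any client can compute the location without asking a central service. GlusterFS's own distribution logic is more elaborate than this, and the function name `pick_brick` and the brick names are hypothetical.

```python
import hashlib


def pick_brick(path, bricks):
    """Map a file path to one brick using only a hash of the path.

    Every client that knows the same brick list computes the same answer,
    so no lookup against a metadata server is needed.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(bricks)
    return bricks[index]


bricks = ["server1:/export", "server2:/export", "server3:/export"]
for name in ("/video/a.mpg", "/video/b.mpg", "/logs/syslog"):
    print(name, "->", pick_brick(name, bricks))
```

A plain modulo like this would reshuffle most files whenever the brick list changes, which is one reason real systems use more careful schemes.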

Page 17: Use Distributed Filesystem as a Storage Tier

GlusterFS: design

Page 18: Use Distributed Filesystem as a Storage Tier

GlusterFS: components

volume posix1
  type storage/posix
  option directory /home/export1
end-volume

volume brick1
  type features/posix-locks
  option mandatory
  subvolumes posix1
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option transport.socket.listen-port 6996
  subvolumes brick1
  option auth.addr.brick1.allow *
end-volume

Page 19: Use Distributed Filesystem as a Storage Tier

Gluster: components

Page 20: Use Distributed Filesystem as a Storage Tier

Gluster: performance

Page 21: Use Distributed Filesystem as a Storage Tier

Gluster: characteristics

Uniform name space: same path on all workstations

Reliability: read-1 replication, asynchronous replication for disaster recovery

Availability: No system downtime for maintenance (better in the next release)

Scalability: Truly linear scalability

Administration: Self Healing, Centralized logging and reporting, Appliance version

Performance: Stripe files across dozens of storage blocks, Automatic load balancing, per volume i/o tuning

Page 22: Use Distributed Filesystem as a Storage Tier

Gluster: who uses it ?

Avail TVN (USA): 400TB for Video on demand, video storage

Fido Film (Sweden): visual FX and Animation studio

University of Minnesota (USA): 142TB Supercomputing

Partners Healthcare (USA): 336TB Integrated health system

Origo (Switzerland): open source software development and collaboration platform

Page 23: Use Distributed Filesystem as a Storage Tier

Gluster: good for ...

Page 24: Use Distributed Filesystem as a Storage Tier

Implementations

Old way:

Metadata and data in the same place

Single stream per file

New way:

Multiple streams are parallel channels through which data can flow

Files are striped across a set of nodes in order to facilitate parallel access

OSD: separation of file metadata management (MDS) from the storage of file data
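Striping can be sketched in a few lines. The example below is an illustration only, not the layout of any particular file system: the functions `stripe` and `unstripe` are hypothetical, and the "nodes" are plain Python lists standing in for storage servers.

```python
def stripe(data, node_count, stripe_size):
    """Cut data into stripe_size chunks and deal them round-robin to nodes."""
    nodes = [[] for _ in range(node_count)]
    for offset in range(0, len(data), stripe_size):
        chunk_index = offset // stripe_size
        nodes[chunk_index % node_count].append(data[offset:offset + stripe_size])
    return nodes


def unstripe(nodes):
    """Rebuild the original bytes by reading the nodes in round-robin order."""
    chunks = []
    longest = max((len(node) for node in nodes), default=0)
    for row in range(longest):
        for node in nodes:
            if row < len(node):
                chunks.append(node[row])
    return b"".join(chunks)


layout = stripe(b"0123456789", node_count=3, stripe_size=2)
print(layout)
print(unstripe(layout))
```

Because consecutive chunks live on different nodes, a reader can request them over several streams at once instead of waiting on a single server.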

Page 25: Use Distributed Filesystem as a Storage Tier

HDFS: Hadoop

HDFS is part of the Apache Hadoop project which develops open-source software for reliable, scalable, distributed computing.

Hadoop was inspired by Google’s MapReduce and Google File System.

Page 26: Use Distributed Filesystem as a Storage Tier

HDFS: Google File System

“Design of a file system for a different environment, where the assumptions of a general purpose file system do not hold; interesting to see how new assumptions lead to a different type of system…”

Key ideas:

Component failures are the norm.

Huge files (not just the occasional file).

Append rather than overwrite is typical.

Co-design of application and file system API (specialization). For example, can have relaxed consistency.

Page 27: Use Distributed Filesystem as a Storage Tier

“Moving Computation is Cheaper than Moving Data”

HDFS: MapReduce
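The quoted principle can be sketched with a word count. This is plain Python, not the Hadoop API: `map_phase` and `reduce_phase` are hypothetical names, and each string in `blocks` stands in for a block stored on a different node. The map step is meant to run where the block lives, so only the small per-block summaries would cross the network.

```python
from collections import Counter


def map_phase(local_block):
    """Runs on the node that stores the block: count words locally."""
    return Counter(local_block.split())


def reduce_phase(partial_counts):
    """Runs once, wherever the results are collected: merge the summaries."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


blocks = ["error warn error", "warn info", "error"]
summaries = [map_phase(block) for block in blocks]  # one per node
print(reduce_phase(summaries))
```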

Page 28: Use Distributed Filesystem as a Storage Tier

HDFS: goals

Page 29: Use Distributed Filesystem as a Storage Tier

HDFS: design

Page 30: Use Distributed Filesystem as a Storage Tier

HDFS: components

Page 31: Use Distributed Filesystem as a Storage Tier

HDFS: features

Uniform name space: same path on all workstations

Reliability: rw replication, re-balancing, copy in different locations

Availability: hot deploy

Scalability: server aggregation

Administration: HOD

Performance: “grid” computation, parallel transfer

Page 32: Use Distributed Filesystem as a Storage Tier

HDFS: who uses it ?

Major players

Yahoo!, A9.com, AOL, Booz Allen Hamilton, EHarmony, Facebook, Freebase, Fox Interactive Media, IBM, ImageShack, ISI, Joost, Last.fm, LinkedIn, Metaweb, Meebo, Ning, Powerset (now part of Microsoft), Proteus Technologies, The New York Times, Rackspace, Veoh, Twitter, …

Page 33: Use Distributed Filesystem as a Storage Tier

HDFS: good for ...

Page 34: Use Distributed Filesystem as a Storage Tier

Ceph

“Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory, usage scenarios that bring typical enterprise storage systems to their knees.”

Keys:

Seamless scaling: The file system can be seamlessly expanded by simply adding storage nodes (OSDs). However, unlike most existing file systems, Ceph proactively migrates data onto new devices in order to maintain a balanced distribution of data.

Strong reliability and fast recovery: All data is replicated across multiple OSDs. If any OSD fails, data is automatically re-replicated to other devices.

Adaptive MDS: The Ceph metadata server (MDS) is designed to dynamically adapt its behavior to the current workload.
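The rebalancing and re-replication behavior described above can be approximated with a small sketch. Ceph's actual placement algorithm (CRUSH) is considerably more elaborate; rendezvous hashing is swapped in here only because it shows the same two properties in a few lines. The function names and OSD labels are hypothetical.

```python
import hashlib


def score(obj, osd):
    """Deterministic pseudo-random weight for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{obj}|{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj, osds, replicas=2):
    """Return the OSDs holding the object: those with the highest scores."""
    ranked = sorted(osds, key=lambda osd: score(obj, osd), reverse=True)
    return ranked[:replicas]


def moved_fraction(objects, old_osds, new_osds, replicas=2):
    """Fraction of objects whose replica set differs between two layouts."""
    changed = sum(
        1
        for obj in objects
        if set(place(obj, old_osds, replicas)) != set(place(obj, new_osds, replicas))
    )
    return changed / len(objects)


objects = [f"obj-{i}" for i in range(1000)]
before = ["osd0", "osd1", "osd2", "osd3"]
after = before + ["osd4"]
print("moved after adding osd4:", moved_fraction(objects, before, after))
```

In this scheme an object moves only if the new OSD wins one of its replica slots, and when an OSD is removed the surviving replicas stay where they are while a replacement copy is chosen elsewhere.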

Page 35: Use Distributed Filesystem as a Storage Tier

Ceph: design

Page 36: Use Distributed Filesystem as a Storage Tier

Ceph: features

Page 37: Use Distributed Filesystem as a Storage Tier

Ceph: features

Page 38: Use Distributed Filesystem as a Storage Tier

Ceph: features

Page 39: Use Distributed Filesystem as a Storage Tier

Ceph: good for …

Page 40: Use Distributed Filesystem as a Storage Tier

Others

Page 41: Use Distributed Filesystem as a Storage Tier

Part III

Case Studies

Page 42: Use Distributed Filesystem as a Storage Tier

Class Exam

What can DFS do for you ?

How can you create a Petabyte storage ?

How can you make a centralized system log ?

How can you allocate space for your users or systems when you have thousands of users/systems ?

How can you retrieve data from everywhere ?

Page 43: Use Distributed Filesystem as a Storage Tier

File sharing

Page 44: Use Distributed Filesystem as a Storage Tier

Web Service

Page 45: Use Distributed Filesystem as a Storage Tier

Internet Disk: myS3

Page 46: Use Distributed Filesystem as a Storage Tier

Log concentrator

Page 47: Use Distributed Filesystem as a Storage Tier

Private cloud

Page 48: Use Distributed Filesystem as a Storage Tier

Conclusion: problems

Failure: For 10 PB of storage, you will have an average of 22 consumer-grade SATA drives failing per day.

Read/write time: Each of the 2TB drives takes, in the best case, approximately 24,390 seconds to be read and written over the network.

Data Replication: Data replication is the number of the disk drives, plus difference.

Do you have enough bandwidth ?
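The read/write figure above can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and treats the replication factor as a free parameter; neither is stated on the slide, and the helper names are hypothetical.

```python
TB = 10**12  # decimal terabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)


def drives_needed(capacity_bytes, drive_bytes, replicas=1):
    """Whole drives required to hold `replicas` copies of the capacity."""
    raw = capacity_bytes * replicas
    return -(-raw // drive_bytes)  # ceiling division on integers


def implied_rate(size_bytes, seconds):
    """Average transfer rate in bytes/s implied by a size and a duration."""
    return size_bytes / seconds


def transfer_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


rate = implied_rate(2 * TB, 24_390)
print(f"implied rate: {rate / 10**6:.1f} MB/s")
print(f"one 2TB drive: {transfer_hours(2 * TB, rate):.2f} hours")
print(f"drives for 10 PB, 1 copy: {drives_needed(10 * PB, 2 * TB)}")
print(f"drives for 10 PB, 3 copies: {drives_needed(10 * PB, 2 * TB, replicas=3)}")
```

Under these assumptions the 24,390 second figure corresponds to a sustained rate of roughly 82 MB/s, or a little under seven hours per drive, which is why the bandwidth question matters when a failed drive has to be re-replicated.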

Page 49: Use Distributed Filesystem as a Storage Tier

Conclusion

Page 50: Use Distributed Filesystem as a Storage Tier

Conclusion: next step

Page 51: Use Distributed Filesystem as a Storage Tier

Links

Page 52: Use Distributed Filesystem as a Storage Tier

I look forward to meeting you…

XVII European AFS meeting 2010 PILSEN - CZECH REPUBLIC

September 13-15

Who should attend:

Everyone interested in deploying a globally accessible file system

Everyone interested in learning more about real world usage of Kerberos authentication in single realm and federated single sign-on environments

Everyone who wants to share their knowledge and experience with other members of the AFS and Kerberos communities

Everyone who wants to find out the latest developments affecting AFS and Kerberos

More Info: http://afs2010.civ.zcu.cz/

Page 53: Use Distributed Filesystem as a Storage Tier

Thank you
