
Page 1

Network filesystems in heterogeneous cloud applications

Supervisor: Massimo Masera (Università di Torino, INFN) — Company Tutor: Stefano Bagnasco (INFN, TO) — Tutor: Dario Berzano (INFN, TO)

candidate: Matteo Concas

Page 2

Computing @LHC: how is the GRID structured?

ATLAS, CMS, ALICE, LHCb → ~15 PB/year of raw data (~1 GB/s)

Tier-0 → Tier-1 → Tier-2

- Tier-1: FZK (Karlsruhe), CNAF* (Bologna), IN2P3 (Lyon), ...
- Tier-2: Catania, Torino, Bari, Legnaro, ...

Data are distributed over a federated network called the Grid, which is hierarchically organized in Tiers.

Page 3

Computing infrastructure @INFN Torino

(Diagram: pools of virtual machines running on the Torino infrastructure; *V.M. = virtual machine.)

Grid node. Batch processes: submitted jobs are queued and executed as soon as enough free resources are available. Output is stored asynchronously on the Grid data storage (job submission, data retrieval).

ALICE PROOF facility. Interactive processes: all resources are allocated at the same time. Job splitting is dynamic and results are returned immediately to the client (continuous 2-way communication).

Generic virtual farms. VMs can be added dynamically and removed as needed. The end user doesn't know how his/her farm is physically structured (remote login, cloud storage).

(Storage shown in the diagram: legacy Tier-2 data storage and new-generation cloud storage.)

Page 4

Distributing and federating the storage

Page 5

Introduction: Distributed storage

● Aggregation of several storage systems:

○ Several nodes and disks seen as one pool in the same LAN (Local Area Network)

○ Many pools aggregated geographically through WAN → cloud storage (Wide Area Network)

○ Concurrent access by many clients is optimized → “closest” replica

(Diagram: two sites, each with clients 1…m on a LAN, geo-replicated across the WAN.)

Network filesystems are the backbone of these infrastructures

Page 6

Why distribute the storage?

● Local disk pools:
○ several disks: no single hard drive can be big enough → aggregate disks
○ several nodes: some number crunching, and network, is required to look up and serve data → distribute the load
○ client scalability → serve many clients
○ on local pools, filesystem operations (read, write, mkdir, etc.) are synchronous
● Federated storage (scale is geographical):
○ a single site cannot contain all the data
○ move job processing close to the data, not vice versa → distributed data ⇔ distributed computing
○ filesystem operations are asynchronous

Page 7

Distributed storage solutions

● Every distributed storage has:
○ a backend which aggregates disks
○ a frontend which serves data over a network
● Many solutions:
○ Lustre, GPFS, GFS → popular in the Grid world
○ stackable, e.g.: aggregate with Lustre, serve with NFS
● NFS is not a distributed storage → it does not aggregate, it only serves over the network

Page 8

Levels of aggregation in Torino

● Hardware aggregation (RAID) of hard drives → virtual block devices (LUN: logical unit number)
● Software aggregation of block devices → each LUN is aggregated using Oracle Lustre:
○ a separate server keeps the "file information" (MDS: metadata server)
○ one or more servers are attached to the block devices (OSS: object storage servers)
○ quasi-vertical scalability → the "master" server (i.e., the MDS) is a bottleneck; more can be added (hard & critical work!)
● Global federation → the local filesystem is exposed through xrootd:
○ Torino's storage is part of a global federation
○ used by the ALICE experiment @ CERN
○ a global, external "file catalog" knows whether a file is in Torino or not

Page 9

What is GlusterFS

● Open-source, distributed network filesystem claiming to scale up to several petabytes and to handle many clients
● Horizontal scalability → workload distributed through "bricks"
● Reliability:
○ elastic management → maintenance operations are online (see the sketch below)
○ bricks can be added, removed, replaced without stopping the service
○ rebalance → when a new "brick" is added, it is filled to ensure an even distribution of data
○ self-healing on "replicated" volumes → a form of automatic failback & failover
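As an illustration of the elastic management described above, a minimal sketch with the GlusterFS 3.3 CLI; the volume name "testvol", the host "hyp05" and the brick path are hypothetical:

    # Add a new brick to an existing volume while it stays online
    gluster volume add-brick testvol hyp05:/bricks/disk1

    # Rebalance so existing data is spread onto the new brick
    gluster volume rebalance testvol start
    gluster volume rebalance testvol status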

Page 10

GlusterFS structure

● GlusterFS servers cross-communicate with no central manager → horizontal scalability (a CLI sketch follows the diagram)

(Diagram: a GlusterFS volume built from one brick per hypervisor, all connected peer-to-peer.)
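A minimal sketch of how such a volume could be assembled from the hypervisors' bricks with the standard gluster CLI; hostnames ("hypervisor1"…"hypervisor4"), volume name and brick paths are hypothetical:

    # Build the trusted pool: every server can probe the others, there is no central manager
    gluster peer probe hypervisor2
    gluster peer probe hypervisor3
    gluster peer probe hypervisor4

    # Create a volume out of one brick per hypervisor and start it
    gluster volume create gv0 hypervisor1:/bricks/b1 hypervisor2:/bricks/b1 \
                              hypervisor3:/bricks/b1 hypervisor4:/bricks/b1
    gluster volume start gv0
    gluster volume info gv0          # any peer can answer: there is no master node

    # Any client mounts the volume through the native FUSE client
    mount -t glusterfs hypervisor1:/gv0 /mnt/gv0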

Page 11

Internship activities

Page 12

Preliminary studies

● Verify compatibility of the GlusterFS precompiled packages (RPMs) on CentOS 5 and 6 for the production environment (see the sketch below)
● Packages not available for development versions: new functionalities were tested from source code (e.g. object storage)
● Tests on virtual machines (first on local VirtualBox, then on the INFN Torino OpenNebula cloud) http://opennebula.org/
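For reference, a sketch of the two installation paths mentioned above; it assumes a yum repository providing the upstream GlusterFS 3.3 RPMs is configured, and the source tarball name is a placeholder:

    # Precompiled packages on CentOS 5/6
    yum install glusterfs glusterfs-fuse glusterfs-server
    service glusterd start

    # Development versions without RPMs: build from a source tarball
    tar xzf glusterfs-3.3.x.tar.gz && cd glusterfs-3.3.x
    ./configure && make && make install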

Page 13

Types of benchmarks

● Generic stress benchmarks conducted on:
○ the super distributed prototype
○ pre-existing production volumes
● Specific stress benchmarks conducted on some types of GlusterFS volumes (e.g. replicated volumes)
● Application-specific tests:
○ high-energy physics analysis running on ROOT/PROOF

Page 14

Note

● Tests were conducted in two different circumstances:
a. a storage built for the sole purpose of testing: such a volume performs worse in the benchmarks than the infrastructure ones
b. production volumes, which were certainly subject to interference from concurrent processes

"Why perform these tests?"

Page 15

Motivations

● Verify the consistency of the "release notes" → test all the different volume types:
○ replicated
○ striped
○ distributed
● Test GlusterFS in a realistic environment → build a prototype as similar as possible to the production infrastructure

Page 16

Experimental setup

● GlusterFS v3.3 turned out to be stable after tests conducted both on VirtualBox and on OpenNebula VMs
● Next step: build an experimental "super distributed" prototype, a realistic testbed environment consisting of (a creation sketch follows):
○ 40 HDDs (500 GB each) → ~20 TB (1 TB ≃ 10^12 B)
○ GlusterFS installed on every hypervisor
○ each hypervisor mounting 2 HDDs → 1 TB each
○ all the hypervisors connected to each other (LAN)
● Software used for the benchmarks: bonnie++
○ a very simple to use read/write benchmark for disks
○ http://www.coker.com.au/bonnie++/
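A sketch of how the "super distributed" testbed volume could be created from two bricks per hypervisor; hostnames ("hyp01"…"hyp20"), brick paths and mount point are hypothetical:

    # Collect 2 bricks x 20 hypervisors into one brick list
    BRICKS=""
    for h in hyp{01..20}; do
      BRICKS="$BRICKS $h:/bricks/disk1 $h:/bricks/disk2"
    done

    # Aggregate them into one ~20 TB distributed volume and start it
    gluster volume create testvol $BRICKS
    gluster volume start testvol

    # Mount it on the client that runs bonnie++
    mount -t glusterfs hyp01:/testvol /mnt/testvol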

Page 17

Striped volume

(source: www.gluster.org)

● used in high concurrency environments accessing large files (in our case ~10 GB);

● useful to store large data sets, if they have to be accessed from multiple instances.
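A minimal creation sketch for a striped volume with the 3.3 CLI; the volume name, hostnames, brick paths and the stripe count of 4 are arbitrary, hypothetical choices:

    # Each file is split into chunks striped across 4 bricks
    gluster volume create stripevol stripe 4 \
        hyp01:/bricks/disk1 hyp02:/bricks/disk1 hyp03:/bricks/disk1 hyp04:/bricks/disk1
    gluster volume start stripevol
    mount -t glusterfs hyp01:/stripevol /mnt/stripevol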

Page 18

Striped volume / results

Volume | Avg. seq. write per block [MB/s] | Std. dev. [MB/s] | Avg. seq. rewrite [MB/s] | Std. dev. [MB/s] | Avg. seq. read per block [MB/s] | Std. dev. [MB/s]
striped | 38.6 | 1.3 | 23.0 | 3.6 | 44.7 | 1.3

Page 19

Striped volume / comments

● Second best result in write (per block), and the most stable one (lowest std. dev.)

> for i in {1..10}; do bonnie++ -d$SOMEPATH -s5000 -r2500 -f; done

Each test is repeated 10 times ({1..10}). -s5000 is the size of the written files in MB (at least double the RAM size); -r2500 is the machine RAM size in MB, although GlusterFS doesn't have any sort of file cache. The software used is bonnie++ v1.96.

Page 20

Replicated volume:
● used where high availability and high reliability are critical
● main task → create forms of redundancy: data availability matters more than high I/O performance
● requires a great amount of resources, both disk space and CPU (especially during the self-healing procedure); a creation sketch follows

(source: www.gluster.org)
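A minimal creation sketch for a two-way replicated volume; volume name, hostnames and brick paths are hypothetical, and replica 2 keeps a full copy of every file on both bricks:

    # Every file is written synchronously to both bricks
    gluster volume create repvol replica 2 \
        hyp01:/bricks/disk1 hyp02:/bricks/disk1
    gluster volume start repvol
    mount -t glusterfs hyp01:/repvol /mnt/repvol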

Page 21

Replicated volume:
● Self-healing feature: given N redundant servers, if at most (N-1) crash → services keep running on the volume ⇝ once restored, the servers get synchronized with the one(s) that didn't crash
● The self-healing feature was tested by turning servers off (even abruptly!) during I/O processes (see the sketch below)
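A sketch of the kind of check used after such a failure test, assuming the hypothetical replicated volume "repvol" from the previous slide; these heal commands exist from GlusterFS 3.3 onwards:

    # While I/O is running, one server is turned off (even abruptly), then brought back
    # List the files that still need to be re-synchronized on the restored server
    gluster volume heal repvol info

    # Force a full self-heal instead of waiting for the automatic one
    gluster volume heal repvol full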

Page 22

Replicated / results

Volume | Avg. seq. write per block [MB/s] | Std. dev. [MB/s] | Avg. seq. rewrite [MB/s] | Std. dev. [MB/s] | Avg. seq. read per block [MB/s] | Std. dev. [MB/s]
replicated | 35.5 | 2.5 | 19.1 | 16.1 | 52.2 | 7.1

Page 23

Replicated / comments

● Low rates in write and the best result in read → writes need to be synchronized, while read throughput benefits from multiple sources
● Very important for building stable volumes on critical nodes
● The "self-healing" feature worked fine: it uses all available cores during the resynchronization process, and it does so online (i.e. with no service interruption, only slowdowns!)

Page 24

Distributed volume:
● Files are spread across the bricks in a fashion that ensures uniform distribution (see the sketch below)
● A pure distributed volume is used only if redundancy is not required or lies elsewhere (e.g. RAID)
● With no redundancy, a disk/server failure can result in loss of data, but only some bricks are affected, not the whole volume!

(source: www.gluster.org)
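A small sketch showing the behaviour described above: on a plain distributed volume every file lands, whole, on exactly one brick. The mount point "/mnt/distvol", the brick path and the file names are hypothetical:

    # Write a few files through the client mount point
    for i in 1 2 3 4; do
      dd if=/dev/zero of=/mnt/distvol/file$i bs=1M count=100
    done

    # Run on each server: every brick holds only a subset of the files, each one complete
    ls -lh /bricks/disk1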

Page 25

Distributed / results

Volume | Avg. seq. write per block [MB/s] | Std. dev. [MB/s] | Avg. seq. rewrite [MB/s] | Std. dev. [MB/s] | Avg. seq. read per block [MB/s] | Std. dev. [MB/s]
distributed | 39.8 | 5.4 | 22.3 | 2.8 | 52.1 | 2.2

Page 26

Distributed / comments

● Best result in write and second best in read → a high-performance volume
● Since the volume is not striped and no high client concurrency was used, the full potential of GlusterFS is not exploited → done in subsequent tests

Some other tests were also conducted on mixed volume types (e.g. striped+replicated)

Page 27

Overall comparison

Page 28

Production volumes

● Tests conducted on two volumes used at the INFN Torino computing center: the VM images repository and the disk where running VMs are hosted
● Tests executed without interrupting production services → results are expected to be slightly influenced by concurrent computing activities (even if those were not network-intensive)

Page 29

Production volumes: Imagerepo

(Diagram: the Images Repository stores the virtual machine images img-1 … img-n; every hypervisor 1 … m mounts it over the network.)
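A sketch of how each hypervisor could mount the repository, either by hand or via /etc/fstab; the server name "storage01", the volume name "imagerepo" and the mount point are hypothetical:

    # One-shot mount on a hypervisor
    mount -t glusterfs storage01:/imagerepo /var/images

    # Equivalent persistent /etc/fstab entry
    # storage01:/imagerepo  /var/images  glusterfs  defaults,_netdev  0 0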

Page 30

Production volumes: Vmdir

(Diagram: four service hypervisors, each with an I/O stream to a shared GlusterFS volume.)

Page 31

Production volumes / Results

Page 32

Production volumes / Results (2)

Volume | Avg. seq. write per block [MB/s] | Std. dev. [MB/s] | Avg. seq. rewrite [MB/s] | Std. dev. [MB/s] | Avg. seq. read per block [MB/s] | Std. dev. [MB/s]
Image Repository | 64.4 | 3.3 | 38.0 | 0.4 | 98.3 | 2.3
Running VMs volume | 47.6 | 2.2 | 24.8 | 1.5 | 62.7 | 0.8

● Imagerepo is a distributed volume (GlusterFS → 1 brick)
● The running VMs volume is a replicated volume → worse performance, but the single point of failure is eliminated by replicating both disks and servers
● Both volumes perform better than the testbed ones → better underlying hardware resources

Page 33

PROOF test

● PROOF: a ROOT-based framework for interactive (non-batch, unlike the Grid) physics analysis, used by ALICE and ATLAS, officially part of the computing model
● Simulate a real use case → not artificial, with a storage made of 3 LUNs (over RAID5) of 17 TB each in distributed mode
● Many concurrent accesses: GlusterFS scalability is extensively exploited

Page 34

PROOF test / Results

● Optimal range of concurrent accesses: 84-96
● Plateau beyond the optimal range

Concurrent processes | MB/s
60 | 473
66 | 511
72 | 535
78 | 573
84 | 598
96 | 562
108 | 560

Page 35

Conclusions and possible developments

● GlusterFS v3.3.1 was considered stable and satisfied all the prerequisites required of a network filesystem → the upgrade was performed and is currently in use!

● Run some more tests (e.g. in different use cases)

● Look at the next developments in GlusterFS v3.4.x → probably improvements and integration with QEMU/KVM

http://www.gluster.org/2012/11/integration-with-kvmqemu

Page 36

Thanks for your attention

Thanks to:
● Prof. Massimo Masera
● Stefano Bagnasco
● Dario Berzano

Page 37

Backup slides

Page 38

GlusterFS actors

(source: www.gluster.org)

Page 39

Conclusions: overall comparison

Page 40

Striped + Replicated volume:

● it stripes data across replicated bricks in the cluster;

● one should use striped replicated volumes in highly concurrent environments where there is parallel access to very large files and performance is critical.
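A minimal creation sketch for a striped replicated volume; names are hypothetical, and stripe 2 × replica 2 requires a multiple of 4 bricks:

    # Data is striped across two replica pairs: 2 (stripe) x 2 (replica) = 4 bricks
    gluster volume create strepvol stripe 2 replica 2 \
        hyp01:/bricks/disk1 hyp02:/bricks/disk1 \
        hyp03:/bricks/disk1 hyp04:/bricks/disk1
    gluster volume start strepvol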

Page 41

Striped + replicated / results

Volume | Avg. seq. output per block [MB/s] | Std. dev. [MB/s] | Avg. seq. output rewrite [MB/s] | Std. dev. [MB/s] | Avg. seq. input per block [MB/s] | Std. dev. [MB/s]
striped+replicated | 31.0 | 0.3 | 18.4 | 4.7 | 44.5 | 1.6

Page 42

Striped + replicated / comments

● Tests conducted on these volumes always involved one I/O process at a time, so it is quite normal that a volume type designed for highly concurrent environments appears less performant.

● It still keeps decent I/O rates.

Page 43

Imagerepo / results

Volume | Avg. seq. output per block [MB/s] | Std. dev. [MB/s] | Avg. seq. output rewrite [MB/s] | Std. dev. [MB/s] | Avg. seq. input per block [MB/s] | Std. dev. [MB/s]
imagerepo | 98.3 | 3.3 | 38.0 | 0.4 | 64.4 | 2.3

Page 44

Imagerepo / comments

● The input and output (per block) tests gave high values compared with the previous tests, due to the greater availability of resources.
● Imagerepo is the repository where the images of virtual machines, ready to be cloned and started in vmdir, are stored.
● It is very important that this repository is always up in order to avoid data loss, so creating a replicated repository is recommended.

Page 45

Vmdir / results

Volume | Avg. seq. output per block [MB/s] | Std. dev. [MB/s] | Avg. seq. output rewrite [MB/s] | Std. dev. [MB/s] | Avg. seq. input per block [MB/s] | Std. dev. [MB/s]
vmdir | 47.6 | 2.2 | 24.8 | 1.5 | 62.7 | 0.8

Page 46

vmdir / comments

● These results are worse than imagerepo's, but still better than the first three (test-volume).
● It is a volume shared by two servers to 5 machines hosting the virtual machine instances, so it is very important that this volume doesn't crash.
● It is the best candidate to become a replicated+striped+distributed volume.

Page 47

(Diagram: a GlusterFS volume built from one brick per hypervisor, all connected peer-to-peer; same figure as page 10.)

Page 48

from: Gluster_File_System-3.3.0-Administration_Guide-en-US (see more at: www.gluster.org/community/documentation)