EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.
Andrew File System
Let's start with a familiar example: the Andrew environment.
- 10,000s of machines, 10,000s of people, terabytes of disk
Goal: have a consistent namespace for files across computers. Allow any authorized user to access their files from any computer.
[Figure: many client machines sharing a collection of disk servers]
AFS summary
Client-side caching is a fundamental technique for improving scalability and performance, but it raises important questions of cache consistency.
Timeouts and callbacks are common methods for providing (some forms of) consistency.
AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users) and performance. The AFS authors argued that applications with highly concurrent, shared access, such as databases, needed a different model.
Today's Lecture
- Other types of DFS
- Coda: disconnected operation
- Programming assignment 4
Background
We are back in the 1990s. The network is slow and unstable. Clients have gone from terminals to "powerful" clients: 33MHz CPU, 16MB RAM, 100MB hard drive.
Mobile users appeared: the first IBM ThinkPad shipped in 1992. We can do work at the client without the network.
CODA
Successor of the very successful Andrew File System (AFS).
AFS was the first DFS aimed at a campus-sized user community. Its key ideas include close-to-open consistency and callbacks.
Hardware Model
CODA and AFS assume that client workstations are personal computers controlled by their user/owner: fully autonomous, and not to be trusted.
CODA allows owners of laptops to operate them in disconnected mode, the opposite of ubiquitous connectivity.
Accessibility
Must handle two types of failures:
- Server failures: data servers are replicated
- Communication failures and voluntary disconnections: Coda uses optimistic replication and file hoarding
Design Rationale
Scalability:
- Callback cache coherence (inherited from AFS)
- Whole-file caching
Portable workstations:
- User's assistance in cache management
Design Rationale - Replica Control
Pessimistic: disable all partitioned writes; require a client to acquire control of a cached object prior to disconnection.
Optimistic: assume no one else is touching the file.
- Con: requires sophisticated conflict detection
- Pro: in practice, write-sharing is rare in Unix
- Pro: high availability; access anything in range, without locks
Pessimistic Replica Control
Would require the client to acquire exclusive (RW) or shared (R) control of cached objects before accessing them in disconnected mode.
- Acceptable solution for voluntary disconnections
- Does not work for involuntary disconnections: what if the laptop remains disconnected for a long time?
Leases
We could grant exclusive/shared control of the cached objects for a limited amount of time.
- Works very well in connected mode; reduces server workload
- The server can keep leases in volatile storage, as long as their duration is shorter than reboot time
- Would only work for very short disconnection periods
Optimistic Replica Control (I)
Optimistic replica control allows access even in disconnected mode.
- Tolerates temporary inconsistencies
- Promises to detect them later
- Provides much higher data availability
Optimistic Replica Control (II)
Defines an accessible universe: the set of replicas that the user can access. The accessible universe varies over time.
At any time, the user:
- Reads from the latest replica(s) in their accessible universe
- Updates all replicas in their accessible universe
Coda (Venus) States
1. Hoarding: normal operation mode
2. Emulating: disconnected operation mode
3. Reintegrating: propagates changes and detects inconsistencies
[State diagram: Hoarding, Emulating, and Reintegrating, with transitions between them]
Hoarding
Hoard useful data in anticipation of disconnection. Balance the needs of connected and disconnected operation:
- Cache size is restricted
- Disconnections are unpredictable
A prioritized algorithm manages the cache; hoard walking periodically reevaluates objects.
Prioritized algorithm
- User-defined hoard priority p: how interesting is the object?
- Recent usage q
- Object priority = f(p, q); kick out the object with the lowest priority
+ Fully tunable: everything can be customized
- But not really tunable in practice(?): no guidance on how to customize it
Emulation
In emulation mode:
- Attempts to access files that are not in the client cache appear as failures to the application
- All changes are written to a persistent log, the client modification log (CML)
Persistence
Venus keeps its cache and related data structures in non-volatile storage.
Reintegration
When the workstation gets reconnected, Coda initiates a reintegration process:
- Performed one volume at a time
- Venus ships the replay log to all volumes
- Each volume performs a log replay algorithm; only write/write conflicts matter
Did it succeed? Yes: free the logs and reset priorities. No: save the logs to a tar file and ask the user for help.
Performance
Duration of reintegration:
- A few hours of disconnection reintegrate in about 1 minute, but sometimes much longer
Cache size:
- 100MB at the client is enough for a "typical" workday
Conflicts:
- Essentially no conflicts at all! Why? Over 99% of modifications are by the same person, and two users modifying the same object within a day happens in under 0.75% of cases
Coda Summary
- Puts scalability and availability before data consistency (unlike NFS)
- Assumes that inconsistent updates are very infrequent
- Introduced disconnected operation mode and file hoarding
Today's Lecture
- Other types of DFS
- Coda: disconnected operation
- Programming assignment 4
Filesystems
Last time: we looked at how we could use RPC to split filesystem functionality between client and server.
But we didn't really change the design: we just moved the entire filesystem to the server and then added some caching on the client in various ways.
You can go farther, but it requires ripping the filesystem functionality apart into modules and placing those modules at different computers on the network.
So now we need to ask: what does a filesystem do, anyway?
Well, there's a disk. Disks store bits, in fixed-length pieces called sectors or blocks. But a filesystem has files. And often directories. And maybe permissions, creation and modification times, and other stuff about the files ("metadata").
Filesystem functionality
- Directory management: maps entries in a hierarchy of names to files on disk
- File management: manages adding to, reading, changing, appending to, and deleting individual files
- Space management: where on disk to store these things?
- Metadata management
Conventional filesystem
Wraps all of these up together. Useful concepts:
- "Superblock": a well-known location on disk where top-level filesystem info is stored (pointers to more structures, etc.)
- "Free list" or "free space bitmap": data structures to remember what's used on disk and what's not. Why? Fast allocation of space for new files.
- "inode" (short for index node): stores all metadata about a file, plus information pointing to where the file is stored on disk. Small files may be referenced entirely from the inode; larger files may have some indirection to blocks that list locations on disk. Directory entries point to inodes.
- "extent": a way of remembering where on disk a file is stored. Instead of listing all blocks, list a starting block and a range. A more compact representation, but it requires large contiguous block allocation.
Filesystem "VFS" ops
VFS ("virtual filesystem"): a common abstraction layer inside kernels for building filesystems; the interface is common across FS implementations.
Think of this as an abstract data type for filesystems: it has both syntax (function names, return values, etc.) and semantics ("don't block on this call", etc.).
One key thing to note: the VFS itself may do some caching and other management; in particular, it often maintains an inode cache.
FUSE
The lab will use FUSE. FUSE is a way to implement filesystems in user space (as normal programs), but have them available through the kernel, like normal files. It has a kinda VFS-like interface.
[Figure from FUSE documentation]
Directory operations
- readdir(path): return directory entries for each file in the directory
- mkdir(path): create a new directory
- rmdir(path): remove the named directory
File operations
- mknod(path, mode, dev): create a new "node" (generic: a file is one type of node; a device node is another)
- unlink(path): remove a link to an inode, decrementing the inode's reference count. Many filesystems permit "hard links": multiple directory entries pointing to the same file
- rename(path, newpath)
- open: open a file, returning a file handle; read; write
- truncate: cut a file off at a particular length
- flush: close one handle to an open file
- release: completely close a file handle
Metadata ops
- getattr(path): return a metadata struct
- chmod / chown: ownership & permissions
Back to goals of DFS
Users should have the same view of the system and be able to share files.
Last time:
- A central fileserver handled all filesystem operations: consistency was easy, but overhead was high and scalability poor
- Moved to NFS and then AFS: added more and more caching at the client, which added cache consistency problems, solved using timeouts or callbacks to expire cached contents
Protocol & consistency
Remember last time: NFS defined operations to occur on unique inode #s instead of names. Why? Idempotency: we wanted operations to be unique.
A related example for today, as we consider splitting up components: moving a file from one directory to another. What if this is a complex operation ("remove from one", "add to another")? Can another user see intermediate state (e.g., the file in both directories, or in neither)?
Last time we saw the issue of when things become consistent, and presented close-to-open consistency as a compromise.
Scaling beyond...
What happens if you want to build AFS for all of CMU? More disks than one machine can handle; more users than one machine can handle.
Simplest idea: partition users onto different servers.
- How do we handle a move across servers?
- How do we divide the users? Statically? What about load balancing, both for operations and for space? What if some files become drastically more popular?
"Cluster" filesystems
The lab is inspired by Frangipani, a scalable distributed filesystem.
Think back to our list of things that filesystems have to do:
- Concurrency management
- Space allocation and data storage
- Directory management and naming
Frangipani design
[Layer diagram, top to bottom:]
- Program
- Frangipani file server (with a distributed lock service alongside)
- Petal distributed virtual disk
- Physical disks
Petal aggregates many disks (across many machines) into one big "virtual disk": a simplifying abstraction for both design and implementation. It exports extents and provides allocation, deallocation, etc.
Internally, it maps (virtual disk, offset) to (server, physical disk, offset).
Frangipani stores all data (inodes, directories, data) in Petal, and uses the lock server for consistency (e.g., when creating a file).
Compare with NFS/AFS
In NFS/AFS, clients just relay all FS calls to a central server. Here, clients run enough code to know which server to direct things to; they are active participants in the filesystem.
(N.b.: you could, of course, use the Frangipani/Petal design to build a scalable NFS server, and in fact similar techniques are how a lot of them actually are built. See the upcoming lecture on RAID, though: replication and redundancy management become key.)
Lab 2: YFS
Yet-another File System. :) A simpler version of what we just talked about: only one extent server (you don't have to implement Petal) and a single lock server.
- Each server is written in C++
- yfs_client interfaces with the OS through FUSE
- The following labs will build YFS incrementally, starting with the lock server and building up through supporting file & directory ops distributed around the network
Warning
This lab is difficult. It assumes a bit more C++ than lab 1 did. Please please please get started early; ask course staff for help.
It will not destroy you; it will make you stronger. But it may well take a lot of work and be pretty intensive.
Remember this slide?
We are back in the 1990s. The network is slow and unstable. Clients have gone from terminals to "powerful" clients: 33MHz CPU, 16MB RAM, 100MB hard drive. Mobile users appear: the first IBM ThinkPad in 1992.
What about now?
We are in the 2000s now. The network is fast and reliable on a LAN. The "powerful" client is now a very powerful client: 2.4GHz CPU, 1GB RAM, 120GB hard drive. Mobile users are everywhere.
Do we still need disconnection? How many people are using Coda?
Do we still need disconnection?
WAN and wireless links are still slow and not very reliable.
PDAs are not very powerful: 200MHz StrongARM, 128MB CF card, constrained electric power.
LBFS (MIT) targets WANs; Coda and Odyssey (CMU) target mobile users. Adaptation is also important.
What is the future?
High-bandwidth, reliable wireless everywhere. Even PDAs are powerful: 2GHz, 1GB RAM/Flash.
What will be the research topic in FS? P2P?
Today's Lecture
- DFS design comparisons continued
  - Topic 4: file access consistency (NFS, AFS, Sprite, and DCE DFS)
  - Topic 5: locking
- Other types of DFS
  - Coda: disconnected operation
  - LBFS: weakly connected operation
Low Bandwidth File System: Key Ideas
A network file system for slow or wide-area networks.
- Exploits similarities between files, or between versions of the same file
- Avoids sending data that can be found in the server's file system or the client's cache
- Also uses conventional compression and caching
- Requires 90% less bandwidth than traditional network file systems
Working on slow networks
- Make local copies: must worry about update conflicts
- Use remote login: only works for text-based applications
- Use LBFS instead: better than remote login, but must deal with issues like auto-saves blocking the editor for the duration of a transfer
LBFS design
Provides close-to-open consistency. Uses a large, persistent file cache at the client, which stores the client's working set of files.
The LBFS server divides the files it stores into chunks and indexes the chunks by hash value; the client similarly indexes its file cache.
Exploits similarities between files: LBFS never transfers chunks that the recipient already has.
Indexing
Uses the SHA-1 algorithm for hashing; it is collision resistant.
The central challenge in indexing file chunks is keeping the index at a reasonable size while dealing with shifting offsets:
- Indexing the hashes of fixed-size data blocks keeps the index small, but an insertion shifts every later block
- Indexing the hashes of all overlapping blocks at all offsets handles shifts, but the index becomes enormous
LBFS indexing solution
Considers only non-overlapping chunks, and sets chunk boundaries based on file contents rather than on position within the file.
Examines every overlapping 48-byte region of the file, using Rabin fingerprints, to select the boundary regions, called breakpoints. When the low-order 13 bits of a region's fingerprint equal a chosen value, the region constitutes a breakpoint.
Effects of edits on file chunks
[Figure: chunks of a file before/after edits. Grey shading shows edits; stripes show 48-byte regions with magic hash values creating chunk boundaries.]
More Indexing Issues
Pathological cases:
- Very small chunks: sending the hashes of the chunks would consume as much bandwidth as just sending the file
- Very large chunks: cannot be sent in a single RPC
LBFS imposes minimum and maximum chunk sizes.
The Chunk Database
Indexes each chunk by the first 64 bits of its SHA-1 hash.
To avoid synchronization problems, LBFS always recomputes the SHA-1 hash of any data chunk before using it; this simplifies crash recovery. The recomputed SHA-1 values are also used to detect hash collisions in the database.
Conclusion
Under normal circumstances, LBFS consumes 90% less bandwidth than traditional file systems. It makes transparent remote file access a viable and less frustrating alternative to running interactive programs on remote machines.