EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.


Page 1: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

EE324 INTRO TO DISTRIBUTED SYSTEMS

L-20 More DFS

Page 2: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Andrew File System
 Let's start with a familiar example: "andrew", serving 10,000s of machines and 10,000s of people, backed by terabytes of disk
 Goal: have a consistent namespace for files across computers; allow any authorized user to access their files from any computer

Page 3: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

AFS summary
 Client-side caching is a fundamental technique to improve scalability and performance
 But it raises important questions of cache consistency
 Timeouts and callbacks are common methods for providing (some forms of) consistency
 AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users), performance, etc.
 The AFS authors argued that apps with highly concurrent, shared access, like databases, needed a different model
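To make close-to-open consistency concrete, here is a minimal sketch (not real AFS code; `Server`, `Client`, and the in-process map are invented for illustration): `open()` fetches a whole-file copy, writes touch only the local copy, and `close()` flushes it back, so another client sees the update only if it opens after the writer's close.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative close-to-open consistency: the "server" is an in-process map.
struct Server {
    std::map<std::string, std::string> files;  // path -> contents
};

struct Client {
    Server* srv;
    std::map<std::string, std::string> cache;  // private whole-file copies

    // open(): fetch the current server copy (whole-file caching).
    void open(const std::string& path) { cache[path] = srv->files[path]; }
    std::string read(const std::string& path) { return cache.at(path); }
    // write(): modify only the local copy.
    void write(const std::string& path, const std::string& data) { cache[path] = data; }
    // close(): flush the copy back to the server.
    void close(const std::string& path) {
        srv->files[path] = cache.at(path);
        cache.erase(path);
    }
};
```

A reader that opened the file before the writer closed keeps seeing the old contents; a reader that opens afterward sees the new ones — which is exactly why concurrent-write workloads like databases fit this model poorly.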

Page 4: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Today's Lecture
 Other types of DFS
  Coda – disconnected operation
 Programming assignment 4

Page 5: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Background
 We are back to the 1990s. The network is slow and not stable
 Terminal -> "powerful" client: 33MHz CPU, 16MB RAM, 100MB hard drive
 Mobile users appeared: 1st IBM ThinkPad in 1992
 We can do work at the client without the network

Page 6: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

CODA
 Successor of the very successful Andrew File System (AFS)
 AFS
  First DFS aimed at a campus-sized user community
  Key ideas include close-to-open consistency and callbacks

Page 7: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Hardware Model
 CODA and AFS assume that client workstations are personal computers controlled by their user/owner
  Fully autonomous
  Cannot be trusted
 CODA allows owners of laptops to operate them in disconnected mode
  Opposite of ubiquitous connectivity

Page 8: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Accessibility
 Must handle two types of failures
  Server failures: data servers are replicated
  Communication failures and voluntary disconnections: Coda uses optimistic replication and file hoarding

Page 9: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Design Rationale
 Scalability
  Callback cache coherence (inherited from AFS)
  Whole-file caching
 Portable workstations
  User's assistance in cache management

Page 10: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Design Rationale -- Replica Control
 Pessimistic
  Disable all partitioned writes
  Requires a client to acquire control of a cached object prior to disconnection
 Optimistic
  Assumes no one else is touching the file
  (-) requires sophistication: conflict detection
  (+) fact: low write-sharing in Unix
  (+) high availability: access anything in range, without locks

Page 11: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Pessimistic Replica Control
 Would require the client to acquire exclusive (RW) or shared (R) control of cached objects before accessing them in disconnected mode
 Acceptable solution for voluntary disconnections
 Does not work for involuntary disconnections
  What if the laptop remains disconnected for a long time?

Page 12: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Leases
 We could grant exclusive/shared control of the cached objects for a limited amount of time
 Works very well in connected mode
  Reduces server workload
  Server can keep leases in volatile storage as long as their duration is shorter than boot time
 Would only work for very short disconnection periods
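A lease is just control with an expiry attached. A minimal sketch (the `LeaseTable` structure and its method names are invented, not from any real lease implementation):

```cpp
#include <cassert>
#include <map>
#include <string>

// Server-side lease table: object name -> absolute expiry time (seconds).
struct LeaseTable {
    std::map<std::string, long> expiry;

    // Grant control of obj until now + duration.
    void grant(const std::string& obj, long now, long duration) {
        expiry[obj] = now + duration;
    }
    // A client may act on obj only while its lease is unexpired.
    bool valid(const std::string& obj, long now) const {
        auto it = expiry.find(obj);
        return it != expiry.end() && now < it->second;
    }
};
```

Because every lease expires on its own, the server can keep this table in volatile memory: after a crash, any lease a client still holds will have expired by the time the server finishes rebooting, provided the maximum lease duration is shorter than the boot time.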

Page 13: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Optimistic Replica Control (I)
 Optimistic replica control allows access even in disconnected mode
 Tolerates temporary inconsistencies
 Promises to detect them later
 Provides much higher data availability

Page 14: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Optimistic Replica Control (II)
 Defines an accessible universe: the set of replicas that the user can access
 The accessible universe varies over time
 At any time, the user
  Will read from the latest replica(s) in his accessible universe
  Will update all replicas in his accessible universe

Page 15: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Coda (Venus) States
1. Hoarding: normal operation mode
2. Emulating: disconnected operation mode
3. Reintegrating: propagates changes and detects inconsistencies
(State diagram: Venus cycles between Hoarding, Emulating, and Reintegrating.)

Page 16: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Hoarding
 Hoard useful data for disconnection
 Balance the needs of connected and disconnected operation
  Cache size is restricted
  Unpredictable disconnections
 Prioritized algorithm -- cache management
 Hoard walking -- re-evaluate objects

Page 17: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Prioritized algorithm
 User-defined hoard priority p: how interesting is the object?
 Recent usage q
 Object priority = f(p, q)
 Kick out the one with the lowest priority
 (+) Fully tunable: everything can be customized
 (-) Not tunable (?): no idea how to customize
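A sketch of the eviction rule above (the struct, the weights, and the particular f(p, q) are invented -- Coda deliberately leaves f tunable):

```cpp
#include <cassert>
#include <algorithm>
#include <string>
#include <vector>

// Each cached object carries a user-assigned hoard priority p and a
// recent-usage score q; a hoard walk evicts the lowest-priority object.
struct CachedObject {
    std::string name;
    int hoard_priority;   // p: user-defined interest
    int recent_usage;     // q: recency score
};

// One possible combining function f(p, q) -- an assumption, not Coda's.
int priority(const CachedObject& o) {
    return 2 * o.hoard_priority + o.recent_usage;
}

// Return the name of the lowest-priority object (the eviction victim).
std::string victim(const std::vector<CachedObject>& cache) {
    auto it = std::min_element(cache.begin(), cache.end(),
        [](const CachedObject& a, const CachedObject& b) {
            return priority(a) < priority(b);
        });
    return it->name;
}
```

The "(+) fully tunable / (-) no idea how to customize" tension is visible here: any weighting of p against q is defensible, and the user has little guidance for picking one.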

Page 18: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Emulation
 In emulation mode:
  Attempts to access files that are not in the client cache appear as failures to the application
  All changes are written to a persistent log, the client modification log (CML)

Page 19: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Persistence
 Venus keeps its cache and related data structures in non-volatile storage

Page 20: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Reintegration
 When the workstation gets reconnected, Coda initiates a reintegration process
  Performed one volume at a time
  Venus ships the replay log to all volumes
  Each volume performs a log replay algorithm
  Only write/write conflicts matter
 Succeeded?
  Yes: free the logs, reset priority
  No: save the logs to a tar file; ask for help
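The replay-with-conflict-detection step can be sketched like this (the `CmlRecord`/`ServerFile` structures and the version-number scheme are invented stand-ins, not Coda's actual CML format): each log record remembers the version the client cached before writing, and the server flags a write/write conflict whenever its own version has moved past that base.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One entry of a (simplified) client modification log.
struct CmlRecord {
    std::string path;
    int base_version;        // version the client cached before its write
    std::string new_data;
};

struct ServerFile { int version; std::string data; };

// Replay the log against the server; return paths with write/write conflicts.
std::vector<std::string> replay(std::map<std::string, ServerFile>& server,
                                const std::vector<CmlRecord>& log) {
    std::vector<std::string> conflicts;
    for (const auto& rec : log) {
        ServerFile& f = server[rec.path];
        if (f.version != rec.base_version) {
            conflicts.push_back(rec.path);  // someone else wrote meanwhile
        } else {
            f.data = rec.new_data;          // clean replay: apply and bump
            f.version++;
        }
    }
    return conflicts;
}
```

A clean replay frees the log; a conflicted path is set aside for the user, mirroring the "save logs, ask for help" branch above.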

Page 21: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Performance
 Duration of reintegration
  A few hours of disconnection -> about 1 minute
  But sometimes much longer
 Cache size
  100MB at the client is enough for a "typical" workday
 Conflicts
  No conflicts at all! Why?
  Over 99% of modifications are by the same person
  Two users modifying the same object within a day: <0.75%

Page 22: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Coda Summary
 Puts scalability and availability before data consistency
  Unlike NFS
 Assumes that inconsistent updates are very infrequent
 Introduced disconnected operation mode and file hoarding

Page 23: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Today's Lecture
 Other types of DFS
  Coda – disconnected operation
 Programming assignment 4

Page 24: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Filesystems
 Last time: looked at how we could use RPC to split filesystem functionality between client and server
 But we didn't really change the design
  We just moved the entire filesystem to the server and then added some caching on the client in various ways

Page 25: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

You can go farther...
 But it requires ripping apart the filesystem functionality into modules and placing those modules at different computers on the network
 So now we need to ask... what does a filesystem do, anyway?

Page 26: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Well, there's a disk...
 Disks store bits, in fixed-length pieces called sectors or blocks
 But a filesystem has... files. And often directories. And maybe permissions, creation and modification times, and other stuff about the files ("metadata")

Page 27: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Filesystem functionality
 Directory management (maps entries in a hierarchy of names to files on disk)
 File management (adding, reading, changing, appending, deleting individual files)
 Space management: where on disk to store these things?
 Metadata management

Page 28: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Conventional filesystem
 Wraps all of these up together
 Useful concepts:
  "Superblock" -- well-known location on disk where top-level filesystem info is stored (pointers to more structures, etc.)
  "Free list" or "free space bitmap" -- data structures to remember what's used on disk and what's not. Why? Fast allocation of space for new files.
  "inode" -- short for index node -- stores all metadata about a file, plus information pointing to where the file is stored on disk
   Small files may be referenced entirely from the inode; larger files may have some indirection to blocks that list locations on disk
   Directory entries point to inodes
  "extent" -- a way of remembering where on disk a file is stored. Instead of listing all blocks, list a starting block and a range. A more compact representation, but requires large contiguous block allocation.

Page 29: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Filesystem "VFS" ops
 VFS ("virtual filesystem"): common abstraction layer inside kernels for building filesystems -- the interface is common across FS implementations
 Think of this as an abstract data type for filesystems
  Has both syntax (function names, return values, etc.) and semantics ("don't block on this call", etc.)
 One key thing to note: the VFS itself may do some caching and other management... in particular, it often maintains an inode cache

Page 30: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

FUSE
 The lab will use FUSE
 FUSE is a way to implement filesystems in user space (as normal programs), but have them available through the kernel -- like normal files
 It has a kinda VFS-like interface

Page 31: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Figure from FUSE documentation

Page 32: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Directory operations
 readdir(path) -- return directory entries for each file in the directory
 mkdir(path) -- create a new directory
 rmdir(path) -- remove the named directory

Page 33: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

File operations
 mknod(path, mode, dev) -- create a new "node" (generic: a file is one type of node; a device node is another)
 unlink(path) -- remove a link to an inode, decrementing the inode's reference count
  Many filesystems permit "hard links" -- multiple directory entries pointing to the same file
 rename(path, newpath)
 open -- open a file, returning a file handle
 read, write
 truncate -- cut off at a particular length
 flush -- close one handle to an open file
 release -- completely close a file handle
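Why unlink() decrements a count rather than deleting outright: with hard links, several directory entries can point at one inode, and its storage is reclaimed only when the last link disappears. A toy sketch (the `MiniFs`/`Inode` structures are invented for illustration, not a real VFS):

```cpp
#include <cassert>
#include <map>
#include <string>

struct Inode { int nlink = 0; bool freed = false; };

struct MiniFs {
    std::map<int, Inode> inodes;          // inode number -> inode
    std::map<std::string, int> dirents;   // path -> inode number

    // link(): add a directory entry and bump the inode's link count.
    void link(const std::string& path, int ino) {
        dirents[path] = ino;
        inodes[ino].nlink++;
    }
    // unlink(): drop the entry; free the inode when the last link goes.
    void unlink(const std::string& path) {
        int ino = dirents.at(path);
        dirents.erase(path);
        if (--inodes[ino].nlink == 0) inodes[ino].freed = true;
    }
};
```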

Page 34: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Metadata ops
 getattr(path) -- return a metadata struct
 chmod / chown (ownership & perms)

Page 35: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Back to goals of DFS
 Users should have the same view of the system and be able to share files
 Last time:
  Central fileserver handles all filesystem operations -- consistency was easy, but overhead high and scalability poor
  Moved to NFS and then AFS: added more and more caching at the client; added cache consistency problems
   Solved using timeouts or callbacks to expire cached contents

Page 36: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Protocol & consistency
 Remember last time: NFS defined operations to occur on unique inode #s instead of names... why? Idempotency. Wanted operations to be unique.
 Related example for today, when we're considering splitting up components: moving a file from one directory to another
  What if this is a complex operation ("remove from one", "add to another")? Can another user see intermediate state? (e.g., file in both directories, or file in neither?)
 Last time: saw the issue of when things become consistent
  Presented the idea of close-to-open consistency as a compromise

Page 37: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Scaling beyond...
 What happens if you want to build AFS for all of CMU? More disks than one machine can handle; more users than one machine can handle
 Simplest idea: partition users onto different servers
  How do we handle a move across servers?
  How to divide the users? Statically? What about load balancing for operations & for space?
  Some files become drastically more popular?

Page 38: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

"Cluster" filesystems
 Lab inspired by Frangipani, a scalable distributed filesystem
 Think back to our list of things that filesystems have to do
  Concurrency management
  Space allocation and data storage
  Directory management and naming

Page 39: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Frangipani design
 Layered stack (from the figure): programs sit on the Frangipani file server, which uses a distributed lock service and the Petal distributed virtual disk, which in turn sits on the physical disks
 Petal aggregates many disks (across many machines) into one big "virtual disk"
  Simplifying abstraction for both design & implementation
  Exports extents -- provides allocation, deallocation, etc.
  Internally: maps (virtual disk, offset) to (server, physical disk, offset)
 Frangipani stores all data (inodes, directories, data) in Petal; uses the lock server for consistency (e.g., creating a file)

Page 40: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Consequential design

Page 41: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Compare with NFS/AFS
 In NFS/AFS, clients just relay all FS calls to the server; a central server
 Here, clients run enough code to know which server to direct things to; they are active participants in the filesystem
 (n.b. -- you could, of course, use the Frangipani/Petal design to build a scalable NFS server -- and, in fact, similar techniques are how a lot of them actually are built. See the upcoming lecture on RAID, though: replication and redundancy management become key)

Page 42: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Lab 2: YFS
 Yet-another File System. :)
 A simpler version of what we just talked about: only one extent server (you don't have to implement Petal); a single lock server

Page 43: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

 Each server written in C++
 yfs_client interfaces with the OS through FUSE
 Following labs will build YFS incrementally, starting with the lock server and building up through supporting file & directory ops distributed around the network

Page 44: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Warning
 This lab is difficult
 It assumes a bit more C++ than lab 1 did
 Please please please get started early; ask course staff for help
 It will not destroy you; it will make you stronger. But it may well take a lot of work and be pretty intensive.

Page 45: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.


Page 46: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Remember this slide?
 We are back to the 1990s. The network is slow and not stable
 Terminal -> "powerful" client: 33MHz CPU, 16MB RAM, 100MB hard drive
 Mobile users appear: 1st IBM ThinkPad in 1992

Page 47: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

What about now?
 We are in the 2000s now. The network is fast and reliable in the LAN
 "Powerful" client -> very powerful client: 2.4GHz CPU, 1GB RAM, 120GB hard drive
 Mobile users everywhere
 Do we still need disconnection?
  How many people are using Coda?

Page 48: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Do we still need disconnection?
 WAN and wireless are not very reliable, and are slow
 PDAs are not very powerful
  200MHz StrongARM, 128MB CF card
  Electric-power constrained
 LBFS (MIT) on the WAN; Coda and Odyssey (CMU) for mobile users
 Adaptation is also important

Page 49: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

What is the future?
 High-bandwidth, reliable wireless everywhere
 Even PDAs are powerful: 2GHz, 1GB RAM/flash
 What will be the research topic in FS? P2P?

Page 50: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Today's Lecture
 DFS design comparisons continued
  Topic 4: file access consistency -- NFS, AFS, Sprite, and DCE DFS
  Topic 5: locking
 Other types of DFS
  Coda -- disconnected operation
  LBFS -- weakly connected operation

Page 51: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Low Bandwidth File System: Key Ideas
 A network file system for slow or wide-area networks
 Exploits similarities between files or versions of the same file
  Avoids sending data that can be found in the server's file system or the client's cache
 Also uses conventional compression and caching
 Requires 90% less bandwidth than traditional network file systems

Page 52: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Working on slow networks
 Make local copies
  Must worry about update conflicts
 Use remote login
  Only for text-based applications
 Use LBFS instead
  Better than remote login
  Must deal with issues like auto-saves blocking the editor for the duration of a transfer

Page 53: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

LBFS design
 Provides close-to-open consistency
 Uses a large, persistent file cache at the client
  Stores the client's working set of files
 The LBFS server divides the files it stores into chunks and indexes the chunks by hash value
 The client similarly indexes its file cache
 Exploits similarities between files
  LBFS never transfers chunks that the recipient already has
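The "never transfer chunks the recipient already has" idea reduces to a simple exchange -- sender ships the list of chunk hashes, the recipient answers with the hashes it lacks, and only those chunks cross the wire. A sketch (the function and its types are invented for illustration, not LBFS's RPC interface):

```cpp
#include <set>
#include <string>
#include <vector>

// Given the sender's ordered chunk hashes and the recipient's index of
// hashes it already stores, return the hashes whose data must be sent.
std::vector<std::string> missing_chunks(
        const std::vector<std::string>& sender_hashes,
        const std::set<std::string>& recipient_index) {
    std::vector<std::string> need;
    for (const auto& h : sender_hashes)
        if (recipient_index.count(h) == 0) need.push_back(h);
    return need;
}
```

Since hashes are far smaller than chunks, a file that shares most of its content with something already cached costs almost no bandwidth to transfer.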

Page 54: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Indexing
 Uses the SHA-1 algorithm for hashing
  It is collision resistant
 The central challenge in indexing file chunks is keeping the index at a reasonable size while dealing with shifting offsets
  Indexing the hashes of fixed-size data blocks (small index, but an insertion shifts every later block)
  Indexing the hashes of all overlapping blocks at all offsets (handles shifts, but the index is huge)

Page 55: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

LBFS indexing solution
 Considers only non-overlapping chunks
 Sets chunk boundaries based on file contents rather than on position within a file
 Examines every overlapping 48-byte region of the file to select boundary regions, called breakpoints, using Rabin fingerprints
  When the low-order 13 bits of a region's fingerprint equal a chosen value, the region constitutes a breakpoint
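Content-defined chunking can be sketched as follows. Note the hedge: a real implementation uses Rabin fingerprints over each 48-byte window; here a trivial rolling byte-sum stands in for the fingerprint (an assumption for brevity, not LBFS's actual function). A boundary falls wherever the low-order 13 bits of the window's value equal a chosen magic number, so boundaries move with content, not with offsets.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Return the offsets at which chunk boundaries (breakpoints) fall.
// A rolling sum over a 48-byte window stands in for a Rabin fingerprint.
std::vector<size_t> breakpoints(const std::string& data, uint32_t magic = 0) {
    const size_t window = 48;
    const uint32_t mask = (1u << 13) - 1;   // low-order 13 bits
    std::vector<size_t> cuts;
    if (data.size() < window) return cuts;

    uint32_t fp = 0;
    for (size_t i = 0; i < window; ++i) fp += (uint8_t)data[i];

    for (size_t i = window; ; ++i) {
        if ((fp & mask) == magic) cuts.push_back(i);  // chunk ends at offset i
        if (i == data.size()) break;
        fp += (uint8_t)data[i];             // slide the window one byte
        fp -= (uint8_t)data[i - window];
    }
    return cuts;
}
```

Because a boundary depends only on the bytes inside its 48-byte window, an edit perturbs at most the chunks it touches -- the rest of the file keeps its boundaries and hashes, which is exactly what makes the index robust to shifting offsets. A production version would also enforce minimum and maximum chunk sizes to avoid pathological cases.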

Page 56: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Effects of edits on file chunks
 Chunks of a file before/after edits
  Grey shading shows edits
  Stripes show 48-byte regions with magic hash values, creating chunk boundaries

Page 57: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

More Indexing Issues
 Pathological cases
  Very small chunks: sending hashes of chunks would consume as much bandwidth as just sending the file
  Very large chunks: cannot be sent in a single RPC
 LBFS imposes minimum and maximum chunk sizes

Page 58: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

The Chunk Database
 Indexes each chunk by the first 64 bits of its SHA-1 hash
 To avoid synchronization problems, LBFS always recomputes the SHA-1 hash of any data chunk before using it
  Simplifies crash recovery
 Recomputed SHA-1 values are also used to detect hash collisions in the database
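The recompute-before-use discipline can be sketched like this (the `ChunkDb` structure is invented, and `std::hash` stands in for the 64-bit SHA-1 prefix -- an assumption for a self-contained example): the database never trusts a stored entry until the key recomputed from its bytes matches, so a stale, corrupted, or colliding entry is rejected rather than served.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

struct ChunkDb {
    std::map<uint64_t, std::string> chunks;   // hash prefix -> chunk bytes

    // Stand-in for "first 64 bits of the chunk's SHA-1 hash".
    static uint64_t key(const std::string& data) {
        return std::hash<std::string>{}(data);
    }
    void put(const std::string& data) { chunks[key(data)] = data; }

    // Look up by key, but recompute the hash before trusting the entry.
    bool get(uint64_t k, std::string* out) const {
        auto it = chunks.find(k);
        if (it == chunks.end()) return false;
        if (key(it->second) != k) return false;  // stale entry or collision
        *out = it->second;
        return true;
    }
};
```

Because correctness never depends on the index being up to date, a crash that leaves the database inconsistent costs only performance, not data -- which is the "simplifies crash recovery" point above.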

Page 59: EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.

Conclusion
 Under normal circumstances, LBFS consumes 90% less bandwidth than traditional file systems
 Makes transparent remote file access a viable and less frustrating alternative to running interactive programs on remote machines