EE324 INTRO TO DISTRIBUTED SYSTEMS L-20 More DFS.
Andrew File System
Let's start with a familiar example: the Andrew environment.
- 10,000s of machines, 10,000s of people, terabytes of disk
Goal: have a consistent namespace for files across computers. Allow any authorized user to access their files from any computer.
[Figure: many client machines sharing a collection of disk servers]
AFS summary
Client-side caching is a fundamental technique for improving scalability and performance, but it raises important questions of cache consistency.
Timeouts and callbacks are common methods for providing (some forms of) consistency.
AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users) and performance. The AFS authors argued that applications with highly concurrent, shared access, such as databases, needed a different model.
Today's Lecture
- Other types of DFS
- Coda: disconnected operation
- Programming assignment 4
Background
We are back in the 1990s. The network is slow and unstable. Clients have gone from terminals to "powerful" clients: 33MHz CPU, 16MB RAM, 100MB hard drive.
Mobile users appeared: the first IBM ThinkPad shipped in 1992. We can do work at the client without the network.
CODA
Successor of the very successful Andrew File System (AFS).
AFS was the first DFS aimed at a campus-sized user community. Its key ideas include close-to-open consistency and callbacks.
Hardware Model
CODA and AFS assume that client workstations are personal computers controlled by their user/owner: fully autonomous, and not to be trusted.
CODA allows owners of laptops to operate them in disconnected mode, the opposite of ubiquitous connectivity.
Accessibility
Must handle two types of failures:
- Server failures: data servers are replicated
- Communication failures and voluntary disconnections: Coda uses optimistic replication and file hoarding
Design Rationale
Scalability:
- Callback cache coherence (inherited from AFS)
- Whole-file caching
Portable workstations:
- User's assistance in cache management
Design Rationale - Replica Control
Pessimistic: disable all partitioned writes; require a client to acquire control of a cached object prior to disconnection.
Optimistic: assume no one else is touching the file.
- Con: requires sophisticated conflict detection
- Pro: in practice, write-sharing is rare in Unix
- Pro: high availability; access anything in range, without locks
Pessimistic Replica Control
Would require the client to acquire exclusive (RW) or shared (R) control of cached objects before accessing them in disconnected mode.
- Acceptable solution for voluntary disconnections
- Does not work for involuntary disconnections: what if the laptop remains disconnected for a long time?
Leases
We could grant exclusive/shared control of the cached objects for a limited amount of time.
- Works very well in connected mode; reduces server workload
- The server can keep leases in volatile storage, as long as their duration is shorter than reboot time
- Would only work for very short disconnection periods
Optimistic Replica Control (I)
Optimistic replica control allows access even in disconnected mode.
- Tolerates temporary inconsistencies
- Promises to detect them later
- Provides much higher data availability
Optimistic Replica Control (II)
Defines an accessible universe: the set of replicas that the user can access. The accessible universe varies over time.
At any time, the user:
- Reads from the latest replica(s) in their accessible universe
- Updates all replicas in their accessible universe
Coda (Venus) States
1. Hoarding: normal operation mode
2. Emulating: disconnected operation mode
3. Reintegrating: propagates changes and detects inconsistencies
[State diagram: Hoarding, Emulating, and Reintegrating, with transitions between them]
Hoarding
Hoard useful data in anticipation of disconnection. Balance the needs of connected and disconnected operation:
- Cache size is restricted
- Disconnections are unpredictable
A prioritized algorithm manages the cache; hoard walking periodically reevaluates objects.
Prioritized algorithm
- User-defined hoard priority p: how interesting is the object?
- Recent usage q
- Object priority = f(p, q); kick out the object with the lowest priority
+ Fully tunable: everything can be customized
- But not really tunable in practice(?): no guidance on how to customize it
Emulation
In emulation mode:
- Attempts to access files that are not in the client cache appear as failures to the application
- All changes are written to a persistent log, the client modification log (CML)
Persistence
Venus keeps its cache and related data structures in non-volatile storage.
Reintegration
When the workstation gets reconnected, Coda initiates a reintegration process:
- Performed one volume at a time
- Venus ships the replay log to all volumes
- Each volume performs a log replay algorithm; only write/write conflicts matter
Did it succeed? Yes: free the logs and reset priorities. No: save the logs to a tar file and ask the user for help.
Performance
Duration of reintegration:
- A few hours of disconnection reintegrate in about 1 minute, but sometimes much longer
Cache size:
- 100MB at the client is enough for a "typical" workday
Conflicts:
- Essentially no conflicts at all! Why? Over 99% of modifications are by the same person, and two users modifying the same object within a day happens in under 0.75% of cases
Coda Summary
- Puts scalability and availability before data consistency (unlike NFS)
- Assumes that inconsistent updates are very infrequent
- Introduced disconnected operation mode and file hoarding
Today's Lecture
- Other types of DFS
- Coda: disconnected operation
- Programming assignment 4
Filesystems
Last time: we looked at how we could use RPC to split filesystem functionality between client and server.
But we didn't really change the design: we just moved the entire filesystem to the server and then added some caching on the client in various ways.
You can go farther, but it requires ripping the filesystem functionality apart into modules and placing those modules at different computers on the network.
So now we need to ask: what does a filesystem do, anyway?
Well, there's a disk. Disks store bits, in fixed-length pieces called sectors or blocks. But a filesystem has files. And often directories. And maybe permissions, creation and modification times, and other stuff about the files ("metadata").
Filesystem functionality
- Directory management: maps entries in a hierarchy of names to files on disk
- File management: manages adding to, reading, changing, appending to, and deleting individual files
- Space management: where on disk to store these things?
- Metadata management
Conventional filesystem
Wraps all of these up together. Useful concepts:
- "Superblock": a well-known location on disk where top-level filesystem info is stored (pointers to more structures, etc.)
- "Free list" or "free space bitmap": data structures to remember what's used on disk and what's not. Why? Fast allocation of space for new files.
- "inode" (short for index node): stores all metadata about a file, plus information pointing to where the file is stored on disk. Small files may be referenced entirely from the inode; larger files may have some indirection to blocks that list locations on disk. Directory entries point to inodes.
- "extent": a way of remembering where on disk a file is stored. Instead of listing all blocks, list a starting block and a range. A more compact representation, but it requires large contiguous block allocation.
Filesystem "VFS" ops
VFS ("virtual filesystem"): a common abstraction layer inside kernels for building filesystems; the interface is common across FS implementations.
Think of this as an abstract data type for filesystems: it has both syntax (function names, return values, etc.) and semantics ("don't block on this call", etc.).
One key thing to note: the VFS itself may do some caching and other management; in particular, it often maintains an inode cache.
FUSE
The lab will use FUSE. FUSE is a way to implement filesystems in user space (as normal programs), but have them available through the kernel, like normal files. It has a kinda VFS-like interface.
[Figure from FUSE documentation]
Directory operations
- readdir(path): return directory entries for each file in the directory
- mkdir(path): create a new directory
- rmdir(path): remove the named directory
File operations
- mknod(path, mode, dev): create a new "node" (generic: a file is one type of node; a device node is another)
- unlink(path): remove a link to an inode, decrementing the inode's reference count. Many filesystems permit "hard links": multiple directory entries pointing to the same file
- rename(path, newpath)
- open: open a file, returning a file handle; read; write
- truncate: cut a file off at a particular length
- flush: close one handle to an open file
- release: completely close a file handle
Metadata ops
- getattr(path): return a metadata struct
- chmod / chown: ownership & permissions
Back to goals of DFS
Users should have the same view of the system and be able to share files.
Last time:
- A central fileserver handled all filesystem operations: consistency was easy, but overhead was high and scalability poor
- Moved to NFS and then AFS: added more and more caching at the client, which added cache consistency problems, solved using timeouts or callbacks to expire cached contents
Protocol & consistency
Remember last time: NFS defined operations to occur on unique inode #s instead of names. Why? Idempotency: we wanted operations to be unique.
A related example for today, as we consider splitting up components: moving a file from one directory to another. What if this is a complex operation ("remove from one", "add to another")? Can another user see intermediate state (e.g., the file in both directories, or in neither)?
Last time we saw the issue of when things become consistent, and presented close-to-open consistency as a compromise.
Scaling beyond...
What happens if you want to build AFS for all of CMU? More disks than one machine can handle; more users than one machine can handle.
Simplest idea: partition users onto different servers.
- How do we handle a move across servers?
- How do we divide the users? Statically? What about load balancing, both for operations and for space? What if some files become drastically more popular?
"Cluster" filesystems
The lab is inspired by Frangipani, a scalable distributed filesystem.
Think back to our list of things that filesystems have to do:
- Concurrency management
- Space allocation and data storage
- Directory management and naming
Frangipani design
[Layer diagram, top to bottom:]
- Program
- Frangipani file server (with a distributed lock service alongside)
- Petal distributed virtual disk
- Physical disks
Petal aggregates many disks (across many machines) into one big "virtual disk": a simplifying abstraction for both design and implementation. It exports extents and provides allocation, deallocation, etc.
Internally, it maps (virtual disk, offset) to (server, physical disk, offset).
Frangipani stores all data (inodes, directories, data) in Petal, and uses the lock server for consistency (e.g., when creating a file).
Compare with NFS/AFS
In NFS/AFS, clients just relay all FS calls to a central server. Here, clients run enough code to know which server to direct things to; they are active participants in the filesystem.
(N.b.: you could, of course, use the Frangipani/Petal design to build a scalable NFS server, and in fact similar techniques are how a lot of them actually are built. See the upcoming lecture on RAID, though: replication and redundancy management become key.)
Lab 2: YFS
Yet-another File System. :) A simpler version of what we just talked about: only one extent server (you don't have to implement Petal) and a single lock server.
- Each server is written in C++
- yfs_client interfaces with the OS through FUSE
- The following labs will build YFS incrementally, starting with the lock server and building up through supporting file & directory ops distributed around the network
Warning
This lab is difficult. It assumes a bit more C++ than lab 1 did. Please please please get started early; ask course staff for help.
It will not destroy you; it will make you stronger. But it may well take a lot of work and be pretty intensive.
Remember this slide?
We are back in the 1990s. The network is slow and unstable. Clients have gone from terminals to "powerful" clients: 33MHz CPU, 16MB RAM, 100MB hard drive. Mobile users appear: the first IBM ThinkPad in 1992.
What about now?
We are in the 2000s now. The network is fast and reliable on a LAN. The "powerful" client is now a very powerful client: 2.4GHz CPU, 1GB RAM, 120GB hard drive. Mobile users are everywhere.
Do we still need disconnection? How many people are using Coda?
Do we still need disconnection?
WAN and wireless links are still slow and not very reliable.
PDAs are not very powerful: 200MHz StrongARM, 128MB CF card, constrained electric power.
LBFS (MIT) targets WANs; Coda and Odyssey (CMU) target mobile users. Adaptation is also important.
What is the future?
High-bandwidth, reliable wireless everywhere. Even PDAs are powerful: 2GHz, 1GB RAM/Flash.
What will be the research topic in FS? P2P?
Today's Lecture
- DFS design comparisons continued
  - Topic 4: file access consistency (NFS, AFS, Sprite, and DCE DFS)
  - Topic 5: locking
- Other types of DFS
  - Coda: disconnected operation
  - LBFS: weakly connected operation
Low Bandwidth File System: Key Ideas
A network file system for slow or wide-area networks.
- Exploits similarities between files, or between versions of the same file
- Avoids sending data that can be found in the server's file system or the client's cache
- Also uses conventional compression and caching
- Requires 90% less bandwidth than traditional network file systems
Working on slow networks
- Make local copies: must worry about update conflicts
- Use remote login: only works for text-based applications
- Use LBFS instead: better than remote login, but must deal with issues like auto-saves blocking the editor for the duration of a transfer
LBFS design
Provides close-to-open consistency. Uses a large, persistent file cache at the client, which stores the client's working set of files.
The LBFS server divides the files it stores into chunks and indexes the chunks by hash value; the client similarly indexes its file cache.
Exploits similarities between files: LBFS never transfers chunks that the recipient already has.
Indexing
Uses the SHA-1 algorithm for hashing; it is collision resistant.
The central challenge in indexing file chunks is keeping the index at a reasonable size while dealing with shifting offsets:
- Indexing the hashes of fixed-size data blocks keeps the index small, but an insertion shifts every later block
- Indexing the hashes of all overlapping blocks at all offsets handles shifts, but the index becomes enormous
LBFS indexing solution
Considers only non-overlapping chunks, and sets chunk boundaries based on file contents rather than on position within the file.
Examines every overlapping 48-byte region of the file, using Rabin fingerprints, to select the boundary regions, called breakpoints. When the low-order 13 bits of a region's fingerprint equal a chosen value, the region constitutes a breakpoint.
Effects of edits on file chunks
[Figure: chunks of a file before/after edits. Grey shading shows edits; stripes show 48-byte regions with magic hash values creating chunk boundaries.]
More Indexing Issues
Pathological cases:
- Very small chunks: sending the hashes of the chunks would consume as much bandwidth as just sending the file
- Very large chunks: cannot be sent in a single RPC
LBFS imposes minimum and maximum chunk sizes.
The Chunk Database
Indexes each chunk by the first 64 bits of its SHA-1 hash.
To avoid synchronization problems, LBFS always recomputes the SHA-1 hash of any data chunk before using it; this simplifies crash recovery. The recomputed SHA-1 values are also used to detect hash collisions in the database.
Conclusion
Under normal circumstances, LBFS consumes 90% less bandwidth than traditional file systems. It makes transparent remote file access a viable and less frustrating alternative to running interactive programs on remote machines.