Deciding When to Forget in the Elephant File System
1
Deciding When to Forget in the Elephant File System
University of British Columbia: Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Ross W. Carton, and Jacob Ofir
Hewlett-Packard Laboratories: Alistair C. Veitch
December 1999
Presented by: David Allen
May 31st, 2005
2
Elephant File System: Overview
• Undo and Long-Term History: A file system that helps protect data by keeping histories of file and directory changes.
• User Control: Gives the user control over retention policies, which can be applied at the file level.
• Storage Reclamation: Separates storage reclamation from file operations such as write and delete. A cleaner runs in the background to reclaim storage and enforce the retention policy.
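The separation of reclamation from file operations can be sketched as follows. This is a minimal illustration, not the Elephant implementation: the `Store` class, the `clean` function, and the `retain` callback are hypothetical names, and the cleaner is an ordinary function call here rather than a background process.

```python
class Store:
    """Toy versioned store: write and delete never free storage."""

    def __init__(self):
        self.versions = {}   # name -> list of (time, data), oldest first
        self.reclaimed = 0   # number of versions freed by the cleaner

    def write(self, name, when, data):
        # A write appends a new version instead of overwriting the old one.
        self.versions.setdefault(name, []).append((when, data))

    def delete(self, name, when):
        # A delete records a tombstone (data=None) instead of freeing storage.
        self.versions[name].append((when, None))


def clean(store, retain):
    """Cleaner pass: free every version the retention policy does not keep.

    `retain` is a hypothetical policy callback that takes a file's history
    and returns the versions that must be kept.
    """
    for name, history in list(store.versions.items()):
        kept = retain(history)
        store.reclaimed += len(history) - len(kept)
        if kept:
            store.versions[name] = kept
        else:
            del store.versions[name]


store = Store()
store.write("a.o", 1, "x")
store.write("a.o", 2, "y")
store.delete("a.o", 3)
clean(store, lambda history: history[-2:])  # assumed policy: keep the two newest
```

Because only `clean` frees anything, an accidental delete stays recoverable until the policy decides the old versions are no longer needed.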
3
Elephant file system: Why
• User Failures: There is already good protection from network, system, and media failures. Now we need protection from user mistakes.
Example: rm *.o is not the same as rm * o
4
[Chart: High-End Disk Capacity by Year. x-axis: Year (1999-2005); y-axis: Capacity (MB), 0 to 450]
Elephant file system: Why
• Cheap Disk Space: Single inexpensive disks were approaching 50GB when the paper was written in 1999. Now, in 2005, they are approaching 500GB. At this rate they will reach 2TB by 2010.
5
Elephant file system: Why
• Cheap Disk Space: High-end disk capacity has increased 10x in 6 years, and over the same period the price per GB has fallen by more than a factor of 10.
[Chart: High-End Disk Price per GB by Year. x-axis: Year (1999-2005); y-axis: Price/GB, $0.00 to $16.00]
[Chart: Rough Price for High-End Disk Drive. x-axis: Year (1996-2005); y-axis: Price, $0.00 to $800.00]
6
Elephant file system: Why
• Cheap Disk Space: Other types of media are growing as well: 8GB compact flash cards and 6GB micro drives. (Useful for that 16.7MP Canon camera with its 42MB images.)
7
Elephant file system: Why
• Capacity: Disk capacities are large, while human productivity is constant. Only a relatively small set of files needs protection.
It makes sense to support revision histories on files and directories.
8
Elephant file system: Change
• Change in pattern of use: Does this paper stand up to changes in disk usage? There has been an explosion of large files from still and video digital cameras, MP3 CD rips, and DivX DVD rips.
I have 17.8GB of pictures and video from one trip, which I need to prune and edit into a final form.
How would people in the class use this system?
9
Elephant file system: Policies
• Keep One (no versioning): Just like the FFS. File changes overwrite existing data and are permanent.
10
Elephant file system: Policies
• Keep All (complete versioning): Like revision control systems. The entire history is maintained.
11
Elephant file system: Policies
• Keep Safe (undo protection): Keeps recent changes for a specified undo period.
[Figure: undo period]
12
Elephant file system: Policies
• Keep Landmarks (long-term history): In addition to Keep Safe protection, retains important file versions.
[Figure: undo period]
13
Elephant file system: Policies
• Application Defined (user specified): A custom policy implemented at the user level.
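The built-in policies can be compared with a small sketch. This is one reading of the slides, not the paper's code: the `Version` type, its `landmark` flag, and the function names are hypothetical. Keep Safe is interpreted here as keeping every version that was current at some point during the undo period, and landmarks are assumed to be already marked (the automatic landmark detection is not modeled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    created: float          # time the version was written
    landmark: bool = False  # hypothetical flag: version marked as a landmark


def keep_one(versions, now):
    """Keep One: only the current (newest) version survives."""
    return versions[-1:]


def keep_all(versions, now):
    """Keep All: the entire history is retained."""
    return list(versions)


def keep_safe(versions, now, undo_period):
    """Keep Safe: retain versions that were current within the undo period.

    A version stops being current when its successor is written, so it is
    kept if it is the newest or if its successor falls inside the window.
    """
    cutoff = now - undo_period
    kept = []
    for i, version in enumerate(versions):
        is_current = i == len(versions) - 1
        if is_current or versions[i + 1].created >= cutoff:
            kept.append(version)
    return kept


def keep_landmarks(versions, now, undo_period):
    """Keep Landmarks: Keep Safe protection plus every landmark version."""
    safe = set(keep_safe(versions, now, undo_period))
    return [v for v in versions if v.landmark or v in safe]


# Versions are listed oldest first; times are arbitrary units.
history = [Version(0), Version(10, landmark=True), Version(20), Version(95)]
recent = keep_safe(history, now=100, undo_period=30)
long_term = keep_landmarks(history, now=100, undo_period=30)
```

With this history, `recent` holds the versions written at 20 and 95, and `long_term` adds the landmark written at 10. An application-defined policy would be another function with the same shape.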
14
Elephant file system: Features for Comparison
• User Control: Only retains history on user-selected files, with user-selected policies. Custom policies can be created. Landmarks can be user specified.
• Automation: Implemented within the file system. Revisions are maintained automatically as the files are used. Landmarks can be determined automatically. Cleaning is done in the background.
15
Elephant file system: Features for Comparison
• Granularity: Every file and directory change can be kept. Full or partial long-term histories can be maintained. Files can be grouped to maintain consistency for landmarking. Versioning on files is done at the block level.
• Access: A specific version is named by a file and date pair. Only the current version can be written to. The most recent revision is fastest, but all versions can be accessed relatively quickly. Only a single version exists at a time.
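Access by file and date pair can be sketched as a lookup over sorted version times. The `VersionedFile` class and its methods are hypothetical names for illustration; the real system resolves versions inside the file system, not in a Python list.

```python
import bisect


class VersionedFile:
    """Toy file whose versions are addressed by time."""

    def __init__(self):
        self._times = []     # creation times, kept in increasing order
        self._contents = []  # contents of each version, same order

    def write(self, when, data):
        # Only the current version can be written, so every write must be
        # newer than all existing versions.
        if self._times and when <= self._times[-1]:
            raise ValueError("only the current version can be written")
        self._times.append(when)
        self._contents.append(data)

    def read(self, when=None):
        """Return the version that was current at `when` (default: newest)."""
        if not self._times:
            raise FileNotFoundError("file has no versions")
        if when is None:
            return self._contents[-1]
        # Index of the last version created at or before `when`.
        i = bisect.bisect_right(self._times, when) - 1
        if i < 0:
            raise FileNotFoundError("no version existed at that time")
        return self._contents[i]


f = VersionedFile()
f.write(10, "first draft")
f.write(20, "second draft")
old = f.read(15)   # the version that was current at time 15
new = f.read()     # the current version
```

Reading the newest version needs no search at all, which matches the slide's point that the most recent revision is the fastest to reach.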
16
Elephant file system: Features for Comparison
• Storage: Files with only one version are stored as efficiently as files in a system without versioning. Revisions to inodes are stored in an inode log, which uses full blocks and is much larger than a single inode. Directories are stored as name histories.
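A directory stored as a name history can be sketched as a list of entries that each carry a lifetime. This is an assumption-laden illustration: the `NameEntry` fields and the `Directory` methods are hypothetical, chosen so that a name can be deleted and later reused without losing the older binding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameEntry:
    name: str
    inode: int
    created: float
    deleted: Optional[float] = None  # None while the name is still live


class Directory:
    """Toy directory that never forgets a name, only closes its lifetime."""

    def __init__(self):
        self.entries = []

    def lookup(self, name, when):
        """Return the inode bound to `name` at time `when`, or None."""
        for e in self.entries:
            live = e.created <= when and (e.deleted is None or when < e.deleted)
            if e.name == name and live:
                return e.inode
        return None

    def create(self, name, inode, when):
        if self.lookup(name, when) is not None:
            raise FileExistsError(name)
        self.entries.append(NameEntry(name, inode, when))

    def delete(self, name, when):
        # Deleting only stamps the entry; the history stays in the directory.
        for e in self.entries:
            if e.name == name and e.deleted is None:
                e.deleted = when
                return
        raise FileNotFoundError(name)

    def listing(self, when):
        """Names visible in the directory at time `when`."""
        return sorted(e.name for e in self.entries
                      if e.created <= when and (e.deleted is None or when < e.deleted))


d = Directory()
d.create("a.o", inode=7, when=10)
d.delete("a.o", when=20)
d.create("a.o", inode=9, when=30)
before_delete = d.lookup("a.o", when=15)  # the old binding is still reachable
```

Looking up the directory as of an earlier time is what makes an accidental `rm` undoable in this model.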
17
Elephant file system vs. the Trash Can
• User Control: Users manually empty the trash can. This gives files different levels of protection depending on when they were deleted and when the trash can was emptied.
• Automation: Files are automatically moved to the trash can on delete.
• Granularity: Very coarse-grained. Only protects files against accidental deletion, and only until the trash can is emptied. No directory protection.
• Access: Files can be retrieved from the trash can, but the user needs to determine where to put them.
• Storage: A copy of the entire file is kept in the trash can.
18
Elephant file system vs. Backups
• User Control: Typically no control over system backups. Users can manually copy files.
• Automation: System backups are usually automatic.
• Granularity: Very coarse over time. No fine-grained revisioning. No protection between backups. Typically limited by the backup retention policy (number of tapes).
• Access: System backups are usually very expensive to retrieve. Manual user backups are usually closer, but not always convenient.
• Storage: Usually full or differential copies of the data.
19
Elephant file system vs. Checkpoints
• User Control: Typically no user control over checkpoints.
• Automation: Checkpoints are usually automatic.
• Granularity: Very coarse over time. No fine-grained revisioning. No protection between checkpoints. Typically limited by the checkpoint retention policy (space).
• Access: Typically on-line and easy to get to.
• Storage: Efficient. A copy-on-write policy records changes made to the file system after the checkpoint.
20
Elephant file system vs. Revision Control System
• User Control: Only retains history on user-selected files, but it is usually best to use revision control on all files in a directory. There are no policies to select; the entire history is retained. File groups can be "tagged" to establish a consistent version (like landmarks and grouping).
• Automation: None. Usually a set of command-line tools initiated by the user (checkout, commit, ...).
• Granularity: Medium. Only committed changes are kept. All versions are retained. It is often difficult or impossible to remove old versions. Revision control typically does not include directories (CVS). Renaming or moving files will often break file histories (CVS, SourceSafe).
21
Elephant file system vs. Revision Control System
• Access: Files can be accessed by name and version. Only the most recent files can be modified. Older versions can be branched, and branches can be merged. Multiple branches (versions) can exist at a time.
• Storage: Text files are usually stored efficiently as differentials. Access is fast for recent versions and slow for old versions. Binary file storage is usually inefficient, using full copies.
22
Elephant file system: Summary
• Most files don't need versioning, so the impact is low.
• Performance is very close to that of a system with no versioning.
• The storage cost of metadata is high in the prototype implementation.
• Disk capacity has increased as predicted in the paper, but so has the need for capacity, due to digital music and imaging.
• Usage patterns have also changed for the same reasons.
• Does this system still make as much sense in the face of these changes? Definitely!
23
References
• D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C. Veitch, R. W. Carton, and J. Ofir. "Deciding When to Forget in the Elephant File System." In Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, December 12-15, 1999, Charleston, SC, pp. 110-123.
• Historic disk capacity and price data: http://www.littletechshoppe.com/ns1625/winchest.html
• Current media capacities and prices: http://froogle.google.com