File Classification in self-* storage systems
Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard,
Margo Seltzer
Introduction
- Self-* infrastructure needs information about users, applications, and policies
- This information is not readily provided, and the system cannot depend on users to provide it
- So? It must be learned
Self-* storage systems
- A sub-problem of the broader self-* infrastructure
- Key idea: get hints from what creators associate with their files
  - File sizes
  - File names
  - Lifetimes
- Once intentions are determined, decisions can be made
- Results: better file organization and performance
Classifying Files
- Current practice: rule-of-thumb policy selection, which is generic and not optimized
- Better: distinguish classes of files and apply finer-grained policies, ideally assigned at file creation
- To determine classes at creation time, the self-* system must learn this association
- Two sources: 1) traces, 2) a running file system
- So, how? Create a model that classifies files based on attributes such as:
  - Name
  - Owner
  - Permissions
- Irrelevant attributes must be filtered out; the classifier must learn rules to do so
- Rules are learned from training data; then inference happens on new files
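The first step, turning a file's creation-time attributes into features a classifier can consume, can be sketched as below. The specific fields (ext, dir_depth, dotfile, etc.) are illustrative assumptions, not the paper's exact feature set.

```python
import os
import stat

def file_features(path, mode=0o644, owner="alice"):
    """Turn a file's creation-time attributes into a feature dict.

    Attribute families (name parts, owner, permissions) follow the
    paper's idea; the exact fields chosen here are illustrative.
    """
    name = os.path.basename(path)
    _, ext = os.path.splitext(name)
    return {
        "ext": ext.lstrip(".").lower(),        # e.g. "log", "c", "tmp"
        "dir_depth": path.count("/"),          # how deep in the tree
        "owner": owner,
        "writable": bool(mode & stat.S_IWUSR), # owner-write bit
        "dotfile": name.startswith("."),
    }

print(file_features("/home/alice/src/main.c"))
```

A classifier downstream would then learn which of these features actually predict a file's class and filter out the rest.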
The right model
- The model must be: scalable, dynamic, cost-sensitive (accounting for misprediction cost), and interpretable by humans
- Model selected: decision trees
ABLE
Attribute-Based Learning Environment
1. Obtain traces
2. Build a decision tree
3. Make predictions
- The tree is grown top-down until all attributes are used, splitting samples until leaves contain files with similar attributes
- After the tree is created, querying begins
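The top-down splitting described above can be sketched as a minimal ID3-style decision tree over categorical file attributes. This is a toy illustration of the technique, not ABLE's implementation; the training samples and class names are invented.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(samples, attrs):
    """Grow the tree top-down: split on the attribute with the highest
    information gain until a leaf is pure or no attributes remain."""
    labels = [y for _, y in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority                       # leaf: a class label
    def gain(a):
        groups = {}
        for x, y in samples:
            groups.setdefault(x[a], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        return entropy(labels) - remainder
    best = max(attrs, key=gain)
    node = {"attr": best, "children": {}, "default": majority}
    branches = {}
    for x, y in samples:
        branches.setdefault(x[best], []).append((x, y))
    rest = [a for a in attrs if a != best]
    for value, subset in branches.items():
        node["children"][value] = build_tree(subset, rest)
    return node

def predict(tree, x):
    """Walk the tree; fall back to a node's majority class for
    attribute values never seen during training."""
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["attr"]], tree["default"])
    return tree

# Hypothetical training data: file attributes -> lifetime class.
samples = [
    ({"ext": "tmp", "writable": True},  "short-lived"),
    ({"ext": "tmp", "writable": False}, "short-lived"),
    ({"ext": "log", "writable": True},  "short-lived"),
    ({"ext": "c",   "writable": True},  "long-lived"),
    ({"ext": "c",   "writable": False}, "long-lived"),
]
tree = build_tree(samples, ["ext", "writable"])
print(predict(tree, {"ext": "c", "writable": True}))  # -> long-lived
```

Because the tree is a nested dict of attribute tests, its rules stay human-readable, which is one of the model requirements listed above.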
Tests
- Evaluated on traces from several systems to ensure workload independence: DEAS03, EECS03, CAMPUS, LAB
- The control: the MODE algorithm, which places all files in a single cluster
Results
- Prediction results are quite good: 90%-100% claimed
- Clustering of files by attributes is clear
- The authors predict that a model's ruleset will converge over time
Benefits of incremental learning
- Dynamically refines the model as new samples become available
- Generally better than one-shot learners, which sometimes perform poorly
- Rulesets of incremental learners are smaller
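The incremental-vs-one-shot contrast can be illustrated with a toy learner that keeps per-attribute-value class counts and refines them with every new sample. This is a hypothetical stand-in to show why continued refinement helps when workloads drift, not the paper's actual algorithm; the `.dat` stream below is invented.

```python
from collections import Counter, defaultdict

class IncrementalClassifier:
    """Toy incremental learner: per-attribute-value class counts,
    updated with each labeled sample as it arrives."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, value, label):
        self.counts[value][label] += 1

    def predict(self, value):
        seen = self.counts.get(value)
        return seen.most_common(1)[0][0] if seen else None

# A drifting workload: ".dat" files start out short-lived, then the
# application changes and they become long-lived.
stream = [("dat", "short-lived")] * 2 + [("dat", "long-lived")] * 3

one_shot, incremental = IncrementalClassifier(), IncrementalClassifier()
for value, label in stream[:2]:   # one-shot: trained once, early
    one_shot.update(value, label)
for value, label in stream:       # incremental: keeps learning
    incremental.update(value, label)

print(one_shot.predict("dat"))     # -> short-lived (stale)
print(incremental.predict("dat"))  # -> long-lived  (refined)
```

The one-shot model is frozen with its early, now-wrong view of `.dat` files, while the incremental model follows the drift.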
- On accuracy: more attributes increase the chance of over-fitting
- More rules lead to smaller ratios, losing the compression benefits
- Predictive models can make false predictions, which can impact performance
  - e.g., files that should be in RAM are placed on disk instead
- Solution: cost functions that penalize errors and create a biased tree; system goals will need to be translated into these costs
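The cost-function idea above can be sketched as a cost-weighted decision at a leaf: instead of predicting the majority class, pick the class that minimizes expected misclassification cost. The "hot"/"cold" classes and the cost values are invented for illustration.

```python
def cost_sensitive_class(leaf_counts, cost):
    """Choose the class at a leaf that minimizes expected
    misclassification cost, rather than the plain majority class.

    leaf_counts: {true_class: sample count} observed at the leaf.
    cost: {(true_class, predicted_class): penalty}; 0 if omitted.
    """
    total = sum(leaf_counts.values())
    def expected(pred):
        return sum(n / total * cost.get((true, pred), 0)
                   for true, n in leaf_counts.items())
    return min(leaf_counts, key=expected)

# Hypothetical leaf: 3 "cold" files, 2 "hot" files. Majority vote says
# "cold" (place on disk), but stranding a hot file on disk (cost 10)
# hurts far more than caching a cold file in RAM (cost 1).
cost = {("hot", "cold"): 10.0, ("cold", "hot"): 1.0}
print(cost_sensitive_class({"cold": 3, "hot": 2}, cost))  # -> hot
```

This is how a system goal ("never put hot data on disk") gets translated into a bias in the tree: the asymmetric penalties steer predictions away from the expensive mistake.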
Conclusion
- These trees provide prediction accuracies in the 90% range
- Adaptable via incremental learning
- Continued work: integration into the self-* infrastructure
Questions?