File Classification in self-* storage systems

Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer

Introduction

Self-* infrastructure needs information about users, applications, and policies

This information is not readily provided, and the system cannot depend on users to provide it

So? It must be learned

Self-* storage systems: a sub-problem of the self-* infrastructure

Key: get hints from what creators associate with their files (file size, file names, lifetimes)

Once intentions are determined, decisions can be made

Result: better file organization and performance

Classifying Files

Current approach: rule-of-thumb policy selection (generic, not optimized)

Better: distinguish file classes, enabling finer-grained policies, ideally assigned at file creation (see the sketch below)

To determine classes at creation, the self-* system must learn this association

Learned from: 1) traces, 2) a running file system
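
To make the policy connection concrete, here is a minimal sketch of mapping a predicted file class to a storage policy at creation time; the class labels and policies below are hypothetical illustrations, not taken from the paper.

# Hypothetical mapping from a predicted file class to a per-file policy.
# Class labels ("short-lived", "write-mostly", ...) and the policies are
# illustrative placeholders; a real self-* system derives both from its goals.
POLICY_TABLE = {
    "short-lived":  {"placement": "RAM buffer",                "replicas": 1},
    "write-mostly": {"placement": "log-structured region",     "replicas": 2},
    "read-mostly":  {"placement": "disk, aggressive prefetch", "replicas": 2},
    "large-cold":   {"placement": "disk, no cache",            "replicas": 1},
}

def policy_for(predicted_class):
    """Pick a policy at file-creation time from the predicted class."""
    return POLICY_TABLE.get(predicted_class, {"placement": "default", "replicas": 1})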

So, how? Create a model that classifies files based on (some of) their attributes: name, owner, permissions, ...

Irrelevant attributes must be filtered out; the classifier must learn rules to do so

Rules are learned from a training set; then inference happens (a feature-extraction sketch follows)
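
A rough sketch of turning create-time attributes into features a classifier can consume; the attribute set and encoding here are assumptions for illustration, not the paper's exact choices.

import os

def extract_features(path, owner, mode):
    """Encode create-time attributes as categorical/boolean features.
    The classifier (not this function) learns which attributes matter."""
    name = os.path.basename(path)
    _, ext = os.path.splitext(name)
    return {
        "extension": ext.lower(),                          # e.g. ".log", ".tmp", ".c"
        "name_has_digits": any(c.isdigit() for c in name),
        "owner": owner,                                    # a numeric UID would also work
        "executable": bool(mode & 0o111),
        "dir_depth": path.count("/"),
    }

# Example: attributes of a freshly created file
features = extract_features("/home/alice/build/output.log", owner="alice", mode=0o644)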

The right model

The model must be: scalable, dynamic, cost-sensitive (accounts for mis-prediction cost), and interpretable (by humans)

Model selected: decision trees

ABLE

Attribute-Based Learning Environment: 1. obtain traces, 2. build a decision tree, 3. make predictions

The tree is built top-down, splitting the sample until leaves contain files with similar attributes or all attributes are used

After the tree is created, querying begins (a sketch of the three steps follows)
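
A minimal sketch of the three ABLE steps using off-the-shelf tools (scikit-learn); the paper's own learner, attribute set, and trace format may differ, and the class labels below are invented.

# 1) obtain trace samples, 2) induce a decision tree, 3) predict for new files
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# 1) Trace samples: create-time attributes plus the class observed later.
samples = [
    ({"extension": ".tmp", "owner": "build", "executable": False}, "short-lived"),
    ({"extension": ".log", "owner": "httpd", "executable": False}, "write-mostly"),
    ({"extension": ".so",  "owner": "root",  "executable": True},  "read-mostly"),
    ({"extension": ".c",   "owner": "alice", "executable": False}, "read-mostly"),
]
attrs, labels = zip(*samples)

vec = DictVectorizer(sparse=False)          # one-hot encode categorical attributes
X = vec.fit_transform(attrs)

# 2) Induce the tree top-down; splits stop when leaves are (nearly) pure.
tree = DecisionTreeClassifier(min_samples_leaf=1).fit(X, labels)
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))  # human-readable rules

# 3) Query the model for a newly created file.
new_file = {"extension": ".tmp", "owner": "build", "executable": False}
print(tree.predict(vec.transform([new_file]))[0])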

Tests

Evaluated on traces from several systems to ensure the approach is workload-independent: DEAS03, EECS03, CAMPUS, LAB

The control: the MODE algorithm, which places all files in a single cluster (a baseline sketch follows)
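
A minimal sketch of a MODE-style baseline, assuming it simply predicts the most common class seen in training, i.e. it effectively treats all files as one cluster; the paper's exact baseline definition may differ.

from collections import Counter

class ModeBaseline:
    """Always predict the most common training class, ignoring attributes."""
    def fit(self, labels):
        self.mode = Counter(labels).most_common(1)[0][0]
        return self
    def predict(self, _attrs):
        return self.mode

# Usage: baseline accuracy to compare the decision tree against
baseline = ModeBaseline().fit(["read-mostly", "read-mostly", "short-lived"])
print(baseline.predict({"extension": ".tmp"}))   # -> "read-mostly"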

Results

Prediction accuracy is quite good: 90%-100% claimed

Clustering of files by attributes is clear

The authors predict that a model's ruleset will converge over time

Benefits of incremental learning

Dynamically refines the model as samples become available

Generally better than one-shot learners; one-shot learning sometimes performs poorly

The rulesets of incremental learners are smaller (a sketch of incremental refinement follows)
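
A minimal sketch of incremental refinement, approximated here by re-inducing the tree over a sliding window of labeled samples; the paper's incremental learner may instead update its ruleset in place.

from collections import deque
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

class IncrementalFileClassifier:
    """Refine the model as labeled samples arrive by periodically re-inducing
    the tree over a sliding window (an approximation of online learning)."""

    def __init__(self, window=10000, refit_every=1000):
        self.samples = deque(maxlen=window)   # (attribute-dict, class-label) pairs
        self.refit_every = refit_every
        self.n_seen = 0
        self.vec = DictVectorizer(sparse=False)
        self.tree = None

    def observe(self, attrs, label):
        self.samples.append((attrs, label))
        self.n_seen += 1
        if self.n_seen % self.refit_every == 0:
            X = self.vec.fit_transform([a for a, _ in self.samples])
            y = [c for _, c in self.samples]
            self.tree = DecisionTreeClassifier().fit(X, y)

    def predict(self, attrs, default="unknown"):
        if self.tree is None:
            return default
        return self.tree.predict(self.vec.transform([attrs]))[0]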

On accuracy: more attributes mean a higher chance of over-fitting

More rules -> smaller compression ratios; the compression benefit of a compact ruleset is lost

Predictive models can make false predictions, which can impact performance

e.g., files that should be in RAM are placed on disk instead

Solution: cost functions that penalize errors and create a biased tree; system goals will need to be translated into the cost function (a sketch follows)
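
One way to bias the induced tree with a cost function is to weight classes by their misprediction cost; the weights below are invented placeholders standing in for translated system goals.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical misprediction costs: misplacing files that belong in RAM is
# penalized more heavily than other errors. The numbers are placeholders for
# system goals translated into a cost function.
class_costs = {
    "short-lived":  5.0,   # costly to misplace: belongs in RAM, would land on disk
    "write-mostly": 2.0,
    "read-mostly":  1.0,
    "large-cold":   1.0,
}

# class_weight biases tree induction toward the expensive classes, trading
# some raw accuracy for a lower expected misprediction cost.
cost_biased_tree = DecisionTreeClassifier(class_weight=class_costs)
# fit as before: cost_biased_tree.fit(X, labels)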

Conclusion

These trees provide prediction accuracies in the 90% range

Adaptable via incremental learning

Continued work: integration into the self-* infrastructure

Questions?