File Classification in self-* storage systems

Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer

Introduction

Self-* infrastructure needs information about:
- Users
- Applications
- Policies

This information is not readily provided, and the system cannot depend on users or applications to supply it.

So it must be learned.

Self-* storage systems

- A sub-problem of the self-* infrastructure
- Key: get hints based on what creators associate with their files
  - File size
  - File names
  - Lifetimes
- Once intentions are determined, decisions can be made
- Results: better file organization and performance

Classifying Files

- Current: rule-of-thumb policy selection
  - Generic, not optimized
- Better: distinguish classes of files
  - Finer-grained policies
  - Ideally assigned at file creation
- Determine classes at creation
  - The self-* system must learn this association from:
    1) traces
    2) the running file system

So, how?

- Create a model that classifies files based on (some of) their attributes
  - Name
  - Owner
  - Permissions
- Irrelevant attributes must be filtered out
  - The classifier must learn rules to do so
  - The rules are learned from a training set
  - Then inference happens
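As a rough illustration of the attributes listed above, the sketch below packs a file's name, owner, and permissions into a flat record that a classifier could consume. The function name `attribute_vector`, the derived `extension` field, and the octal permission string are assumptions made for this example; the slides do not say how the attributes are encoded.

```python
import os.path

def attribute_vector(name, owner_uid, mode):
    """Build a flat attribute record for one newly created file.

    name      -- file name as given at creation
    owner_uid -- numeric user id of the creator
    mode      -- permission bits, e.g. 0o644 (file-type bits are masked off)
    """
    _, extension = os.path.splitext(name)
    return {
        "name": name,
        # Hypothetical derived feature: the lower-cased extension.
        "extension": extension.lower(),
        "owner": owner_uid,
        # Keep only the permission bits, rendered as a 3-digit octal string.
        "permissions": format(mode & 0o777, "03o"),
    }

record = attribute_vector("paper.TEX", 1001, 0o100644)
```

A learner is then free to ignore fields that turn out to carry no signal, which is the filtering step the slide describes.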

The right model

The model must be:
- Scalable
- Dynamic
- Cost-sensitive (mis-prediction cost)
- Interpretable (by humans)

Model selected: decision trees

ABLE

Attribute-Based Learning Environment

1. Obtain traces
2. Build a decision tree
3. Make predictions

Tree construction:
- Built top-down, until all attributes are used
- The sample is split until the leaves contain files with similar attributes
- Once the tree has been created, querying begins
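The top-down splitting described above can be sketched with a generic ID3-style learner. This is not ABLE's actual algorithm, which the slides do not spell out; the entropy-based split criterion, the function names, and the toy samples are all assumptions for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def build_tree(samples, attributes):
    """samples: list of (attrs_dict, label). Returns a label (leaf) or a dict (node)."""
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the leaf is pure or every attribute has been used.
    if len(set(labels)) == 1 or not attributes:
        return majority

    def split_entropy(attr):
        # Weighted entropy of the groups produced by splitting on attr.
        groups = {}
        for attrs, label in samples:
            groups.setdefault(attrs[attr], []).append(label)
        return sum(len(g) / len(samples) * entropy(g) for g in groups.values())

    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for attrs, label in samples:
        branches.setdefault(attrs[best], []).append((attrs, label))
    return {
        "attribute": best,
        "default": majority,  # used for attribute values never seen in training
        "children": {value: build_tree(subset, remaining)
                     for value, subset in branches.items()},
    }

def predict(tree, attrs):
    """Walk from the root to a leaf and return the predicted class."""
    while isinstance(tree, dict):
        value = attrs.get(tree["attribute"])
        tree = tree["children"].get(value, tree["default"])
    return tree

# Hypothetical training sample: (attributes, observed class).
samples = [
    ({"extension": ".tmp", "owner": "alice"}, "short-lived"),
    ({"extension": ".tmp", "owner": "bob"}, "short-lived"),
    ({"extension": ".pdf", "owner": "alice"}, "long-lived"),
    ({"extension": ".pdf", "owner": "bob"}, "long-lived"),
]
tree = build_tree(samples, ["extension", "owner"])
prediction = predict(tree, {"extension": ".tmp", "owner": "carol"})
```

In this toy sample the extension separates the classes perfectly, so the learner splits on it and never consults the owner. That is the sense in which the tree filters out irrelevant attributes.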

Tests

- Based on several systems, to make sure the approach is workload-independent
  - DEAS03
  - EECS03
  - CAMPUS
  - LAB
- The control: the MODE algorithm, which places all files in a single cluster
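A control of this kind can be sketched as a predictor that ignores attributes entirely. The sketch assumes that the single cluster is labeled with the most common class seen in training; the slides only say that MODE places all files in one cluster, so the labeling rule, the function names, and the toy data are assumptions.

```python
from collections import Counter

def mode_predictor(training_labels):
    """Return a predictor that always answers with the most common training label."""
    most_common_label = Counter(training_labels).most_common(1)[0][0]
    return lambda attrs: most_common_label

def accuracy(predict, samples):
    """Fraction of (attrs, label) samples that predict() gets right."""
    hits = sum(1 for attrs, label in samples if predict(attrs) == label)
    return hits / len(samples)

# Hypothetical labeled samples; attributes are left empty because MODE never reads them.
samples = [({}, "small"), ({}, "small"), ({}, "small"), ({}, "large")]
baseline = mode_predictor([label for _, label in samples])
baseline_accuracy = accuracy(baseline, samples)
```

Any attribute-based model is worth its complexity only to the extent that it beats this baseline on the same samples.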

Results

- Prediction results are quite good
  - 90% to 100% accuracy claimed
- The clustering of files by attributes is clear
- Prediction: a model's ruleset will converge over time

Benefits of incremental learning

- Dynamically refines the model as samples become available
- Generally better than one-shot learners
  - One-shot learners sometimes perform poorly
- The rulesets of incremental learners are smaller

On accuracy

- More attributes mean a greater chance of over-fitting
  - More rules -> smaller ratios
  - Loses compression benefits
- Predictive models can make false predictions
  - These can impact performance
  - For example, data that should be in RAM is placed on disk instead
- Solution: cost functions
  - Penalize errors
  - Create a biased tree
  - System goals will need to be translated into the cost function
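One common way to bias a tree with a cost function is to label each leaf with the class that minimizes expected mis-prediction cost rather than with the majority class. The sketch below shows that rule for a single leaf. The function name, the cost table, and the numbers are hypothetical, and the slides do not state that the authors use this exact rule.

```python
def min_cost_label(class_counts, cost):
    """Pick the label with the lowest expected mis-prediction cost at one leaf.

    class_counts -- {actual_class: number of training samples at this leaf}
    cost         -- {(predicted, actual): penalty}; missing pairs cost 0
    """
    total = sum(class_counts.values())

    def expected_cost(predicted):
        return sum(n / total * cost.get((predicted, actual), 0.0)
                   for actual, n in class_counts.items())

    candidates = sorted({p for p, _ in cost} | set(class_counts))
    return min(candidates, key=expected_cost)

# Hypothetical leaf: 70 files belonged on disk, 30 belonged in RAM.
leaf = {"disk": 70, "ram": 30}

# Equal penalties reproduce the plain majority vote.
uniform = {("disk", "ram"): 1.0, ("ram", "disk"): 1.0}

# Assumed system goal: wrongly sending RAM-worthy data to disk is 5x worse.
biased = {("disk", "ram"): 5.0, ("ram", "disk"): 1.0}

majority_choice = min_cost_label(leaf, uniform)
biased_choice = min_cost_label(leaf, biased)
```

With equal penalties the leaf follows the majority and predicts disk. With the asymmetric penalties the same leaf predicts RAM, because the expected cost of a wrong disk placement (0.3 x 5.0) exceeds that of a wrong RAM placement (0.7 x 1.0).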

Conclusion

- Decision trees provide prediction accuracies in the 90% range
- Adaptable via incremental learning
- Continued work: integration into the self-* infrastructure

Questions?