File Classification in self-* storage systems
Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard,
Margo Seltzer
Introduction
- Self-* infrastructure needs information about users, applications, and policies
- This information is not readily provided, and the system cannot depend on users to provide it
- So? It must be learned
Self-* storage systems
- A sub-problem of the broader self-* infrastructure
- Key idea: get hints from what creators associate with their files
  - File sizes
  - File names
  - Lifetimes
- Once intentions are determined, decisions can be made
- Results: better file organization and performance
Classifying Files
- Current practice: rule-of-thumb policy selection, which is generic and not optimized
- Better: distinguish classes of files and apply finer-grained policies, ideally assigned at file creation
- To determine classes at creation time, the self-* system must learn this association
- Two sources: 1) traces, 2) a running file system
- So, how? Create a model that classifies files based on attributes such as:
  - Name
  - Owner
  - Permissions
- Irrelevant attributes must be filtered out; the classifier must learn rules to do so
- Rules are learned from training data; then inference happens on new files
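The first step, turning a file's creation-time attributes into features a classifier can consume, can be sketched as below. The specific fields (ext, dir_depth, dotfile, etc.) are illustrative assumptions, not the paper's exact feature set.

```python
import os
import stat

def file_features(path, mode=0o644, owner="alice"):
    """Turn a file's creation-time attributes into a feature dict.

    Attribute families (name parts, owner, permissions) follow the
    paper's idea; the exact fields chosen here are illustrative.
    """
    name = os.path.basename(path)
    _, ext = os.path.splitext(name)
    return {
        "ext": ext.lstrip(".").lower(),        # e.g. "log", "c", "tmp"
        "dir_depth": path.count("/"),          # how deep in the tree
        "owner": owner,
        "writable": bool(mode & stat.S_IWUSR), # owner-write bit
        "dotfile": name.startswith("."),
    }

print(file_features("/home/alice/src/main.c"))
```

A classifier downstream would then learn which of these features actually predict a file's class and filter out the rest.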
The right model
- The model must be: scalable, dynamic, cost-sensitive (accounting for misprediction cost), and interpretable by humans
- Model selected: decision trees
ABLE
Attribute-Based Learning Environment
1. Obtain traces
2. Build a decision tree
3. Make predictions
- The tree is grown top-down until all attributes are used, splitting samples until leaves contain files with similar attributes
- After the tree is created, querying begins
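The top-down splitting described above can be sketched as a minimal ID3-style decision tree over categorical file attributes. This is a toy illustration of the technique, not ABLE's implementation; the training samples and class names are invented.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(samples, attrs):
    """Grow the tree top-down: split on the attribute with the highest
    information gain until a leaf is pure or no attributes remain."""
    labels = [y for _, y in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority                       # leaf: a class label
    def gain(a):
        groups = {}
        for x, y in samples:
            groups.setdefault(x[a], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        return entropy(labels) - remainder
    best = max(attrs, key=gain)
    node = {"attr": best, "children": {}, "default": majority}
    branches = {}
    for x, y in samples:
        branches.setdefault(x[best], []).append((x, y))
    rest = [a for a in attrs if a != best]
    for value, subset in branches.items():
        node["children"][value] = build_tree(subset, rest)
    return node

def predict(tree, x):
    """Walk the tree; fall back to a node's majority class for
    attribute values never seen during training."""
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["attr"]], tree["default"])
    return tree

# Hypothetical training data: file attributes -> lifetime class.
samples = [
    ({"ext": "tmp", "writable": True},  "short-lived"),
    ({"ext": "tmp", "writable": False}, "short-lived"),
    ({"ext": "log", "writable": True},  "short-lived"),
    ({"ext": "c",   "writable": True},  "long-lived"),
    ({"ext": "c",   "writable": False}, "long-lived"),
]
tree = build_tree(samples, ["ext", "writable"])
print(predict(tree, {"ext": "c", "writable": True}))  # -> long-lived
```

Because the tree is a nested dict of attribute tests, its rules stay human-readable, which is one of the model requirements listed above.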
Tests
- Evaluated on traces from several systems to ensure workload independence: DEAS03, EECS03, CAMPUS, LAB
- The control: the MODE algorithm, which places all files in a single cluster
Results
- Prediction results are quite good: 90%-100% claimed
- Clustering of files by attributes is clear
- The authors predict that a model's ruleset will converge over time
Benefits of incremental learning
- Dynamically refines the model as new samples become available
- Generally better than one-shot learners, which sometimes perform poorly
- Rulesets of incremental learners are smaller
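The incremental-vs-one-shot contrast can be illustrated with a toy learner that keeps per-attribute-value class counts and refines them with every new sample. This is a hypothetical stand-in to show why continued refinement helps when workloads drift, not the paper's actual algorithm; the `.dat` stream below is invented.

```python
from collections import Counter, defaultdict

class IncrementalClassifier:
    """Toy incremental learner: per-attribute-value class counts,
    updated with each labeled sample as it arrives."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, value, label):
        self.counts[value][label] += 1

    def predict(self, value):
        seen = self.counts.get(value)
        return seen.most_common(1)[0][0] if seen else None

# A drifting workload: ".dat" files start out short-lived, then the
# application changes and they become long-lived.
stream = [("dat", "short-lived")] * 2 + [("dat", "long-lived")] * 3

one_shot, incremental = IncrementalClassifier(), IncrementalClassifier()
for value, label in stream[:2]:   # one-shot: trained once, early
    one_shot.update(value, label)
for value, label in stream:       # incremental: keeps learning
    incremental.update(value, label)

print(one_shot.predict("dat"))     # -> short-lived (stale)
print(incremental.predict("dat"))  # -> long-lived  (refined)
```

The one-shot model is frozen with its early, now-wrong view of `.dat` files, while the incremental model follows the drift.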
- On accuracy: more attributes increase the chance of over-fitting
- More rules lead to smaller ratios, losing the compression benefits
- Predictive models can make false predictions, which can impact performance
  - e.g., files that should be in RAM are placed on disk instead
- Solution: cost functions that penalize errors and create a biased tree; system goals will need to be translated into these costs
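The cost-function idea above can be sketched as a cost-weighted decision at a leaf: instead of predicting the majority class, pick the class that minimizes expected misclassification cost. The "hot"/"cold" classes and the cost values are invented for illustration.

```python
def cost_sensitive_class(leaf_counts, cost):
    """Choose the class at a leaf that minimizes expected
    misclassification cost, rather than the plain majority class.

    leaf_counts: {true_class: sample count} observed at the leaf.
    cost: {(true_class, predicted_class): penalty}; 0 if omitted.
    """
    total = sum(leaf_counts.values())
    def expected(pred):
        return sum(n / total * cost.get((true, pred), 0)
                   for true, n in leaf_counts.items())
    return min(leaf_counts, key=expected)

# Hypothetical leaf: 3 "cold" files, 2 "hot" files. Majority vote says
# "cold" (place on disk), but stranding a hot file on disk (cost 10)
# hurts far more than caching a cold file in RAM (cost 1).
cost = {("hot", "cold"): 10.0, ("cold", "hot"): 1.0}
print(cost_sensitive_class({"cold": 3, "hot": 2}, cost))  # -> hot
```

This is how a system goal ("never put hot data on disk") gets translated into a bias in the tree: the asymmetric penalties steer predictions away from the expensive mistake.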
Conclusion
- These trees provide prediction accuracies in the 90% range
- Adaptable via incremental learning
- Continued work: integration into the self-* infrastructure
Questions?