Open Machine Learning
Transcript of Open Machine Learning
The open experiment database: meta-learning for the masses
Joaquin Vanschoren @joavanschoren
The Polymath story
Tim Gowers
Machine Learning: are we doing it right?
Computer Science
• The scientific method:
• Make a hypothesis about the world
• Generate predictions based on this hypothesis
• Design experiments to verify/falsify the predictions
• Predictions verified: the hypothesis might be true
• Predictions falsified: the hypothesis is wrong
Computer Science
• The scientific method (for ML):
• Make a hypothesis about (the structure of) the given data
• Generate models based on this hypothesis
• Design experiments to measure the accuracy of the models
• Good performance: it works (on this data)
• Bad performance: it doesn't work (on this data)
• Aggregates ("it works 60% of the time") are not useful
How can the data on which an algorithm works well be characterized? What is the effect of parameter settings?
Meta-Learning
• The science of understanding which algorithms work well on which types of data (a minimal sketch follows below)
• Hard: it requires a thorough understanding of both data and algorithms
• Requires good data: extensive experimentation
• Why is this separate from other ML research?
• A thorough algorithm evaluation is, in effect, a meta-learning study
• The original authors know their algorithms and data best, have large sets of experiments, and are (presumably) interested in knowing on which data their algorithms work well (or not)
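To make this concrete, here is a minimal meta-learning sketch in Python. The meta-features, the tiny meta-dataset, and the algorithm labels are illustrative assumptions, not material from the talk:

```python
# Meta-learning sketch: characterize datasets with simple meta-features and
# learn to predict which algorithm performs best (all data is hypothetical).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def meta_features(X, y):
    """A few simple dataset characterizations (meta-features)."""
    n, p = X.shape
    class_freq = np.bincount(y) / n
    class_entropy = -np.sum(class_freq * np.log2(class_freq + 1e-12))
    return [n, p, len(class_freq), class_entropy]

# Hypothetical meta-dataset: one row per dataset (n, p, #classes, entropy),
# labeled with the algorithm that won on that dataset.
meta_X = np.array([[1000, 10, 2, 0.9], [150, 4, 3, 1.5], [60000, 784, 10, 3.3]])
meta_y = np.array(["SVM", "NaiveBayes", "RandomForest"])

meta_model = DecisionTreeClassifier().fit(meta_X, meta_y)
print(meta_model.predict([[5000, 20, 2, 1.0]]))  # predicted best algorithm
```

Each dataset becomes a single training example for the meta-model, which is exactly why meta-learning needs the large, shared experiment collections discussed next.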
Meta-Learning
With the right tools, can we make everyone a meta-learner?
[Diagram: ML algorithm design, large sets of experiments, datasets, and source code feed into meta-learning, yielding algorithm selection, algorithm characterization, data characterization, bias-variance analysis, learning curves, algorithm comparison, data insight, and algorithm insight.]
Open Machine Learning
Open science
• The World-Wide Telescope (astronomy)
• Microarray databases (gene expression)
• GenBank (genetic sequences)
Open machine learning?
• We can also be "open":
• Simple, common formats to describe experiments, workflows, algorithms, ...
• A platform to share, store, query, and interact
• We can go (much) further:
• Share experiments automatically (open-source ML tools)
• Experiment on the fly (cheap: no expensive instruments needed)
• Controlled experimentation (an experimentation engine)
Formalizing machine learning
• Unique names for algorithms, datasets, evaluation measures, data characterizations,... (ontology)
• Based on DMOP, OntoDM, KDOntology, EXPO,...
• Simple, structured way to describe algorithm setups, workflows and experiment runs
• Detailed enough to reproduce all experiments
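A hedged sketch of what unique naming buys us: if every implementation resolves to one ontology concept, experiments from different tools become comparable. The mapping below is invented for illustration; it is not the actual ontology:

```python
# Illustrative (not actual) mapping from tool-specific implementations to
# unique algorithm names drawn from an ontology.
ONTOLOGY = {
    "Weka.SMO": "SupportVectorMachine",
    "Weka.J48": "C4.5DecisionTree",
    "R.kernlab.ksvm": "SupportVectorMachine",
}

def concept_of(implementation: str) -> str:
    """Resolve an implementation to its unique algorithm concept."""
    return ONTOLOGY.get(implementation, "UnknownAlgorithm")

# Weka's SMO and R's ksvm now count as runs of the same algorithm.
print(concept_of("Weka.SMO"), concept_of("R.kernlab.ksvm"))
```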
Run
• A run is the execution of a predefined setup
• It records the setup, the machine it ran on, its input data, and its output data (sketched below)
• Also: start time, author, status, ...
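A minimal sketch of the run record in Python; the field names are assumptions based on the slide, not the actual schema:

```python
# Run: the execution of a predefined setup, on a machine, with in/out data.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Run:
    setup_id: str                              # the setup that was executed
    machine: str                               # where it ran
    input_data: List[str] = field(default_factory=list)
    output_data: List[str] = field(default_factory=list)
    start_time: Optional[datetime] = None
    author: str = ""
    status: str = "pending"                    # e.g. pending/running/finished
```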
Setup
• A setup is a plan of what we want to do
• Kinds of setups: algorithm setups, function setups, workflows, experiments
• Hierarchical: a setup can be part of a larger setup
• Parameterized: a setup can contain parameter settings (p=...)
• Abstract/concrete: a setup may be abstract (parameters left open) or concrete (fully defined; see the sketch below)
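A sketch of the hierarchical, parameterized setup model (class and field names are assumptions):

```python
# Setup: a plan of what to do; hierarchical (parts) and parameterized.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Setup:
    name: str
    parameters: Dict[str, object] = field(default_factory=dict)  # p=...
    parts: List["Setup"] = field(default_factory=list)           # sub-setups

    def is_concrete(self) -> bool:
        """Concrete (executable) only if no parameter is left open."""
        return (all(v is not None for v in self.parameters.values())
                and all(part.is_concrete() for part in self.parts))

svm = Setup("Weka.SMO", parameters={"C": None})  # abstract: C is open
print(svm.is_concrete())                         # False
svm.parameters["C"] = 0.01
print(svm.is_concrete())                         # True
```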
Algorithm Setup
• A fully defined algorithm configuration
• Parts: an implementation, parameter settings (p=...), and function setups, e.g. a kernel f(x)
• Unique names link each implementation to its algorithm, each parameter setting to its parameter, and each function setup to its mathematical function; algorithm qualities describe the algorithm itself
• Roles: learner, base-learner, kernel, ... (example below)
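In the same spirit, a sketch of an algorithm setup with a role-tagged function setup; the parameter values come from the workflow example later in the talk, the structure is an assumption:

```python
# Algorithm setup: implementation + parameter settings + function setups,
# where each function setup plays a role (here: "kernel").
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AlgorithmSetup:
    implementation: str                        # e.g. "Weka.SMO" (unique name)
    parameters: Dict[str, object] = field(default_factory=dict)
    functions: List[Tuple[str, "AlgorithmSetup"]] = field(default_factory=list)

rbf = AlgorithmSetup("Weka.RBF", parameters={"G": 0.01})
smo = AlgorithmSetup("Weka.SMO", parameters={"C": 0.01},
                     functions=[("kernel", rbf)])   # RBF fills the kernel role
```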
Workflow Setup
• A workflow is a setup composed of algorithm setups
• Workflow: components, connections (each with a source and a target), and parameters (inputs)
• Also: ports, datatypes (sketched below)
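A sketch of a workflow as components plus typed connections (names are assumptions; compare the Weka example on the next slide):

```python
# Workflow: components, connections (source -> target), parameters (inputs).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Connection:
    source: str      # "component:port", e.g. "loadData:data"
    target: str      # e.g. "crossValidate:data"
    datatype: str    # e.g. "Weka.Instances"

@dataclass
class Workflow:
    name: str
    components: Dict[str, object] = field(default_factory=dict)
    connections: List[Connection] = field(default_factory=list)
    parameters: Dict[str, object] = field(default_factory=dict)  # inputs
```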
Workflow Example
[Diagram: workflow 1:mainFlow. Component 2:loadData (Weka.ARFFLoader, p: location=http://..., logRuns=true) loads a dataset from a URL and passes its data output to component 3:crossValidate (Weka.Evaluation, p: F=10, logRuns=true), which cross-validates component 4:learner (Weka.SMO, p: C=0.01, logRuns=false) with component 5:kernel (Weka.RBF, p: G=0.01, S=1) in the kernel role. The workflow takes a url parameter as input and produces evaluations and predictions as outputs.]
[Diagram: a run of this workflow, with the input data (Weka.Instances) flowing through and the resulting evaluations and predictions logged as output data.]
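Continuing the sketch classes from the previous slides, the example workflow could be assembled like this (parameter values are from the slide; the structure remains an assumption):

```python
# The example workflow, built from the AlgorithmSetup/Workflow sketches above.
loader = AlgorithmSetup("Weka.ARFFLoader", parameters={"location": "http://..."})
rbf = AlgorithmSetup("Weka.RBF", parameters={"G": 0.01, "S": 1})
smo = AlgorithmSetup("Weka.SMO", parameters={"C": 0.01},
                     functions=[("kernel", rbf)])
cv = AlgorithmSetup("Weka.Evaluation", parameters={"F": 10})

main_flow = Workflow(
    name="mainFlow",
    components={"loadData": loader, "crossValidate": cv, "learner": smo},
    connections=[Connection("loadData:data", "crossValidate:data",
                            "Weka.Instances")],
    parameters={"url": None},   # workflow input, left open (abstract)
)
```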
Experiment Setup
• An experiment is a setup composed of workflows and other setups
• Variables <X>: labeled tuples which can be referenced in setups (example below)
• Also: experiment design, description, literature reference, author, ...
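Reusing the Workflow sketch, an experiment with a variable might look like this (the variable mechanism is paraphrased from the slide; names are assumptions):

```python
# Experiment: a setup over workflows, with labeled variables to sweep.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Experiment:
    name: str
    workflow: "Workflow"                       # from the earlier sketch
    variables: Dict[str, List[object]] = field(default_factory=dict)
    description: str = ""
    author: str = ""

# Vary the SVM's C parameter; each value yields one concrete workflow to run.
exp = Experiment("svm_c_sweep", main_flow,
                 variables={"C": [0.01, 0.1, 1, 10]})
```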
Run
• A run consumes and produces data: datasets, evaluations, models, predictions
• Output data keeps a link (source) to the run that produced it
• Data qualities characterize each dataset
ExpML
[The example workflow above, serialized in ExpML, the XML-based format for describing setups and runs.]
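A sketch of what an ExpML-like serialization could look like, generated with Python's standard library; the element and attribute names are assumptions, not the actual ExpML schema:

```python
# Emit an ExpML-like XML description of part of the example workflow.
import xml.etree.ElementTree as ET

flow = ET.Element("workflow", name="mainFlow")
load = ET.SubElement(flow, "component", id="2:loadData",
                     implementation="Weka.ARFFLoader")
ET.SubElement(load, "parameter", name="location", value="http://...")
cv = ET.SubElement(flow, "component", id="3:crossValidate",
                   implementation="Weka.Evaluation")
ET.SubElement(cv, "parameter", name="F", value="10")
ET.SubElement(flow, "connection", source="2:loadData/data",
              target="3:crossValidate/data", datatype="Weka.Instances")

print(ET.tostring(flow, encoding="unicode"))
```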
Demo (preview)
Learning curves
[Plot: predictive accuracy (0.2 to 1.0) versus percentage of original dataset size (10% to 100%) for RandomForest, C4.5, LogisticRegression, RacedIncrementalLogitBoost (stump), NaiveBayes, and SVM (RBF).]
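A minimal sketch of how such a curve is computed: train on growing fractions of the data and track test accuracy (synthetic data here; the slide's datasets and exact setup are not reproduced):

```python
# Learning curve: accuracy as a function of training-set size.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for pct in range(10, 101, 10):
    n = int(len(X_train) * pct / 100)
    model = RandomForestClassifier(random_state=0).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{pct:3d}% of data -> accuracy {acc:.3f}")
```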
Examples
When does one algorithm outperform another?
Examples
Bias-variance profile + the effect of dataset size (e.g. boosting vs. bagging)
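One hedged way to estimate such a bias-variance profile is by resampling: train many models on bootstrap samples, then split the error into the error of the aggregated prediction (bias-like) and the disagreement around it (variance-like). Synthetic data; this is one common decomposition for 0-1 loss, not necessarily the one used in the talk:

```python
# Rough bias-variance estimate for a classifier via bootstrap resampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=500,
                                                  random_state=0)

rng = np.random.default_rng(0)
preds = []
for _ in range(50):                      # 50 bootstrap training sets
    idx = rng.integers(0, len(X_pool), size=len(X_pool))
    model = DecisionTreeClassifier().fit(X_pool[idx], y_pool[idx])
    preds.append(model.predict(X_test))
preds = np.array(preds)                  # shape: (50, n_test)

main_pred = np.round(preds.mean(axis=0))   # majority vote per test point
bias = np.mean(main_pred != y_test)        # error of the aggregated model
variance = np.mean(preds != main_pred)     # disagreement with the vote
print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")
```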
Taking it further
Seamless integration:
• A web service for sharing and querying experiments (sketched below)
• Integrate experiment sharing into ML tools (WEKA, KNIME, RapidMiner, R, ...)
• Mapping of implementations, evaluation measures, ...
• An online platform for custom querying and community interaction
• A semantic wiki: algorithm/data descriptions, rankings, ...
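A sketch of what querying such a web service could look like. The endpoint, query parameters, and JSON shape below are hypothetical, invented for illustration; they are not the actual expdb API:

```python
# Query a (hypothetical) experiment-database web service for shared results.
import json
import urllib.parse
import urllib.request

query = {"algorithm": "SupportVectorMachine",
         "measure": "predictive_accuracy"}
url = ("http://expdb.example.org/api/experiments?"
       + urllib.parse.urlencode(query))

with urllib.request.urlopen(url) as resp:      # fetch matching runs
    results = json.load(resp)

for run in results.get("runs", []):
    print(run.get("dataset"), run.get("value"))
```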
Experimentation Engine
• Controlled experimentation (cf. Delve, MLComp):
• Download datasets, build training/test sets
• Feed training and test sets to algorithms, retrieve predictions/models
• Run a broad set of evaluation measures (see the sketch below)
• Benchmarking (cross-validation), learning curve analysis, bias-variance analysis, workflows(!)
• Compute data properties for new datasets
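A minimal sketch of one engine iteration, with scikit-learn standing in for the actual engine: load a dataset, cross-validate an algorithm, and collect several evaluation measures (the parameter values echo the earlier workflow example):

```python
# One controlled experiment: 10-fold CV of an RBF-SVM, several measures.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_validate(SVC(C=0.01, kernel="rbf"), X, y, cv=10,
                        scoring=["accuracy", "f1_macro"])

for measure, values in scores.items():
    print(measure, values.mean())
```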
Why would you use it? (seeding)
• Let the system run the experiments for you
• Immediate, highly detailed benchmarks (no repeated work)
• Up-to-date, detailed results (vs. static, aggregated results in journals)
• All your results organized online (privately, if you wish), anytime, anywhere
• Interact with people (weird results?)
• Get credit for all your results (e.g. citations), including unexpected ones
• Visibility, new collaborations
• Check whether your algorithm is really the best (e.g. via active testing)
• On which datasets does it perform well or badly?
Question
Is open machine learning possible?
http://expdb.cs.kuleuven.be
Thanks
Gracias · Xie Xie · Danke · Dank U · Merci · Efharisto · Dhanyavaad · Grazie · Spasiba · Kia ora · Teşekkürler · Diolch · Köszönöm · Arigato · Hvala · Toda