Open Machine Learning
Transcript of Open Machine Learning
The open experiment database: meta-learning for the masses
Joaquin Vanschoren @joavanschoren
The Polymath story
Tim Gowers
Machine Learning: are we doing it right?
Computer Science
• The scientific method:
• Make a hypothesis about the world
• Generate predictions based on this hypothesis
• Design experiments to verify/falsify the predictions
• Predictions verified: the hypothesis might be true
• Predictions falsified: the hypothesis is wrong
Computer Science
• The scientific method (for ML):
• Make a hypothesis about (the structure of) the given data
• Generate models based on this hypothesis
• Design experiments to measure the accuracy of the models
• Good performance: it works (on this data)
• Bad performance: it doesn't work (on this data)
• Aggregates ("it works 60% of the time") are not useful
How can the data on which an algorithm works well be characterized? What is the effect of parameter settings?
Meta-Learning
• The science of understanding which algorithms work well on which types of data (a minimal sketch follows below)
• Hard: it requires a thorough understanding of both data and algorithms
• Requires good data: extensive experimentation
• Why is this separate from other ML research?
• A thorough algorithm evaluation is, in effect, a meta-learning study
• The original authors know their algorithms and data best, have large sets of experiments, and are (presumably) interested in knowing on which data their algorithms work well (or not)
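To make this concrete, here is a minimal meta-learning sketch in Python. The meta-features, the tiny meta-dataset, and the algorithm labels are illustrative assumptions, not material from the talk:

```python
# Meta-learning sketch: characterize datasets with simple meta-features and
# learn to predict which algorithm performs best (all data is hypothetical).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def meta_features(X, y):
    """A few simple dataset characterizations (meta-features)."""
    n, p = X.shape
    class_freq = np.bincount(y) / n
    class_entropy = -np.sum(class_freq * np.log2(class_freq + 1e-12))
    return [n, p, len(class_freq), class_entropy]

# Hypothetical meta-dataset: one row per dataset (n, p, #classes, entropy),
# labeled with the algorithm that won on that dataset.
meta_X = np.array([[1000, 10, 2, 0.9], [150, 4, 3, 1.5], [60000, 784, 10, 3.3]])
meta_y = np.array(["SVM", "NaiveBayes", "RandomForest"])

meta_model = DecisionTreeClassifier().fit(meta_X, meta_y)
print(meta_model.predict([[5000, 20, 2, 1.0]]))  # predicted best algorithm
```

Each dataset becomes a single training example for the meta-model, which is exactly why meta-learning needs the large, shared experiment collections discussed next.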
Meta-Learning
With the right tools, can we make everyone a meta-learner?
[Diagram: ML algorithm design, large sets of experiments, datasets, and source code feed into meta-learning, yielding algorithm selection, algorithm characterization, data characterization, bias-variance analysis, learning curves, algorithm comparison, data insight, and algorithm insight.]
Open Machine Learning
Open science
• The World-Wide Telescope (astronomy)
• Microarray databases (gene expression)
• GenBank (genetic sequences)
Open machine learning?
• We can also be "open":
• Simple, common formats to describe experiments, workflows, algorithms, ...
• A platform to share, store, query, and interact
• We can go (much) further:
• Share experiments automatically (open-source ML tools)
• Experiment on the fly (cheap: no expensive instruments needed)
• Controlled experimentation (an experimentation engine)
Formalizing machine learning
• Unique names for algorithms, datasets, evaluation measures, data characterizations,... (ontology)
• Based on DMOP, OntoDM, KDOntology, EXPO,...
• Simple, structured way to describe algorithm setups, workflows and experiment runs
• Detailed enough to reproduce all experiments
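A hedged sketch of what unique naming buys us: if every implementation resolves to one ontology concept, experiments from different tools become comparable. The mapping below is invented for illustration; it is not the actual ontology:

```python
# Illustrative (not actual) mapping from tool-specific implementations to
# unique algorithm names drawn from an ontology.
ONTOLOGY = {
    "Weka.SMO": "SupportVectorMachine",
    "Weka.J48": "C4.5DecisionTree",
    "R.kernlab.ksvm": "SupportVectorMachine",
}

def concept_of(implementation: str) -> str:
    """Resolve an implementation to its unique algorithm concept."""
    return ONTOLOGY.get(implementation, "UnknownAlgorithm")

# Weka's SMO and R's ksvm now count as runs of the same algorithm.
print(concept_of("Weka.SMO"), concept_of("R.kernlab.ksvm"))
```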
Run
• A run is the execution of a predefined setup
• It records the setup, the machine it ran on, its input data, and its output data (sketched below)
• Also: start time, author, status, ...
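A minimal sketch of the run record in Python; the field names are assumptions based on the slide, not the actual schema:

```python
# Run: the execution of a predefined setup, on a machine, with in/out data.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Run:
    setup_id: str                              # the setup that was executed
    machine: str                               # where it ran
    input_data: List[str] = field(default_factory=list)
    output_data: List[str] = field(default_factory=list)
    start_time: Optional[datetime] = None
    author: str = ""
    status: str = "pending"                    # e.g. pending/running/finished
```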
Setup
• A setup is a plan of what we want to do
• Kinds of setups: algorithm setups, function setups, workflows, experiments
• Hierarchical: a setup can be part of a larger setup
• Parameterized: a setup can contain parameter settings (p=...)
• Abstract/concrete: a setup may be abstract (parameters left open) or concrete (fully defined; see the sketch below)
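A sketch of the hierarchical, parameterized setup model (class and field names are assumptions):

```python
# Setup: a plan of what to do; hierarchical (parts) and parameterized.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Setup:
    name: str
    parameters: Dict[str, object] = field(default_factory=dict)  # p=...
    parts: List["Setup"] = field(default_factory=list)           # sub-setups

    def is_concrete(self) -> bool:
        """Concrete (executable) only if no parameter is left open."""
        return (all(v is not None for v in self.parameters.values())
                and all(part.is_concrete() for part in self.parts))

svm = Setup("Weka.SMO", parameters={"C": None})  # abstract: C is open
print(svm.is_concrete())                         # False
svm.parameters["C"] = 0.01
print(svm.is_concrete())                         # True
```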
Algorithm Setup
• A fully defined algorithm configuration
• Parts: an implementation, parameter settings (p=...), and function setups, e.g. a kernel f(x)
• Unique names link each implementation to its algorithm, each parameter setting to its parameter, and each function setup to its mathematical function; algorithm qualities describe the algorithm itself
• Roles: learner, base-learner, kernel, ... (example below)
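In the same spirit, a sketch of an algorithm setup with a role-tagged function setup; the parameter values come from the workflow example later in the talk, the structure is an assumption:

```python
# Algorithm setup: implementation + parameter settings + function setups,
# where each function setup plays a role (here: "kernel").
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AlgorithmSetup:
    implementation: str                        # e.g. "Weka.SMO" (unique name)
    parameters: Dict[str, object] = field(default_factory=dict)
    functions: List[Tuple[str, "AlgorithmSetup"]] = field(default_factory=list)

rbf = AlgorithmSetup("Weka.RBF", parameters={"G": 0.01})
smo = AlgorithmSetup("Weka.SMO", parameters={"C": 0.01},
                     functions=[("kernel", rbf)])   # RBF fills the kernel role
```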
Workflow Setup
• A workflow is a setup composed of algorithm setups
• Workflow: components, connections (each with a source and a target), and parameters (inputs)
• Also: ports, datatypes (sketched below)
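A sketch of a workflow as components plus typed connections (names are assumptions; compare the Weka example on the next slide):

```python
# Workflow: components, connections (source -> target), parameters (inputs).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Connection:
    source: str      # "component:port", e.g. "loadData:data"
    target: str      # e.g. "crossValidate:data"
    datatype: str    # e.g. "Weka.Instances"

@dataclass
class Workflow:
    name: str
    components: Dict[str, object] = field(default_factory=dict)
    connections: List[Connection] = field(default_factory=list)
    parameters: Dict[str, object] = field(default_factory=dict)  # inputs
```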
Workflow Example
[Diagram: workflow 1:mainFlow. Component 2:loadData (Weka.ARFFLoader, p: location=http://..., logRuns=true) loads a dataset from a URL and passes its data output to component 3:crossValidate (Weka.Evaluation, p: F=10, logRuns=true), which cross-validates component 4:learner (Weka.SMO, p: C=0.01, logRuns=false) with component 5:kernel (Weka.RBF, p: G=0.01, S=1) in the kernel role. The workflow takes a url parameter as input and produces evaluations and predictions as outputs.]
[Diagram: a run of this workflow, with the input data (Weka.Instances) flowing through and the resulting evaluations and predictions logged as output data.]
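Continuing the sketch classes from the previous slides, the example workflow could be assembled like this (parameter values are from the slide; the structure remains an assumption):

```python
# The example workflow, built from the AlgorithmSetup/Workflow sketches above.
loader = AlgorithmSetup("Weka.ARFFLoader", parameters={"location": "http://..."})
rbf = AlgorithmSetup("Weka.RBF", parameters={"G": 0.01, "S": 1})
smo = AlgorithmSetup("Weka.SMO", parameters={"C": 0.01},
                     functions=[("kernel", rbf)])
cv = AlgorithmSetup("Weka.Evaluation", parameters={"F": 10})

main_flow = Workflow(
    name="mainFlow",
    components={"loadData": loader, "crossValidate": cv, "learner": smo},
    connections=[Connection("loadData:data", "crossValidate:data",
                            "Weka.Instances")],
    parameters={"url": None},   # workflow input, left open (abstract)
)
```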
Experiment Setup
• An experiment is a setup composed of workflows and other setups
• Variables <X>: labeled tuples which can be referenced in setups (example below)
• Also: experiment design, description, literature reference, author, ...
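Reusing the Workflow sketch, an experiment with a variable might look like this (the variable mechanism is paraphrased from the slide; names are assumptions):

```python
# Experiment: a setup over workflows, with labeled variables to sweep.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Experiment:
    name: str
    workflow: "Workflow"                       # from the earlier sketch
    variables: Dict[str, List[object]] = field(default_factory=dict)
    description: str = ""
    author: str = ""

# Vary the SVM's C parameter; each value yields one concrete workflow to run.
exp = Experiment("svm_c_sweep", main_flow,
                 variables={"C": [0.01, 0.1, 1, 10]})
```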
Run
• A run consumes and produces data: datasets, evaluations, models, predictions
• Output data keeps a link (source) to the run that produced it
• Data qualities characterize each dataset
ExpML
[The example workflow above, serialized in ExpML, the XML-based format for describing setups and runs.]
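A sketch of what an ExpML-like serialization could look like, generated with Python's standard library; the element and attribute names are assumptions, not the actual ExpML schema:

```python
# Emit an ExpML-like XML description of part of the example workflow.
import xml.etree.ElementTree as ET

flow = ET.Element("workflow", name="mainFlow")
load = ET.SubElement(flow, "component", id="2:loadData",
                     implementation="Weka.ARFFLoader")
ET.SubElement(load, "parameter", name="location", value="http://...")
cv = ET.SubElement(flow, "component", id="3:crossValidate",
                   implementation="Weka.Evaluation")
ET.SubElement(cv, "parameter", name="F", value="10")
ET.SubElement(flow, "connection", source="2:loadData/data",
              target="3:crossValidate/data", datatype="Weka.Instances")

print(ET.tostring(flow, encoding="unicode"))
```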
Demo (preview)
Learning curves
[Plot: predictive accuracy (0.2 to 1.0) versus percentage of original dataset size (10% to 100%) for RandomForest, C4.5, LogisticRegression, RacedIncrementalLogitBoost (stump), NaiveBayes, and SVM (RBF).]
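A minimal sketch of how such a curve is computed: train on growing fractions of the data and track test accuracy (synthetic data here; the slide's datasets and exact setup are not reproduced):

```python
# Learning curve: accuracy as a function of training-set size.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for pct in range(10, 101, 10):
    n = int(len(X_train) * pct / 100)
    model = RandomForestClassifier(random_state=0).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{pct:3d}% of data -> accuracy {acc:.3f}")
```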
Examples
When does one algorithm outperform another?
Examples
Bias-variance profile + the effect of dataset size (e.g. boosting vs. bagging)
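One hedged way to estimate such a bias-variance profile is by resampling: train many models on bootstrap samples, then split the error into the error of the aggregated prediction (bias-like) and the disagreement around it (variance-like). Synthetic data; this is one common decomposition for 0-1 loss, not necessarily the one used in the talk:

```python
# Rough bias-variance estimate for a classifier via bootstrap resampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=500,
                                                  random_state=0)

rng = np.random.default_rng(0)
preds = []
for _ in range(50):                      # 50 bootstrap training sets
    idx = rng.integers(0, len(X_pool), size=len(X_pool))
    model = DecisionTreeClassifier().fit(X_pool[idx], y_pool[idx])
    preds.append(model.predict(X_test))
preds = np.array(preds)                  # shape: (50, n_test)

main_pred = np.round(preds.mean(axis=0))   # majority vote per test point
bias = np.mean(main_pred != y_test)        # error of the aggregated model
variance = np.mean(preds != main_pred)     # disagreement with the vote
print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")
```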
Taking it further
Seamless integration:
• A web service for sharing and querying experiments (sketched below)
• Integrate experiment sharing into ML tools (WEKA, KNIME, RapidMiner, R, ...)
• Mapping of implementations, evaluation measures, ...
• An online platform for custom querying and community interaction
• A semantic wiki: algorithm/data descriptions, rankings, ...
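A sketch of what querying such a web service could look like. The endpoint, query parameters, and JSON shape below are hypothetical, invented for illustration; they are not the actual expdb API:

```python
# Query a (hypothetical) experiment-database web service for shared results.
import json
import urllib.parse
import urllib.request

query = {"algorithm": "SupportVectorMachine",
         "measure": "predictive_accuracy"}
url = ("http://expdb.example.org/api/experiments?"
       + urllib.parse.urlencode(query))

with urllib.request.urlopen(url) as resp:      # fetch matching runs
    results = json.load(resp)

for run in results.get("runs", []):
    print(run.get("dataset"), run.get("value"))
```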
Experimentation Engine
• Controlled experimentation (cf. Delve, MLComp):
• Download datasets, build training/test sets
• Feed training and test sets to algorithms, retrieve predictions/models
• Run a broad set of evaluation measures (see the sketch below)
• Benchmarking (cross-validation), learning curve analysis, bias-variance analysis, workflows(!)
• Compute data properties for new datasets
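A minimal sketch of one engine iteration, with scikit-learn standing in for the actual engine: load a dataset, cross-validate an algorithm, and collect several evaluation measures (the parameter values echo the earlier workflow example):

```python
# One controlled experiment: 10-fold CV of an RBF-SVM, several measures.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_validate(SVC(C=0.01, kernel="rbf"), X, y, cv=10,
                        scoring=["accuracy", "f1_macro"])

for measure, values in scores.items():
    print(measure, values.mean())
```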
Why would you use it? (seeding)
• Let the system run the experiments for you
• Immediate, highly detailed benchmarks (no repeated work)
• Up-to-date, detailed results (vs. static, aggregated results in journals)
• All your results organized online (privately, if you wish), anytime, anywhere
• Interact with people (weird results?)
• Get credit for all your results (e.g. citations), including unexpected ones
• Visibility, new collaborations
• Check whether your algorithm is really the best (e.g. via active testing)
• On which datasets does it perform well or badly?
Question
Is open machine learning possible?
http://expdb.cs.kuleuven.be
Thanks
Gracias · Xie Xie · Danke · Dank U · Merci · Efharisto · Dhanyavaad · Grazie · Spasiba · Kia ora · Teşekkürler · Diolch · Köszönöm · Arigato · Hvala · Toda