Quick presentation for the OpenML workshop in Eindhoven 2014
Manuel Martín Salvador, [email protected]
OpenML workshop, Eindhoven, 21/10/2014
Background
- MSc in Computer Engineering
- Master in Soft Computing and Intelligent Systems

Currently
- PhD student: automatic and adaptive pre-processing for building predictive models
- Teaching: Data Mining lab
Data preparation and pre-processing
Labour-intensive tasks (up to 80% of a data mining process)
Automating pre-processing
- Many techniques available
- No free lunch
- Multiple possible combinations
- The order of pre-processing methods matters
- No semantics → some approaches use ontologies
- Meta-learning → needs a good database of experiments
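The combinatorial point above can be illustrated with a small sketch (the method names are illustrative, not taken from the talk): even a handful of pre-processing methods yields many candidate pipelines once ordering matters.

```python
from itertools import permutations

# Hypothetical pre-processing methods (illustrative names only).
methods = ["impute", "discretize", "normalize", "select_features", "pca"]

# Because the order of pre-processing steps can change the result,
# every ordering of the five methods is a distinct candidate pipeline.
pipelines = list(permutations(methods))
print(len(pipelines))  # 5! = 120 orderings, before tuning any parameters
```

This is why exhaustive search quickly becomes infeasible and a database of past experiments (meta-learning) is attractive.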
Scientific workflow platforms and repositories with experiments
Software                   | Repository            | Applications
DiscoveryNet               | (inactive)            | -
Kepler                     | -                     | Various
Taverna                    | MyExperiment (open)   | Bioinformatics
Pegasus                    | -                     | Various
Galaxy                     | -                     | Biomedical
Pipeline Pilot             | Accelrys (commercial) | -
* MLComp                   | ("open")              | Machine Learning
Weka, MOA, R, RapidMiner   | OpenML (open)         | Machine Learning
OpenML statistics
Datasets: 1042
Tasks: 3025
Flows: 640
Runs: 31540 (valid: 24410, with errors: 7130)
Datasets: 300
Individual components: 136
Paired components: 635
"Flow size": 1 – 8198, 2 – 12178, 3 – 1993, 4 – 1533, 5 – 502, 6 – 6
[Chart: Distribution of components]
[Chart: Distribution of datasets]
Only 3 Weka filters: Principal Components, Discretize, PLSFilter
TO DO
How to increase the number of pre-processing methods in OpenML?
- The only way right now is using FilteredClassifier in Weka
- What about R, MOA, RapidMiner?
Improving flow representation
- Right now it is difficult to see how components are connected
- Clear distinction of parameters
- What about including Weka flows (XML-based) and ADAMS flows?
- PMML support?
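One way to make component connections explicit (a minimal sketch, not OpenML's actual schema; component and parameter names are made up) is to store a flow as a directed acyclic graph: components plus edges. The execution order then falls out of a topological sort.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Minimal sketch of a flow as a DAG. Names are illustrative only.
flow = {
    "components": {
        "load":       {"type": "DataLoader", "params": {}},
        "discretize": {"type": "Discretize", "params": {"bins": 10}},
        "classify":   {"type": "NaiveBayes", "params": {}},
    },
    # Edges make the connections between components explicit.
    "edges": [("load", "discretize"), ("discretize", "classify")],
}

def execution_order(flow):
    """Topological sort: the order in which components must run."""
    ts = TopologicalSorter()
    for src, dst in flow["edges"]:
        ts.add(dst, src)  # dst depends on src
    return list(ts.static_order())

print(execution_order(flow))  # ['load', 'discretize', 'classify']
```

An explicit edge list also makes parameter provenance and flow comparison straightforward, which a flat component listing does not.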
Statistics for available data, tasks, flows and runs
Flow recommendation system for a given dataset
[dataset, data characteristics, prediction accuracy, flow_id]
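Such a recommender could be sketched as nearest-neighbour meta-learning: find the past dataset whose characteristics are most similar, then suggest the flow that achieved the best accuracy on it. The meta-features and accuracies below are made-up toy data, not real OpenML results.

```python
import math

# Toy experiment database: (dataset meta-features, prediction accuracy, flow_id).
# All numbers are illustrative.
experiments = [
    ({"n_instances": 150,  "n_features": 4},  0.95, "flow_12"),
    ({"n_instances": 150,  "n_features": 4},  0.88, "flow_7"),
    ({"n_instances": 5000, "n_features": 60}, 0.81, "flow_7"),
    ({"n_instances": 4800, "n_features": 55}, 0.74, "flow_12"),
]

def recommend_flow(meta, experiments):
    """Return the flow_id of the best run on the most similar past dataset."""
    def dist(a, b):
        return math.hypot(a["n_instances"] - b["n_instances"],
                          a["n_features"] - b["n_features"])
    nearest = min(experiments, key=lambda e: dist(meta, e[0]))[0]
    best_on_nearest = max((e for e in experiments if e[0] == nearest),
                          key=lambda e: e[1])
    return best_on_nearest[2]

print(recommend_flow({"n_instances": 140, "n_features": 5}, experiments))
# → flow_12
```

A real system would normalise the meta-features and use many more of them, but the [dataset, data characteristics, accuracy, flow_id] tuples from the slide are exactly the records such a lookup needs.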
Flow validation before executing it
[dataset, data characteristics, flow characteristics, failure]
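Validation could likewise be sketched as cheap pre-execution checks: compare what a flow's components require against the dataset's characteristics and flag likely failures before running anything. The characteristic and requirement names below are invented for illustration.

```python
# Toy pre-execution check: flag flows likely to fail on a dataset.
# Characteristic/requirement names are invented for illustration.
def validate_flow(data_chars, flow_chars):
    """Return a list of predicted failure reasons (empty = looks runnable)."""
    problems = []
    if data_chars.get("has_missing") and not flow_chars.get("handles_missing"):
        problems.append("dataset has missing values the flow cannot handle")
    if data_chars.get("has_nominal") and flow_chars.get("numeric_only"):
        problems.append("flow accepts only numeric features")
    return problems

data = {"has_missing": True, "has_nominal": False}
flow = {"handles_missing": False, "numeric_only": True}
print(validate_flow(data, flow))
# → ['dataset has missing values the flow cannot handle']
```

With enough [dataset, data characteristics, flow characteristics, failure] records, such hand-written rules could be replaced by a classifier trained to predict the failure field.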
A little bit further
Adapting flows while processing data streams
- Detecting changes in data characteristics
- Locally checking input/output in each flow component
- Change propagation
- Reducing cost of adaptation
Photos CC by Cristina Granados
Visit us! Data Science Institute @ Bournemouth University