SC16: Helping HPC Users Specify Job Memory Requirements via Machine Learning
Intel HPC Developer Convention Salt Lake City 2016
Machine Learning Track
Franz J. Király
Data Analytics, Machine Learning and HPC in today's changing application environment
An overview of data analytics
[Diagram: DATA and Scientific Questions enter Exploration; Statistical Questions and Statistical Methods feed Quantitative Modelling (Predictive/Inferential, Descriptive/Explanatory), supported by Statistical Programming (R, python); via the Scientific Method and Scientific and Statistical Validation, this yields (practical) Knowledge.]
Data analytics and data science in a broader context
[Diagram: the data pipeline. Raw data must first become Clean data; a lot of problems and subtleties arise at these stages already, and often most of the manpower in a „data“ project needs to go here before one can attempt reliable analysis. Data analytics (data mining, machine learning, statistics, modelling) then turns clean data into Knowledge: relevant findings and their underlying arguments need to be explained well and properly.]
Big Data?
What „Big Data“ may mean in practice
[Chart: solution strategies vs. number of data samples (ca. 100 to 10.000.000.000) and number of features (ca. 1.000 to 10.000.000). Strategies that stop working in reasonable time as size grows: manual exploratory data analysis; kernel methods and OLS; L1/LASSO (around the same order); random forests; super-linear algorithms in general; eventually even reading in all the data. Remedies: linear algorithms, including on-line models; sub-sampling; feature extraction; feature selection; large-scale strategies for super-linear algorithms; distributed computing.]
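The remedies in this chart are concrete algorithmic choices. As a minimal sketch (assuming scikit-learn; the data and chunk sizes are illustrative stand-ins), an on-line linear model can be fitted chunk by chunk, so memory use stays constant no matter how many samples stream past:

```python
# Sketch: an on-line linear model that never loads all data at once.
# SGDRegressor supports incremental fitting via partial_fit.
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=0)

rng = np.random.default_rng(0)
true_coef = np.arange(20, dtype=float)
for _ in range(100):                          # 100 chunks, e.g. read from disk
    X_chunk = rng.normal(size=(10_000, 20))   # one chunk of 10.000 samples
    y_chunk = X_chunk @ true_coef + rng.normal(size=10_000)
    model.partial_fit(X_chunk, y_chunk)       # update, then discard the chunk
```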
Large-scale motifs in data science = where high-performance computing is helpful/impactful

„Big models“ = the „classic“, beloved by everyone: not necessarily a lot of data, but computationally intensive models.
Classical example: finite elements and other numerical models.
New fancy example: large neural networks aka „deep learning“.
Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes.

„Big data“ = what it says, a lot of data (ca 1 million samples or more): the computational challenge arises from processing all of the data.
Example: histogram or linear regression with huge amounts of data.
Common HPC motif: divide/conquer training/fitting of the model, e.g. batchwise/epoch fitting.

Model validation and model selection = this talk's focus: answers the question, which model is best for your data?
Demanding even for simple models and small amounts of data!
Example: is deep learning better than logistic regression, or guessing?
Meta-modelling: stylized case studies

Customer: Hospital specializing in the treatment of patients with a certain disease.
Patients with this disease are at risk of experiencing an adverse event (e.g. death).
Scientific question: depending on patient characteristics, predict the event risk.
Data set: complete clinical records of 1.000 patients, including the event if it occurred.

Customer: Retailer who wants to accurately model the behaviour of customers.
Customers can buy (or not buy) any of a number of products, or churn.
Scientific question: predict future customer behaviour given past behaviour.
Data set: complete customer and purchase records of 100.000 customers.

Customer: Manufacturer who wishes to find the best parameter settings for machines.
Parameters influence the amount/quality of product (or whether the machine breaks).
Scientific question: find the parameter settings which optimize the above.
Data set: outcomes for 10.000 parameter settings on those machines.

Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the „real world“.
Not of interest: which algorithm/strategy, out of many, exactly solves the task.
= data-centric and data-dependent modelling
Model validation and model selection

Machine learning provides algorithms & theory for meta-modelling: a scientific necessity implied by the scientific method and the following, and powerful algorithms motivated by meta-modelling optimality.

1. There is no model that is good for all data.
(otherwise the concept of a model would be unnecessary)
2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one.
(any such belief is not empirically justified, hence pseudoscientific)
3. No model can be trusted unless its validity has been verified by a model-independent argument.
(otherwise the justification of validity is circular, hence faulty)
Machine Learning and Meta-Modelling in a Nutshell

Leitmotifs of Machine Learning, from the intersection of engineering, statistics and computer science:

Engineering & statistics idea: statistical models are objects in their own right: „learning machines“, modelling strategies (possibly non-explicit).
Computer science & statistics idea: any abstract algorithm can be a modelling strategy/learning machine: „computational learning“.
Engineering & computer science idea: the future performance of an algorithm/learning machine can (and should) be estimated: „model validation“, „model selection“.
Problem types in Machine Learning

Supervised Learning: some data is labelled by an expert/oracle.
Task: predict the label from covariates.
Statistical models are usually discriminative.
Examples: regression, classification.

Unsupervised Learning: the training data is not pre-labelled.
Task: find „structure“ or „pattern“ in the data.
Statistical models are usually generative.
Examples: clustering, dimension reduction.
Advanced learning tasks

Complications in the labelling:
Semi-supervised learning: some training data are labelled, some are not.
Anomaly detection: all or most data are „positive examples“; the task is to flag „test negatives“.
Reinforcement learning: data are not directly labelled, only indirect gain/loss observations.

Complications through correlated data and/or time:
On-line learning: the data is revealed over time; models need to update.
Forecasting: each data point has a time stamp; predict the temporal future.
Transfer learning: the data comes in dissimilar batches; train and test may be distinct.
What is a Learning Machine?

… an algorithm that solves, e.g., the previous tasks.
Examples: generalized linear model, linear regression, support vector machine, neural networks (= „deep learning“), random forests, gradient boosting, …

[Diagram: observations („training data“) enter model fitting („learning“), controlled by model tuning parameters; the fitted model takes new data and produces predictions, e.g., to base decisions on.]
Illustration: supervised learning machine

Example: Linear Regression.
[Diagram: observations („training data“) enter model fitting („learning“); the fitted model takes new data and produces predictions. Example tuning parameter: fit the intercept or not?]
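As a minimal sketch of this diagram (assuming scikit-learn; the synthetic data is illustrative), linear regression exposes exactly the fit/predict interface of a learning machine, with the intercept question as a tuning parameter:

```python
# Linear regression viewed as a "learning machine": fit on training data,
# predict on new data; fit_intercept is the tuning parameter in question.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 3))            # observations, "training data"
y_train = X_train @ [2.0, -1.0, 0.5] + 3.0     # labels with a true intercept of 3

machine = LinearRegression(fit_intercept=True)  # fit intercept or not?
machine.fit(X_train, y_train)                   # model fitting, "learning"

X_new = rng.normal(size=(5, 3))                 # new data
predictions = machine.predict(X_new)            # prediction
```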
Model validation: does the model make sense?

Predictive models need to be validated on unseen data!
Which means the part of the data used for testing has not been seen by the algorithm before!
(note: this includes the case where machine = linear regression, deep learning, etc.)
The only (general) way to test goodness of prediction is actually observing prediction!

[Diagram: „training data“ enters learning (e.g. regression, GLM, advanced methods), yielding the learnt model; the model predicts on held-out „test data“ („hold-out“, „out-of-sample“, as opposed to „in-sample“); predictions are compared against the „test labels“ („the truth“) to compare & quantify, e.g. evaluating the regression model / prediction strategy / learning machine.]
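A minimal sketch of this hold-out validation (assuming scikit-learn; the data is synthetic): the test part is split off before fitting and only used to compare predictions against the held-out labels:

```python
# Out-of-sample validation with a hold-out split: the model never sees
# the test part during fitting; goodness is measured on actual predictions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)   # sees only the training part
y_hat = model.predict(X_test)                      # out-of-sample predictions
print(mean_squared_error(y_test, y_hat))           # compare & quantify
```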
„Re-sampling“:

[Diagram: all data is split repeatedly into training data 1/2/3 and test data 1/2/3; Predictors 1, 2, 3 are fitted on each training set and evaluated on the corresponding test data, yielding errors 1, 2, 3 per split; the errors are aggregated for comparison.]
Multiple algorithms are compared on multiple data splits/sub-datasets. Types of re-sampling, how to obtain the training/test splits, and pros/cons:

k-fold cross-validation:
1. divide the data into k (almost) equal parts
2. obtain k train/test splits: each part is test data exactly once, the rest of the data is the training set
Pros/cons: often k=5; a good compromise between runtime and accuracy when k is small compared to the data size.

leave-one-out:
= [number of data points]-fold c.v.
Pros/cons: very accurate, high run-time.

repeated sub-sampling (parameters: training/test size, # of repetitions):
1. obtain a random sub-sample of training/test data of the specified sizes (train/test need not cover all data)
2. repeat 1. the desired number of times
Pros/cons: can be arbitrarily quick and can be arbitrarily inaccurate (depending on parameter choice); can be combined with k-fold.
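A minimal sketch of the three schemes above in scikit-learn terms (the class names are real; data and split sizes are illustrative):

```python
# k-fold, leave-one-out and repeated sub-sampling all yield train/test
# index splits that a benchmark can loop over.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

kfold = KFold(n_splits=5, shuffle=True, random_state=0)   # often k=5
loo = LeaveOneOut()                                       # [n data points]-fold c.v.
subsample = ShuffleSplit(n_splits=20, train_size=0.8,     # training/test sizes and
                         test_size=0.1, random_state=0)   # number of repetitions

X = np.arange(40).reshape(20, 2)
for train_idx, test_idx in kfold.split(X):
    pass  # fit on X[train_idx], evaluate on X[test_idx]; each point is test once
```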
State-of-the-art principle in model validation, model comparison and meta-modelling:
a quantitative model comparison, a „benchmarking experiment“, results in a table like this:

model | RMSE | MAE
… | 15.3 ± 1.2 | 12.3 ± 1.1
… | 9.5 ± 0.9 | 7.3 ± 0.8
… | 13.6 ± 0.7 | 11.4 ± 0.9
… | 20.1 ± 1.4 | 18.1 ± 1.7

Confidence regions (or paired tests) to compare models to each other:
A is better than B / B is better than A / A and B are equally good.
An uninformed model (stupid model/random guess) needs to be included,
otherwise the statement „is better than an uninformed guess“ cannot be made.
„useful model“ = (significantly) better than the uninformed baseline
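A minimal sketch of such a benchmarking experiment (assuming scikit-learn; models, data, and error bars are illustrative): candidate models plus an uninformed baseline are scored on the same k-fold splits, reporting mean ± standard error per model:

```python
# Small benchmarking experiment with an uninformed baseline included.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

models = {"linear regression": LinearRegression(),
          "random forest": RandomForestRegressor(random_state=0),
          "uninformed (mean)": DummyRegressor(strategy="mean")}

cv = KFold(n_splits=5, shuffle=True, random_state=0)      # same splits for all
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error")
    se = rmse.std(ddof=1) / np.sqrt(len(rmse))            # standard error
    print(f"{name}: RMSE {rmse.mean():.2f} ± {se:.2f}")
```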
Meta-model: automated parameter tuning

[Diagram: the training data is re-sampled into inner train/test splits; Parameters 1, 2, 3 are each evaluated for model goodness (e.g. 15.3 ± 1.2, 9.5 ± 0.9, 13.6 ± 0.7, 20.1 ± 1.4); the best parameters are then fitted to the whole training data.]
Important caveat: the „inner“ training/test splits need to be part of any „outer“ training set, otherwise validation is not out-of-sample!
The meta-model has „new“ tuning parameters of its own: which measure of predictive goodness, and which inner re-sampling scheme.
Methods are usually less sensitive to these „new“ tuning parameters.
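A minimal sketch of this tuning meta-model (assuming scikit-learn; the parameter grid is illustrative): GridSearchCV re-samples the training data internally, scores each parameter setting, and refits the best setting on the whole training data:

```python
# Automated parameter tuning as a meta-model wrapped around a learner.
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestRegressor

tuned_forest = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 10, None]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # inner re-sampling
    scoring="neg_root_mean_squared_error")               # measure of goodness
# tuned_forest.fit(X_train, y_train) evaluates all settings on the inner
# splits, then refits the best setting on the whole training data.
```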
Re-sampling is used to determine the best parameter setting.
For validation, new unseen data needs to be used:

[Diagram: all data is split into training data and test data; within the training data, tuning train / tuning test splits determine the model goodness per parameter setting (e.g. 15.3 ± 1.2, 9.5 ± 0.9, 13.6 ± 0.7, 20.1 ± 1.4); the model with the best parameters is fit to all of the training data, then predicts on the „real“ test data, where goodness is quantified.]

Multi-fold schemes are nested: „splits within splits“.
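A minimal sketch of these „splits within splits“ (assuming scikit-learn; data and grid are illustrative): the inner tuning splits live entirely inside each outer training set, so the outer test data stays unseen:

```python
# Nested validation: tuning happens per outer fold, on that fold's
# training part only; outer test data never touches the tuning.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = X[:, 0] + rng.normal(size=300)

inner = KFold(n_splits=3, shuffle=True, random_state=0)   # tuning train/test
outer = KFold(n_splits=5, shuffle=True, random_state=1)   # "real" test splits

tuned = GridSearchCV(RandomForestRegressor(random_state=0),
                     param_grid={"max_depth": [3, 10, None]}, cv=inner)
scores = cross_val_score(tuned, X, y, cv=outer)  # out-of-sample model goodness
```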
Meta-Strategies in ML

„Model tuning“: for a model with tuning parameters, the best tuning parameters are determined using a data-driven tuning algorithm.

„Ensemble learning“: a number of (possibly „weak“) models A, B, C, D are combined into a „strong“ ensemble model.
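A minimal sketch of the ensemble meta-strategy (assuming scikit-learn; the choice of base models is illustrative): several possibly weak base models are combined into one ensemble that averages their predictions:

```python
# Ensemble learning: base models A, B, C combined into a "strong" model.
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

ensemble = VotingRegressor([
    ("A", LinearRegression()),
    ("B", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("C", RandomForestRegressor(n_estimators=50, random_state=0)),
])
# ensemble.fit(X, y) fits all base models; ensemble.predict averages them.
```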
Object dependencies in the ML workflow

One interesting dataset (all data, N = 100-100.000 data points, „small data“) is re-sampled into multiple train/test splits. „Typical numbers“:
5-10 outer splits,
on each of which M = 5-20 strategies are compared,
most of which are parameter-tuned by the same principle: 10-10.000 parameter combinations over 3-5 nested splits,
and ensembles add further nesting: 10-1.000 base learners.
One run on N samples is usually O(N²) or O(N³).
Runtime = 10 x 10 x 5 x 1.000 (x 100) x one run on N samples.
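A back-of-envelope calculation for this runtime formula (the factors mirror the slide's „typical numbers“ and are illustrative, not measured):

```python
# How the nesting multiplies into the total number of single model fits.
outer_splits = 10            # outer re-sampling splits (5-10)
strategies = 10              # strategies compared (M = 5-20)
nested_splits = 5            # inner tuning splits (3-5)
param_combinations = 1_000   # parameter combinations (10-10.000)
base_learners = 100          # optional ensemble factor (10-1.000)

runs = outer_splits * strategies * nested_splits * param_combinations
print(runs)                  # 500000 single fits, each O(N^2) or O(N^3)
print(runs * base_learners)  # 50000000 fits once ensembles nest inside
```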
Machine Learning Toolboxes

An incomplete list of influential toolboxes:

[Table: toolboxes compared by language (R, e.g. caret; python, e.g. scikit-learn; Java with 3rd-party wrappers; multi-interface), modular API (e.g., methods), model tuning and meta-methods, model validation and comparison, and GUI. Model coverage ranges from common models to mostly kernels, or few, mostly classifiers; tuning and validation support ranges from full to some, few, or not entirely modular.]

scikit-learn is perhaps the most widely used ML toolbox.
The object-oriented ML Toolbox API

Learning machines as found in the R/mlr or scikit-learn packages.
Leading principles: encapsulation, modularization; modular structure, object orientation.

Example: linear regression as a „learning machine“ object with fit(traindata) and predict(testdata), plus metadata & model info.

Abstraction models objects with a unified API:

Concept | Public interface | in R/mlr | in sklearn
Learning machines | fitting, predicting, set parameters | Learner | estimator
Re-sampling schemes | sample, apply & get results | ResampleDesc | splitter classes in model_selection
Evaluation metrics | compute from results, tabulate | Measure | metrics classes in metrics
Meta-modelling (tuning, ensembling, pipelining) | wrapping machines by strategy | various wrappers | various wrappers, Pipeline, fused classes
Learning task | benchmark, list strategies/measures | Task | implicit, not encapsulated
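A minimal sketch of why this unified API matters (assuming scikit-learn; MeanPredictor is a hypothetical example class): any object exposing fit/predict plus parameter access plugs into the generic re-sampling and metrics machinery unchanged:

```python
# A custom "learning machine" that follows the estimator API.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score

class MeanPredictor(BaseEstimator, RegressorMixin):
    """Minimal learning machine: 'learning' memorizes the label mean."""
    def fit(self, X, y):
        self.mean_ = np.mean(y)
        return self
    def predict(self, X):
        return np.full(len(X), self.mean_)

# Because it follows the API, generic machinery works unchanged:
X = np.random.default_rng(5).normal(size=(50, 2))
y = X[:, 0]
print(cross_val_score(MeanPredictor(), X, y, scoring="neg_mean_absolute_error"))
```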
HPC for benchmarking/validation today

The nested workflow from before (typical numbers: 5-10 outer splits; M = 5-20 strategies; 10-10.000 parameter combinations over 3-5 nested splits; 10-1.000 base learners; N = 100-100.000 data points, „small data“) offers four levels of parallelism (1-4).
At one selected level (one of 1-4), the work is distributed to clusters/cores:
Scikit-learn: joblib.
mlr: parallelMap.
Plus algorithm-specific HPC interfaces, e.g. for deep learning (mutually exclusive).
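A minimal sketch of today's „pick one level“ situation (assuming scikit-learn; data and grid are illustrative): n_jobs hands exactly one loop level to joblib, here either the parameter grid or the outer splits:

```python
# Parallelism is selected at one level; nesting n_jobs=-1 at both levels
# would oversubscribe the available cores.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor

X = np.random.default_rng(6).normal(size=(200, 5))
y = X[:, 0]

tuned = GridSearchCV(RandomForestRegressor(random_state=0),
                     param_grid={"max_depth": [3, 5, None]},
                     n_jobs=-1)                     # level: parameter grid
scores = cross_val_score(tuned, X, y, n_jobs=1)     # level: outer splits
```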
HPC support tomorrow?

Layer 1: the full graph of dependencies: re-samples, algorithms, parameters, …
Layer 2: a scheduler for algorithms and meta-algorithms; data/task pipeline; DATA (e.g. Hadoop). Combining (?) MapReduce, DAAL, dask, joblib -> TBB?
Layer 3: optimized primitives, e.g. linear systems, convex optimization, stoch. gradient descent. (image source: Intel math kernel library)
Layer 4: hardware API, e.g. MKL, CUDA, BLAS, for distributed, multi-core, multi-type/heterogeneous hardware. (image source: continuum analytics)
Challenges in ML APIs and HPC

Surprisingly few resources have been invested in ML toolboxes.
The most advanced toolboxes are currently open-source & academic.

Features that would be desirable to the practitioner but are not available without mid-scale software development:

Integration of (a) data management, (b) exploration and (c) modelling.

Non-standard modelling tasks and structured data (incl. time series):
data heterogeneity, multiple datasets, time series, spatial features, images, etc.;
forecasting, on-line learning, anomaly detection, change point detection;
meta-modelling and re-sampling for these is an order of magnitude more costly;
especially challenging: integration in large-scale scenarios.

Full HPC integration on a granular level for distributed ML benchmarking:
e.g. MapReduce for divide/conquer over data, model parts, and models;
making full use of parallelism for nesting and computational redundancies;
a complete HPC architecture for the whole model benchmarking workflow.