Summer School 2014: Learning at Scale - Aarhus Universitet
Transcript of the slides.
Summer School 2014: Learning at Scale

Machine Learning is... the branch of engineering that develops technology for automated inference.
--- Cosma Shalizi

It combines:
• Algorithms
• Statistics
• Optimization
• Geometry
Flavors of Learning
Supervised Learning: (Binary) classification

Given: $\{(x_1, y_1), (x_2, y_2), \ldots\}$ drawn from some source, with $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$.

Find a function $f : \mathbb{R}^d \mapsto \{+1, -1\}$ such that $\forall i,\ f(x_i) = y_i$: $f$ captures the relationship between $x$ and $y$.

• Spam: testing if an email is spam or not
• Sentiment analysis: is a product review positive or negative?
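As a concrete illustration of this setup, here is a minimal perceptron in Python. The slides do not prescribe any particular algorithm, and the training set below is an invented, linearly separable toy example:

```python
# Minimal perceptron sketch for binary classification: learn
# f: R^d -> {+1, -1} from labeled pairs (illustrative, not from the slides).

def perceptron(data, epochs=20):
    """data: list of (x, y) with x a tuple of floats and y in {+1, -1}."""
    d = len(data[0][0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # mistake: nudge the separator toward the example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented linearly separable data: label is the sign of x1 - x2.
train = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 2.0), -1), ((1.0, 3.0), -1)]
f = perceptron(train)
```

On separable data like this, the perceptron converges to a classifier that satisfies $f(x_i) = y_i$ on every training point.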
Supervised Learning: Regression

Given: $\{(x_1, y_1), (x_2, y_2), \ldots\}$ drawn from some source, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.

Find a function $f : \mathbb{R}^d \mapsto \mathbb{R}$ such that $\forall i,\ f(x_i) = y_i$: $f$ captures the relationship between $x$ and $y$.

• Predictions: stock market price as a function of financial specs
• Relationship between dosage and effectiveness
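A minimal sketch of regression in one dimension, assuming ordinary least squares (the slides name the task, not a method); the data below is an invented noise-free line:

```python
# Least-squares fit of a line y = a*x + b (a hedged sketch; ordinary least
# squares is one choice of regression method, not mandated by the slides).

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Closed-form least-squares slope: covariance over variance.
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

# Invented noise-free data lying exactly on y = 2x + 1.
f = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With noise-free data the fit recovers the generating line exactly; with noisy data it returns the best-fitting line under square loss.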
Unsupervised Learning: Clustering

Given a collection of objects, find a way to "group" them into similar pieces.

Learning a function $f : \mathbb{R}^d \mapsto \{1, 2, \ldots, k\}$

But we don't have any examples of the "correct" answer! Clustering is closely related to classification with multiple classes.
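One standard way to learn such a grouping is Lloyd's k-means algorithm. This sketch, with invented one-dimensional data and hand-picked initial centers, is an illustration of the task, not something taken from the slides:

```python
# Lloyd's k-means in 1-D (one of many clustering algorithms; the slides
# name the task, not this method).

def kmeans(points, centers, iters=10):
    """points: list of floats; centers: initial guesses; returns final centers."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's group.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            groups[j].append(p)
        # Update step: each center moves to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Two invented clusters around 1.0 and 5.0.
pts = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
c = kmeans(pts, centers=[0.0, 6.0])
```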
Unsupervised Learning: Dimensionality Reduction (or Feature Learning)

Given objects in $\mathbb{R}^d$, find a mapping $A : \mathbb{R}^d \mapsto \mathbb{R}^k$, $k \ll d$, that preserves the "structure" of the objects.

• Find "relevant" dimensions for a task
• Reduce dimensionality to manage complexity of algorithms
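A simple example of such a map $A$ is a random projection, which approximately preserves pairwise distances; this sketch assumes Gaussian random directions and is only one of many possible choices (PCA is another):

```python
# Random projection from R^d to R^k (an illustrative choice of mapping A;
# the slides leave the method open).
import random

def random_projection(d, k, seed=0):
    rng = random.Random(seed)
    # Each row of R is a random direction; A(x) = R x / sqrt(k).
    R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    scale = 1.0 / k ** 0.5
    return lambda x: [scale * sum(r * xi for r, xi in zip(row, x)) for row in R]

A = random_projection(d=100, k=10)
z = A([1.0] * 100)  # a 100-dimensional point mapped to 10 dimensions
```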
Mixing and Matching

Semi-supervised learning: labeled and unlabeled data.

Find a classifier that separates the labeled points and also separates the unlabeled points "well".

Often we have lots of unlabeled data and only a little labeled data to guide our efforts.
Supervised clustering = multiclass classification
Supervised dimensionality reduction = (linear) discriminant analysis
Understanding vs Predicting

$\{(x_1, y_1), (x_2, y_2), \ldots\}$ is drawn from a distribution $p(X, Y)$.

Generative learning: learn the joint distribution $p(X, Y)$.
"What controls the rise and fall of the tides?"

Discriminative learning: learn the conditional distribution $p(Y \mid X) = \frac{p(X, Y)}{p(X)}$.
"Will there be a high tide tomorrow evening?"
Discriminative clustering: predict the cluster of a new point, e.g. $p(\text{cluster} \mid \text{point}) = \exp(-\|\text{point} - \text{center}\|^2)$ (the cluster and the point appear pictorially on the slide).

Generative clustering: mixture density estimation.
Parameters

Parametric learning:
• Define a space of models parametrized by a fixed number of parameters
• Find the model that best fits the data (by searching over parameters)

Parametric binary classification:
• Model: $p(x) \propto \exp(-(x - \mu)^\top S (x - \mu))$ with parameters $(\mu_1, S_1, \mu_2, S_2)$
• Maximize the likelihood of any model
• $d^2 + 2d$ parameters
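A one-dimensional version of the parametric recipe, sketched in Python: fit a Gaussian's fixed parameter set (mean, variance) by maximum likelihood. The Gaussian model and the toy data are illustrative assumptions:

```python
# Maximum-likelihood fit of a 1-D Gaussian: the model space is parametrized
# by exactly two numbers (mu, var), found from the data.
import math

def fit_gaussian(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # MLE divides by n
    return mu, var

def log_density(x, mu, var):
    # Log of the Gaussian density, usable as a per-class score.
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

mu, var = fit_gaussian([1.0, 2.0, 3.0])
```

For binary classification one would fit such a model per class and predict the class whose density is higher at $x$.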
Non-parametric learning:
• Define a space of models that can grow in size with the data
• Find the model that best fits the data
• "Non-parametric" means "not fixed", not "none"!

4 "support points" define the resulting classifier (shown on the slide).
Bayesian Learning

Non-Bayesian (parametric) learning:
model space $\{\Theta\}$ + data $\{(x_1, y_1), (x_2, y_2), \ldots\}$ → Learner → $\Theta^*$

Bayesian learning:
prior $p(\Theta)$ (belief about the world) + data $\{(x_1, y_1), (x_2, y_2), \ldots\}$ → Learner → posterior $\hat{p}(\Theta)$ (belief about the world)

$\Theta^*$ is a point estimate. $\hat{p}(\Theta)$ is a distribution over possible worlds.
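The point-estimate-versus-posterior contrast can be made concrete with coin flips, assuming a Bernoulli model with a conjugate Beta prior. The example and its numbers are ours, not from the slides:

```python
# Non-Bayesian vs Bayesian learning for a coin's bias Theta.

def mle(heads, tails):
    # Non-Bayesian learner: a single point estimate Theta*.
    return heads / (heads + tails)

def posterior(heads, tails, a=1.0, b=1.0):
    # Bayesian learner: the Beta(a, b) prior is conjugate to the Bernoulli
    # likelihood, so the posterior is Beta(a + heads, b + tails) -- a whole
    # distribution over possible Theta, returned here by its parameters.
    return a + heads, b + tails

theta_star = mle(7, 3)
a, b = posterior(7, 3)          # uniform Beta(1, 1) prior
post_mean = a / (a + b)         # posterior mean, pulled toward the prior
```

Note the posterior mean (8/12) differs from the point estimate (0.7): the prior still has a say after ten flips.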
You know you're talking to a Bayesian if you hear: "What's your prior?", "conjugate priors", "maximum a posteriori (MAP)".
Many Learning Frameworks

• Online learning: must make a prediction as soon as an item arrives
• Active learning: I can get labels for data, but it's expensive
• Multi-task learning: I'm learning different tasks, but they're related, so maybe the tasks can learn from each other
• Transfer learning: I can learn well in one domain: can I transfer this knowledge into a different domain?
• Ensemble learning: I have bad learners, but together they're decent
The Mechanics of Learning
Loss Functions

Given: $\{(x_1, y_1), (x_2, y_2), \ldots\}$ drawn from some source, with $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$.
Find a function $f : \mathbb{R}^d \mapsto \{+1, -1\}$ such that $f$ captures the relationship between $x$ and $y$.

Loss functions measure the quality of $f$: $L(f(x), y)$

0-1 loss: $\mathbf{1}_{f(x) \neq y}$
Hinge loss: $\max(0, 1 - y \cdot f(x))$
Square loss: $(y - f(x))^2$
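The three losses transcribe directly into code:

```python
# The 0-1, hinge, and square losses; fx is the predictor's output f(x)
# and y is the true label.

def zero_one_loss(fx, y):
    return 1 if fx != y else 0          # counts mistakes

def hinge_loss(fx, y):
    return max(0.0, 1.0 - y * fx)       # penalizes low-margin answers too

def square_loss(fx, y):
    return (y - fx) ** 2                # penalizes squared deviation
```

Note the hinge and square losses stay meaningful for real-valued $f(x)$ (margins), which is why they are the ones optimizers actually use: the 0-1 loss is flat almost everywhere and hard to minimize directly.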
Estimating Risk

Once we have a loss function, we can quantify how good a predictor is:

$R(f) = \mathbb{E}_{x,y}[L(f(x), y)] = \int p(x, y)\, L(f(x), y)$

and find a good predictor:

$f^* = \arg\min_{f \in F} R(f)$

But we don't usually know what the data distribution is, so we can't solve the minimization!
Empirical Risk Minimization

Assume the given data is drawn from the source distribution. Replace

$R(f) = \mathbb{E}_{x,y}[L(f(x), y)] = \int p(x, y)\, L(f(x), y)$

by the empirical mean:

$\hat{R}(f) = \frac{1}{n} \sum_i L(f(x_i), y_i)$

with the hope that the estimate is unbiased and converges:

$\mathbb{E}[\hat{R}(f)] = R(f), \qquad \hat{R}(f) \to R(f)$

But now we have a "normal" optimization:

$\min_{f \in F} \frac{1}{n} \sum_i L(f(x_i), y_i)$
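The empirical risk is just a sample average of losses. A minimal sketch, where the fixed predictor, the hinge loss, and the three data points are all illustrative choices of ours:

```python
# Empirical risk: average the loss over the observed sample.

def empirical_risk(f, data, loss):
    return sum(loss(f(x), y) for x, y in data) / len(data)

hinge = lambda fx, y: max(0.0, 1.0 - y * fx)
f = lambda x: 1.0 if x > 0 else -1.0     # a fixed toy sign predictor
data = [(2.0, 1), (-1.0, -1), (0.5, -1)] # invented sample; last point is misclassified
risk = empirical_risk(f, data, hinge)
```

ERM then searches over a family of such predictors for the one with the smallest `risk`.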
Overfitting and Regularization

The problem with optimizing over the data is that you can over-fit to your samples (low bias).

This is bad because then your predictive power goes down (high variance) and you can't generalize.

Complex models (with more parameters) can overfit. Penalize them!

$\min_{f \in F} \frac{1}{n} \sum_i L(f(x_i), y_i) + c(f)$

where $c(f)$ is a model complexity term. This is called regularization.
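One concrete choice of complexity term is an L2 penalty on the weights of a linear model (ridge-style regularization; the slides leave $c(f)$ abstract, and the data here is invented):

```python
# Regularized empirical risk for a linear predictor f(x) = <w, x>:
# average square loss plus c(f) = lam * ||w||^2.

def regularized_risk(w, data, lam):
    def predict(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    n = len(data)
    fit = sum((y - predict(x)) ** 2 for x, y in data) / n   # data-fit term
    penalty = lam * sum(wi * wi for wi in w)                # complexity term
    return fit + penalty

data = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0)]
r = regularized_risk((1.0, -1.0), data, lam=0.1)
```

Here the weights fit the data perfectly, so the whole objective is the penalty term; larger `lam` trades data fit for simpler models.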
Generalization

How many samples of the data do you need for the empirically optimized answer to get close to the true answer?

If your function space is "well behaved" (not too wiggly), then you don't need too many samples.

Well behaved:
• VC dimension is small
• Rademacher complexity is small
• Fat-shattering dimension is small
• ... and others

All of this assumes that you sample from the real distribution...
Overview... so far...

1) Choose a learning task (classification, clustering, regression, ...)
2) Pick a convenient loss function
3) Sample a sufficient number of points from a source
4) Build an optimization using the data, the loss function, and any regularizers
5) OPTIMIZE!!!
6) Use the learned model on new data to predict
7) (If you're doing online learning, repeat)

ML = design choices + careful optimization

Problem classes: linear, least squares, semidefinite, convex, submodular, ...
Methods: gradient search, coordinate descent, interior point methods, Newton's method
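Gradient search, the first method listed, can be sketched in a few lines; the objective, starting point, and step size here are illustrative choices:

```python
# Plain gradient descent: repeatedly step against the gradient.

def gradient_descent(grad, x0, step=0.1, iters=100):
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

# Minimize the convex function (x - 3)^2, whose gradient is 2(x - 3);
# the minimizer is x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=0.0)
```

For a convex objective and a small enough step size, the iterates converge to the minimizer; the same loop, with $x$ a weight vector, is what optimizes regularized empirical risk in practice.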
Representations

The Computational Geometry prayer:

Let P be a set of points in the plane. Amen.
Let G be a graph with n vertices and m edges.
Let S be a set of elements drawn from a universe U.
Let M be an m × n matrix of reals.

But in learning, we don't have a "natural" representation. We have to CHOOSE one.
Representations help algorithms

Learning a circle separating classes can be tricky.

If we change the representation: $\ell : (x, y) \mapsto (x, y, x^2 + y^2)$

Circle separation becomes linear separation!
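The lifting map transcribes directly: after lifting, membership in a circle of radius $r$ reduces to a linear threshold on the third coordinate (the threshold `r_squared` below is an illustrative parameter):

```python
# The lifting map l: (x, y) -> (x, y, x^2 + y^2). A circle x^2 + y^2 = r^2
# in the plane becomes the hyperplane z = r^2 in the lifted space, so the
# inside/outside test is linear in the new representation.

def lift(x, y):
    return (x, y, x * x + y * y)

def linear_separator(point3d, r_squared=1.0):
    # +1 outside the circle of radius sqrt(r_squared), -1 inside:
    # a threshold on a single lifted coordinate.
    return 1 if point3d[2] > r_squared else -1

inside = linear_separator(lift(0.5, 0.0))   # point inside the unit circle
outside = linear_separator(lift(2.0, 0.0))  # point outside the unit circle
```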
Constructing a representation

Supervision guides the representation (shown on the slide).
Unsupervised spectral representation.
And many other kinds...
A new overview of learning

Build a model of data
Choose a learning task
Shape your learner (loss, regularizer)
Choose samples from a source
Optimize!
Predict!

Kernels
(Submodular) optimization
Provably efficient algorithms
Doing all of this at scale