Optimization for Machine Learning
Introduction to Supervised Learning
Lecturer: Robert M. Gower
Master IASD: AI Systems and Data Science, 2019
Core Info
● Where: ENS: 07/11 amphi Langevin, 03/12 U209, 05/12 amphi Langevin.
● Online: Teaching materials for these 3 classes: https://gowerrobert.github.io/
● Google Doc with course info: also available at https://gowerrobert.github.io/
Outline of my three classes
● 07/11/19 Foundations and the empirical risk problem, revision of probability, SGD (Stochastic Gradient Descent) for ridge regression
● 03/12/19 SGD for convex optimization: theory and variants
● 05/12/19 Lab on SGD and variants
Detailed Outline today
● 13:30 – 14:00: Introduction to empirical risk minimization, classification, and SGD
● 14:00 – 15:00: Revision on probability
● 15:00 – 15:30: Tea time! Break
● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression
An Introduction to Supervised Learning
References for today's class
Convex Optimization, Stephen Boyd and Lieven Vandenberghe, pages 67 to 79
Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David, chapter 2
Is There a Cat in the Photo?
[Slide series: example photos, each labeled Yes or No]
Find a mapping h that assigns the “correct” target to each input.
x: input/feature, y: output/target
Labeled Data: The training set
[Slide build: the labeled training set is fed into a learning algorithm, which outputs a hypothesis that assigns a label, e.g. -1, to new inputs]
y = -1 means no/false
Example: Linear Regression for Height
Labelled data (Male = 0, Female = 1):
Sex | Age | Height
  0 |  30 | 1.72 m
  1 |  70 | 1.52 m
Example Hypothesis: Linear Model
Example Training Problem
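The formulas on these slides did not survive extraction; a standard form for this example, assuming the slides' notation, is a linear model trained by least squares:

$$h_w(\text{sex}, \text{age}) = w_0 + w_1\,\text{sex} + w_2\,\text{age}, \qquad \min_{w} \frac{1}{n}\sum_{i=1}^{n} \big(h_w(x_i) - y_i\big)^2.$$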
Linear Regression for Height
The Training Algorithm
[Figure: height vs. age for Sex = 0, with the fitted line]
Other options aside from linear?
Parametrizing the Hypothesis
[Figures: height vs. age under different fits]
Linear:
Polynomial:
Neural Net:
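The three parametrized forms were images on the slide; standard versions (the exact notation is an assumption) are:

$$\text{Linear: } h_w(x) = \langle x, w \rangle, \qquad \text{Polynomial: } h_w(x) = \sum_{k=0}^{d} w_k x^k, \qquad \text{Neural net: } h_{W_1,W_2}(x) = \sigma\big(W_2\,\sigma(W_1 x)\big),$$

where $\sigma$ is a nonlinear activation.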
Loss Functions
Why a squared loss?
The Training Problem
Typically a convex function
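The training problem itself was an image; a standard statement, assuming the notation above, is

$$\min_w \frac{1}{n}\sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big), \qquad \text{e.g. the squared loss } \ell(h, y) = (h - y)^2,$$

where the loss $\ell$ is typically a convex function of its first argument.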
Choosing the Loss Function
Quadratic loss, binary loss, hinge loss (y = 1 in all figures).
EXE: Plot the binary and hinge loss functions when y = 1.
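A minimal sketch of this plotting exercise, assuming the usual definitions — quadratic $(h-y)^2$, binary $\mathbf{1}\{\operatorname{sign}(h) \neq y\}$, hinge $\max(0,\, 1 - yh)$:

```python
import numpy as np
import matplotlib.pyplot as plt

# Losses as functions of the prediction h, for a fixed target y = 1.
h = np.linspace(-2, 2, 400)
y = 1

quadratic = (h - y) ** 2                  # quadratic (squared) loss
binary = (np.sign(h) != y).astype(float)  # binary (0-1) loss
hinge = np.maximum(0.0, 1.0 - y * h)      # hinge loss

plt.plot(h, quadratic, label="quadratic")
plt.plot(h, binary, label="binary (0-1)")
plt.plot(h, hinge, label="hinge")
plt.xlabel("prediction h")
plt.ylabel("loss(h, y = 1)")
plt.legend()
plt.show()
```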
Loss Functions: The Training Problem
Is a notion of loss enough?
What happens when we do not have enough data?
Overfitting and Model Complexity
[Figures: fitting 1st, 2nd, 3rd, and 9th order polynomials to the same data]
Regularizer Functions
General Training Problem: Regularization
● Goodness of fit, fidelity term, etc.
● Penalizes complexity
● λ controls the tradeoff between fit and complexity
Exe:
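The general training problem was an image on the slide; a standard form matching the three annotations above (exact constants are an assumption) is

$$\min_w \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big)}_{\text{goodness of fit}} \;+\; \lambda\, \underbrace{R(w)}_{\text{penalizes complexity}},$$

where $\lambda \ge 0$ controls the tradeoff between fit and complexity.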
Overfitting and Model Complexity
[Figure: fitting a kth order polynomial]
For λ big enough, the solution is a 2nd order polynomial.
Exe: Ridge Regression
Ridge regression = L2 loss + L2 regularizer + linear hypothesis
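The ridge regression problem was lost in extraction; its standard form with these three ingredients (the 1/2 factors are an assumption) is

$$\min_w \frac{1}{2n}\sum_{i=1}^{n} \big(\langle x_i, w \rangle - y_i\big)^2 + \frac{\lambda}{2}\|w\|_2^2.$$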
Exe: Support Vector Machines
SVM with soft margin = hinge loss + L2 regularizer + linear hypothesis
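A standard form of the soft-margin SVM training problem with these ingredients (constants are an assumption) is

$$\min_w \frac{1}{n}\sum_{i=1}^{n} \max\big(0,\, 1 - y_i \langle x_i, w \rangle\big) + \frac{\lambda}{2}\|w\|_2^2.$$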
Exe: Logistic Regression
Logistic regression = logistic loss + L2 regularizer + linear hypothesis
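A standard form of the regularized logistic regression problem (constants are an assumption) is

$$\min_w \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + e^{-y_i \langle x_i, w \rangle}\big) + \frac{\lambda}{2}\|w\|_2^2.$$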
The Machine Learner's Job
A Datum Function
Finite Sum Training Problem: re-writing the training problem as a sum of terms.
Can we use this sum structure?
The Training Problem
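The finite-sum rewrite was an image; the standard form (whether the regularizer is folded into each term is an assumption) is

$$\min_w f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad f_i(w) = \ell\big(h_w(x_i), y_i\big) + \lambda R(w),$$

so each datum function $f_i$ depends on only the $i$-th data point.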
Stochastic Gradient Descent
Is it possible to design a method that uses only the gradient of a single data function at each iteration?
Unbiased estimate: let j be an index sampled uniformly at random from {1, …, n}. Then

$$\mathbb{E}_j\big[\nabla f_j(w)\big] = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w).$$

EXE:
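The slide's answer to the question above is the SGD update; in this notation (a constant step size α is an assumption) it reads

$$w^{t+1} = w^t - \alpha\, \nabla f_{j_t}(w^t), \qquad j_t \text{ sampled uniformly from } \{1, \dots, n\}.$$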
Stochastic Gradient Descent
[Figure: SGD iterates and the optimal point]
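A minimal runnable sketch of SGD applied to ridge regression, matching the update above; the function name, synthetic data, and constants are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def sgd_ridge(X, y, lam=0.1, alpha=0.1, n_iters=2000, seed=0):
    """SGD for ridge regression:
        min_w (1/2n) sum_i (<x_i, w> - y_i)^2 + (lam/2) ||w||^2,
    using one datum function per step:
        f_j(w) = (1/2)(<x_j, w> - y_j)^2 + (lam/2) ||w||^2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        j = rng.integers(n)                          # j uniform on {0, ..., n-1}
        grad_j = (X[j] @ w - y[j]) * X[j] + lam * w  # stochastic gradient of f_j
        w -= alpha * grad_j                          # SGD step: w <- w - alpha * grad_j
    return w

# Illustrative synthetic data (an assumption, not the lecture's dataset)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)
print(sgd_ridge(X, y))
```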
Detailed Outline today
● 13:30 – 14:00: Introduction to empirical risk minimization and classification and SGD
● 14:00 – 15:00 Revision on probability● 15:00 – 15:30: Tea Time! Break● 15:30 – 17:00: Exercises and proof of convergence of
SGD for ridge regression
64
Assumptions for Convergence
Strong Convexity
Expected Bounded Stochastic Gradients
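The formal statements were images; standard forms of the two assumptions (exact constants are an assumption) are

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y \qquad (\mu\text{-strong convexity}),$$

$$\mathbb{E}_j\big[\|\nabla f_j(w)\|^2\big] \le B^2 \quad \text{for all } w \qquad (\text{expected bounded stochastic gradients}).$$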
Complexity / Convergence
Theorem
EXE: Do the exercises on convergence of random sequences.
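The theorem statement itself was lost in extraction; a typical convergence result for SGD under the two assumptions above (this reconstruction is an assumption; the slides' constants may differ) is: for a constant step size $0 < \alpha \le \tfrac{1}{2\mu}$,

$$\mathbb{E}\big[\|w^t - w^*\|^2\big] \le (1 - 2\alpha\mu)^t\, \|w^0 - w^*\|^2 + \frac{\alpha B^2}{2\mu}.$$

A larger step size gives a faster linear rate but a larger neighborhood of the optimum, which is what the step-size figures below illustrate.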
Stochastic Gradient Descent
[Figures: SGD runs with step sizes α = 0.01, 0.1, 0.2, 0.5]