Page 1

Optimization for Machine Learning

Introduction to supervised learning

Lecturer: Robert M. Gower

Master IASD: AI Systems and Data Science, 2019

Page 2

Core Info

● Where: ENS. 07/11 in amphi Langevin, 03/12 in U209, 05/12 in amphi Langevin.

● Online: Teaching materials for these 3 classes: https://gowerrobert.github.io/

● Google Doc with course info: can also be found at https://gowerrobert.github.io/

Page 3

Outline of my three classes

● 07/11/19 Foundations and the empirical risk problem, revision probability, SGD (Stochastic Gradient Descent) for ridge regression

● 03/12/19 SGD for convex optimization: theory and variants

● 05/12/19 Lab on SGD and variants

Page 4

Detailed Outline today

● 13:30 – 14:00: Introduction to empirical risk minimization, classification, and SGD

● 14:00 – 15:00: Revision of probability

● 15:00 – 15:30: Tea time! Break

● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression

Page 5

An Introduction to Supervised Learning

Page 6

References for today's classes

Convex Optimization, Stephen Boyd and Lieven Vandenberghe, pages 67 to 79.

Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David, Chapter 2.

Pages 7-11

Is There a Cat in the Photo?

[Each slide shows an example photo, labeled Yes or No.]

Page 12

Is There a Cat in the Photo?

Find a mapping h that assigns the "correct" target to each input.

x: input/feature (the photo)    y: output/target (Yes or No)
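In symbols, the mapping being sought can be written as follows (standard notation; the slides' own symbols did not survive extraction):

    h : \mathcal{X} \to \mathcal{Y}, \qquad h(x) \approx y .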

Pages 13-17

Labeled Data: The training set

The labeled training set is fed into a learning algorithm, which outputs a hypothesis h. Here y = -1 means no/false (and y = +1 means yes/true).
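A minimal formalization of the training set, using notation assumed here rather than taken from the slides:

    D = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, +1\} .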

Page 18

Example: Linear Regression for Height

Labelled data (Male = 0, Female = 1):

    Sex   Age   Height
    0     30    1.72 m
    1     70    1.52 m

Page 19

Example Hypothesis: Linear Model
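The model equation on this slide was lost in extraction; a standard linear hypothesis consistent with the example, with weights w and bias b as assumed notation:

    h_{w,b}(x) = \langle w, x \rangle + b = w_1 \cdot \mathrm{sex} + w_2 \cdot \mathrm{age} + b .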

Page 20

Example Training Problem:
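The training problem itself was also lost; with a squared loss over the n labelled examples it would read (my reconstruction, not the slide's exact formula):

    \min_{w, b} \; \frac{1}{n} \sum_{i=1}^{n} \big( h_{w,b}(x_i) - y_i \big)^2 .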

Pages 21-23

Linear Regression for Height

The Training Algorithm

[Plots: Height vs. Age for Sex = 0, showing the data and the fitted line.]

Other options aside from linear?

Page 24

Parametrizing the Hypothesis

[Plots: Height vs. Age under each parametrization.]

Linear:

Polynomial:

Neural Net:
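The formulas after each colon were lost with the figures; a Python sketch of three standard parametrizations of h, under my own naming (none of these identifiers come from the slides):

    import numpy as np

    def h_linear(x, w, b):
        # h(x) = <w, x> + b
        return x @ w + b

    def h_polynomial(x, coeffs):
        # 1-D polynomial: h(x) = sum_k coeffs[k] * x**k
        return sum(c * x**k for k, c in enumerate(coeffs))

    def h_neural_net(x, W1, b1, w2, b2):
        # One hidden layer with a ReLU nonlinearity.
        hidden = np.maximum(0.0, x @ W1 + b1)
        return hidden @ w2 + b2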

Page 25

Loss Functions

Why a squared loss?

Pages 26-27

Loss Functions: The Training Problem

Why a squared loss?

The loss is typically a convex function.
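Written out, the generic training problem this build refers to is the empirical risk (my notation; ℓ denotes the loss):

    \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( h_w(x_i),\, y_i \big) .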

Pages 28-30

Choosing the Loss Function

Quadratic Loss

Binary Loss

Hinge Loss

y = 1 in all figures.

EXE: Plot the binary and hinge loss functions when y = 1.
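The definitions were lost with the figures; standard forms, writing t = h(x) for the prediction and y ∈ {-1, +1} (my notation):

    \ell_{\mathrm{quad}}(t, y) = (t - y)^2, \qquad
    \ell_{\mathrm{binary}}(t, y) = \mathbf{1}\big[\operatorname{sign}(t) \ne y\big], \qquad
    \ell_{\mathrm{hinge}}(t, y) = \max(0,\, 1 - y t) .

A short Python sketch of the exercise (matplotlib is my choice of tool, not the slides'):

    import numpy as np
    import matplotlib.pyplot as plt

    t = np.linspace(-2, 2, 400)                 # prediction t = h(x)
    y = 1.0                                     # label fixed to 1, as in the figures
    binary = (np.sign(t) != y).astype(float)    # 0-1 loss
    hinge = np.maximum(0.0, 1.0 - y * t)        # hinge loss

    plt.plot(t, binary, label="binary (0-1) loss")
    plt.plot(t, hinge, label="hinge loss")
    plt.xlabel("t = h(x)")
    plt.ylabel("loss")
    plt.legend()
    plt.show()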

Pages 31-32

Loss Functions: The Training Problem

Is a notion of loss enough? What happens when we do not have enough data?

Pages 33-36

Overfitting and Model Complexity

[Figures: fitting 1st, 2nd, 3rd, and 9th order polynomials to the same data.]

Pages 37-41

Regularizer Functions

General Training Problem: Regularization

● Loss term: goodness of fit, fidelity term, etc.

● Regularizer: penalizes complexity.

● λ: controls the tradeoff between fit and complexity.

Exe:
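The general training problem these annotations point at, in the regularized form used for ridge regression below (my reconstruction; R is the regularizer and λ > 0 the tradeoff parameter):

    \min_{w} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell\big( h_w(x_i),\, y_i \big)}_{\text{goodness of fit}}
    \; + \; \lambda \underbrace{R(w)}_{\text{penalizes complexity}} .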

Pages 42-43

Overfitting and Model Complexity

[Figure: fitting a kth order polynomial with regularization.]

For λ big enough, the solution is a 2nd order polynomial.

Page 44

Exe: Ridge Regression

Ridge Regression: linear hypothesis, L2 loss, L2 regularizer.
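Under those three choices the problem reads (my reconstruction; the λ/2 scaling is a common convention, not necessarily the slide's):

    \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \big( \langle w, x_i \rangle - y_i \big)^2 + \frac{\lambda}{2} \|w\|_2^2 .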

Page 45

Exe: Support Vector Machines

SVM with soft margin: linear hypothesis, hinge loss, L2 regularizer.
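Spelled out under the same conventions (again my reconstruction):

    \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i \langle w, x_i \rangle\big) + \frac{\lambda}{2} \|w\|_2^2 .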

Page 46

Exe: Logistic Regression

Logistic Regression: linear hypothesis, logistic loss, L2 regularizer.
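And with the logistic loss (my reconstruction):

    \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\big( 1 + e^{-y_i \langle w, x_i \rangle} \big) + \frac{\lambda}{2} \|w\|_2^2 .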

Pages 47-52

The Machine Learner's Job

Page 53

Re-writing as a Sum of Terms

A Datum Function

The Finite-Sum Training Problem

Can we use this sum structure?
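The formulas were lost; in the notation assumed above, define a datum function f_i per example so the regularized problem becomes a finite sum:

    f_i(w) = \ell\big( h_w(x_i),\, y_i \big) + \lambda R(w), \qquad
    \min_{w} \; f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w) .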

Pages 54-55

The Training Problem

Page 56

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single datum function at each iteration?

Pages 57-59

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single datum function at each iteration?

Unbiased Estimate: Let j be an index sampled uniformly at random from {1, …, n}. Then

    \mathbb{E}_j\big[ \nabla f_j(w) \big] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w) .

EXE:
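A minimal SGD sketch in Python for the ridge-regression instance above; all names (sgd_ridge, X, y, lam, alpha, iters) are my own choices, and the datum function assumed is f_i(w) = (⟨w, x_i⟩ - y_i)² + (λ/2)‖w‖²:

    import numpy as np

    def sgd_ridge(X, y, lam=0.1, alpha=0.01, iters=1000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            j = rng.integers(n)                       # sample j uniformly from {0, ..., n-1}
            residual = X[j] @ w - y[j]
            grad_j = 2.0 * residual * X[j] + lam * w  # gradient of the single datum function f_j
            w -= alpha * grad_j                       # SGD step with fixed stepsize alpha
        return w

    # Tiny usage example on synthetic data:
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=100)
    w_hat = sgd_ridge(X, y)
    print(w_hat)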

Pages 60-61

Stochastic Gradient Descent

[Figure: SGD iterates moving towards the optimal point.]

Page 62

Detailed Outline today

● 13:30 – 14:00: Introduction to empirical risk minimization, classification, and SGD

● 14:00 – 15:00: Revision of probability

● 15:00 – 15:30: Tea time! Break

● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression

Pages 63-66

Assumptions for Convergence

Strong Convexity

Expected Bounded Stochastic Gradients
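The formal statements were lost; standard forms of the two assumptions, with μ > 0 and B² as assumed constants (my notation):

    \text{(strong convexity)} \quad
    f(z) \ge f(w) + \langle \nabla f(w),\, z - w \rangle + \frac{\mu}{2} \|z - w\|_2^2
    \quad \text{for all } w, z ;

    \text{(expected bounded stochastic gradients)} \quad
    \mathbb{E}_j\big[ \|\nabla f_j(w)\|_2^2 \big] \le B^2 .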

Pages 67-69

Complexity / Convergence

Theorem

EXE: Do the exercises on convergence of random sequences.
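The theorem's display was lost; a standard statement that follows from the two assumptions above, for SGD with a fixed stepsize 0 < α < 1/(2μ) (my reconstruction, so the constants may differ from the slide):

    \mathbb{E}\big[ \|w^t - w^*\|_2^2 \big]
    \le (1 - 2\alpha\mu)^t \, \|w^0 - w^*\|_2^2 + \frac{\alpha B^2}{2\mu} ,

i.e. the iterates contract linearly towards a neighborhood of the minimizer w^* whose radius shrinks with the stepsize α.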

Pages 70-73

Stochastic Gradient Descent

[Figures: SGD runs with stepsizes α = 0.01, 0.1, 0.2, and 0.5.]
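These runs can be mimicked with the sketch from the SGD section, assuming the same synthetic data; this is a usage note, not the slides' experiment:

    # Reusing sgd_ridge, X, y, and w_true from the earlier sketch:
    for alpha in [0.01, 0.1, 0.2, 0.5]:
        w_hat = sgd_ridge(X, y, alpha=alpha)
        print(alpha, np.linalg.norm(w_hat - w_true))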