
Optimization for Machine Learning

Introduction to supervised learning

Lecturer: Robert M. Gower

Master IASD: AI Systems and Data Science, 2019

Core Info

● Where: ENS: 07/11 amphi Langevin, 03/12 U209, 05/12 amphi Langevin.

● Online: Teaching materials for these 3 classes: https://gowerrobert.github.io/

● Google Doc with course info: can also be found at https://gowerrobert.github.io/

Outline of my three classes

● 07/11/19 Foundations and the empirical risk problem, revision probability, SGD (Stochastic Gradient Descent) for ridge regression

● 03/12/19 SGD for convex optimization. Theory and variants

● 05/12/19 Lab on SGD and variants

Detailed Outline today

● 13:30 – 14:00: Introduction to empirical risk minimization, classification, and SGD

● 14:00 – 15:00: Revision on probability

● 15:00 – 15:30: Tea Time! Break

● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression

An Introduction to Supervised Learning

References for today's classes

● Convex Optimization, Stephen Boyd and Lieven Vandenberghe, pages 67 to 79

● Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David, Chapter 2

Is There a Cat in the Photo?

Yes / No

Find a mapping h that assigns the “correct” target to each input.

x: Input/Feature, y: Output/Target

Labeled Data: The training set

Learning Algorithm

y = -1 means no/false

Example: Linear Regression for Height

Labelled data (Male = 0, Female = 1):

Sex | Age | Height
  0 |  30 | 1.72 m
  1 |  70 | 1.52 m

Example Hypothesis: Linear Model

Example Training Problem:
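The equations on this slide are not part of the transcript; a plausible reconstruction of the linear hypothesis and the least-squares training problem (notation assumed, not taken verbatim from the slide) is:

\[
h_w(x) = \langle w, x \rangle,
\qquad
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \big( h_w(x_i) - y_i \big)^2 ,
\]

where each x_i collects the features (sex, age) and y_i is the measured height.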

Linear Regression for Height

The Training Algorithm

[Figure: Height vs. Age for Sex = 0, with the fitted line produced by the training algorithm]

Other options aside from linear?

Parametrizing the Hypothesis

[Figure: Height vs. Age with linear, polynomial, and neural-net fits]

Linear:

Polynomial:

Neural Net:
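The formulas that accompany each label are missing from the transcript; one common way to write them for a scalar input x (my notation, not necessarily the slide's) is:

\[
\text{Linear: } h_w(x) = w_0 + w_1 x,
\qquad
\text{Polynomial: } h_w(x) = \sum_{k=0}^{K} w_k x^k,
\qquad
\text{Neural net: } h_w(x) = W_2 \, \sigma(W_1 x + b_1) + b_2 ,
\]

with σ an elementwise nonlinearity such as the sigmoid or ReLU.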

Loss Functions

Why a Squared Loss?

The Training Problem

Typically a convex function

Choosing the Loss Function

● Quadratic Loss

● Binary Loss

● Hinge Loss

(y = 1 in all figures)

EXE: Plot the binary and hinge loss functions.
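The loss formulas themselves are not in the transcript; as a rough Python sketch (my own notation, not the slides'), for a prediction h = h_w(x) and a label y in {-1, 1} the three losses could be written as:

def quadratic_loss(h, y):
    # squared difference between prediction and label
    return 0.5 * (h - y) ** 2

def binary_loss(h, y):
    # 0-1 loss: counts 1 whenever the sign of the prediction disagrees with the label
    return 0.0 if h * y > 0 else 1.0

def hinge_loss(h, y):
    # hinge loss (used by SVMs): zero once the margin y*h reaches 1
    return max(0.0, 1.0 - y * h)

Note that the binary loss is neither convex nor continuous, which is why convex surrogates such as the quadratic and hinge losses are preferred in the training problem.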

Loss Functions: The Training Problem

Is a notion of Loss enough? What happens when we do not have enough data?

Overfitting and Model Complexity

[Figures: fitting 1st, 2nd, 3rd, and 9th order polynomials to the same data]
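The plots are not reproduced here, but the effect is easy to recreate. The following Python sketch (toy data, not the course's) fits polynomials of increasing degree to a handful of noisy points; the training error keeps shrinking as the degree grows, even though the high-degree fit generalizes poorly:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

for degree in (1, 2, 3, 9):
    coeffs = np.polyfit(x, y, degree)                      # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)  # error on the training points
    print(f"degree {degree}: training error {train_err:.4f}")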

Regularization

General Training Problem

● Goodness of fit, fidelity term, etc.

● Regularizer functions: penalize complexity

● λ: controls the tradeoff between fit and complexity

Exe:
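The training problem itself is not in the transcript; a minimal reconstruction of the general regularized problem that the bullets above annotate (notation assumed) is:

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( h_w(x_i), y_i \big) \;+\; \lambda \, R(w) ,
\]

where the first term measures goodness of fit, R(w) penalizes complexity, and λ ≥ 0 controls the tradeoff between the two.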

Overfitting and Model Complexity

Fitting a kth order polynomial

For λ big enough, the solution is a 2nd order polynomial.

Exe: Ridge Regression

● L2 loss, L2 regularizer, linear hypothesis
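Written out under these choices, the ridge regression problem presumably takes the following form (my notation; the slide's exact scaling may differ):

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2} \big( \langle w, x_i \rangle - y_i \big)^2 \;+\; \frac{\lambda}{2} \| w \|_2^2 .
\]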

Exe: Support Vector Machines (SVM with soft margin)

● Hinge loss, L2 regularizer, linear hypothesis
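Under the same pattern, the soft-margin SVM training problem can be written as (again my notation, not taken verbatim from the slide):

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \max\big( 0, \, 1 - y_i \langle w, x_i \rangle \big) \;+\; \frac{\lambda}{2} \| w \|_2^2 .
\]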

Exe: Logistic Regression

● Logistic loss, L2 regularizer, linear hypothesis
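And likewise for regularized logistic regression (notation assumed):

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \log\big( 1 + e^{- y_i \langle w, x_i \rangle} \big) \;+\; \frac{\lambda}{2} \| w \|_2^2 .
\]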

The Machine Learner's Job

A Datum Function

Finite Sum Training Problem

Re-writing as Sum of Terms

Can we use this sum structure?

The Training Problem
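The finite-sum form referred to here is not written out in the transcript; a standard way to express it (notation assumed) is

\[
\min_{w \in \mathbb{R}^d} \; f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w),
\qquad
f_i(w) = \ell\big( h_w(x_i), y_i \big) + \lambda R(w),
\]

where each f_i is the datum function associated with the i-th training example.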

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased Estimate: Let j be a random index sampled uniformly at random from {1, …, n}. Then the stochastic gradient of the j-th datum function is an unbiased estimate of the full gradient.

EXE:
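In symbols, the unbiased-estimate statement (whose equation is missing from the transcript) is presumably

\[
\mathbb{E}_j\big[ \nabla f_j(w) \big] \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) \;=\; \nabla f(w).
\]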

Stochastic Gradient Descent

[Figure: SGD iterates converging towards the optimal point]
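The update rule itself is not spelled out in this transcript. Below is a minimal Python sketch of SGD applied to the ridge regression problem above, assuming a data matrix X of shape (n, d), labels y, regularization parameter lam and a constant step size alpha; it is an illustration, not the course's lab code.

import numpy as np

def sgd_ridge(X, y, lam, alpha, n_iters, seed=0):
    """SGD sketch for f(w) = (1/n) * sum_i [ 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2 ]."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(n_iters):
        j = rng.integers(n)                              # sample one data point uniformly
        grad_j = (X[j] @ w - y[j]) * X[j] + lam * w      # gradient of the j-th datum function
        w -= alpha * grad_j                              # constant step size update
    return w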

Detailed Outline today

● 13:30 – 14:00: Introduction to empirical risk minimization, classification, and SGD

● 14:00 – 15:00: Revision on probability

● 15:00 – 15:30: Tea Time! Break

● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression

Assumptions for Convergence

Strong Convexity

Expected Bounded Stochastic Gradients
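The formal statements are not in the transcript; one common way to write these two assumptions (the constants μ and B below are my notation) is:

\[
f(w') \;\ge\; f(w) + \langle \nabla f(w), \, w' - w \rangle + \frac{\mu}{2} \| w' - w \|^2
\quad \text{for all } w, w'
\qquad (\mu\text{-strong convexity}),
\]
\[
\mathbb{E}_j\big[ \| \nabla f_j(w) \|^2 \big] \;\le\; B^2
\qquad (\text{expected bounded stochastic gradients}).
\]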

Complexity / Convergence

Theorem

EXE: Do exercises on convergence of random sequences.
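The theorem itself is missing from the transcript. A typical convergence result for SGD under the two assumptions above, with a constant step size 0 < α ≤ 1/(2μ), reads as follows; the slide's exact constants may differ:

\[
\mathbb{E}\big[ \| w^t - w^* \|^2 \big]
\;\le\;
(1 - 2 \alpha \mu)^t \, \| w^0 - w^* \|^2 \;+\; \frac{\alpha B^2}{2 \mu},
\]

i.e. the iterates contract linearly towards a neighbourhood of the optimum whose size is proportional to the step size.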

Stochastic Gradient Descent with different step sizes

[Figures: SGD runs for α = 0.01, 0.1, 0.2, 0.5]
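To reproduce this kind of comparison with the sgd_ridge sketch above (a hypothetical helper defined earlier in these notes, not the course's code), one could sweep the step size, assuming X and y as before and an assumed lam = 0.1:

for alpha in (0.01, 0.1, 0.2, 0.5):
    w = sgd_ridge(X, y, lam=0.1, alpha=alpha, n_iters=10_000)
    print(alpha, np.mean((X @ w - y) ** 2))   # compare fit quality across step sizes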