
Optimization for Machine Learning

Introduction to supervised learning

Lecturer: Robert M. Gower

Master IASD: AI Systems and Data Science, 2019

Core Info

● Where: ENS: 07/11 amphi Langevin, 03/12 U209, 05/12 amphi Langevin.

● Online: Teaching materials for these 3 classes: https://gowerrobert.github.io/

● Google Doc with course info: can also be found at https://gowerrobert.github.io/

Outline of my three classes

● 07/11/19 Foundations and the empirical risk problem, revision probability, SGD (Stochastic Gradient Descent) for ridge regression

● 03/12/19 SGD for convex optimization. Theory and variants

● 05/12/19 Lab on SGD and variants

Detailed Outline today

● 13:30 – 14:00: Introduction to empirical risk minimization, classification, and SGD

● 14:00 – 15:00: Revision on probability

● 15:00 – 15:30: Tea Time! Break

● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression

An Introduction to Supervised Learning

References for today's classes

● Convex Optimization, Stephen Boyd and Lieven Vandenberghe, pages 67 to 79

● Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David, Chapter 2

Is There a Cat in the Photo?

Yes / No

Find a mapping h that assigns the “correct” target to each input.

x: Input/Feature, y: Output/Target

Labeled Data: The training set

Learning Algorithm

y = -1 means no/false

Example: Linear Regression for Height

Labelled data (Male = 0, Female = 1):

Sex | Age | Height
  0 |  30 | 1.72 m
  1 |  70 | 1.52 m

Example Hypothesis: Linear Model

Example Training Problem:
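The equations on this slide are not part of the transcript; a plausible reconstruction of the linear hypothesis and the least-squares training problem (notation assumed, not taken verbatim from the slide) is:

\[
h_w(x) = \langle w, x \rangle,
\qquad
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \big( h_w(x_i) - y_i \big)^2 ,
\]

where each x_i collects the features (sex, age) and y_i is the measured height.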

Linear Regression for Height

The Training Algorithm

[Figure: Height vs. Age for Sex = 0, with the fitted line produced by the training algorithm]

Other options aside from linear?

Parametrizing the Hypothesis

[Figure: Height vs. Age with linear, polynomial, and neural-net fits]

Linear:

Polynomial:

Neural Net:
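The formulas that accompany each label are missing from the transcript; one common way to write them for a scalar input x (my notation, not necessarily the slide's) is:

\[
\text{Linear: } h_w(x) = w_0 + w_1 x,
\qquad
\text{Polynomial: } h_w(x) = \sum_{k=0}^{K} w_k x^k,
\qquad
\text{Neural net: } h_w(x) = W_2 \, \sigma(W_1 x + b_1) + b_2 ,
\]

with σ an elementwise nonlinearity such as the sigmoid or ReLU.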

Loss Functions

Why a Squared Loss?

The Training Problem

Typically a convex function

Choosing the Loss Function

● Quadratic Loss

● Binary Loss

● Hinge Loss

(y = 1 in all figures)

EXE: Plot the binary and hinge loss functions.
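The loss formulas themselves are not in the transcript; as a rough Python sketch (my own notation, not the slides'), for a prediction h = h_w(x) and a label y in {-1, 1} the three losses could be written as:

def quadratic_loss(h, y):
    # squared difference between prediction and label
    return 0.5 * (h - y) ** 2

def binary_loss(h, y):
    # 0-1 loss: counts 1 whenever the sign of the prediction disagrees with the label
    return 0.0 if h * y > 0 else 1.0

def hinge_loss(h, y):
    # hinge loss (used by SVMs): zero once the margin y*h reaches 1
    return max(0.0, 1.0 - y * h)

Note that the binary loss is neither convex nor continuous, which is why convex surrogates such as the quadratic and hinge losses are preferred in the training problem.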

Loss Functions: The Training Problem

Is a notion of Loss enough? What happens when we do not have enough data?

Overfitting and Model Complexity

[Figures: fitting 1st, 2nd, 3rd, and 9th order polynomials to the same data]
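The plots are not reproduced here, but the effect is easy to recreate. The following Python sketch (toy data, not the course's) fits polynomials of increasing degree to a handful of noisy points; the training error keeps shrinking as the degree grows, even though the high-degree fit generalizes poorly:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

for degree in (1, 2, 3, 9):
    coeffs = np.polyfit(x, y, degree)                      # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)  # error on the training points
    print(f"degree {degree}: training error {train_err:.4f}")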

Regularization

General Training Problem

● Goodness of fit, fidelity term, etc.

● Regularizer functions: penalize complexity

● λ: controls the tradeoff between fit and complexity

Exe:
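The training problem itself is not in the transcript; a minimal reconstruction of the general regularized problem that the bullets above annotate (notation assumed) is:

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big( h_w(x_i), y_i \big) \;+\; \lambda \, R(w) ,
\]

where the first term measures goodness of fit, R(w) penalizes complexity, and λ ≥ 0 controls the tradeoff between the two.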

Overfitting and Model Complexity

Fitting a kth order polynomial

For λ big enough, the solution is a 2nd order polynomial.

Exe: Ridge Regression

● L2 loss, L2 regularizer, linear hypothesis
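Written out under these choices, the ridge regression problem presumably takes the following form (my notation; the slide's exact scaling may differ):

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2} \big( \langle w, x_i \rangle - y_i \big)^2 \;+\; \frac{\lambda}{2} \| w \|_2^2 .
\]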

Exe: Support Vector Machines (SVM with soft margin)

● Hinge loss, L2 regularizer, linear hypothesis
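Under the same pattern, the soft-margin SVM training problem can be written as (again my notation, not taken verbatim from the slide):

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \max\big( 0, \, 1 - y_i \langle w, x_i \rangle \big) \;+\; \frac{\lambda}{2} \| w \|_2^2 .
\]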

Exe: Logistic Regression

● Logistic loss, L2 regularizer, linear hypothesis
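And likewise for regularized logistic regression (notation assumed):

\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \log\big( 1 + e^{- y_i \langle w, x_i \rangle} \big) \;+\; \frac{\lambda}{2} \| w \|_2^2 .
\]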

The Machine Learner's Job

A Datum Function

Finite Sum Training Problem

Re-writing as Sum of Terms

Can we use this sum structure?

The Training Problem
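The finite-sum form referred to here is not written out in the transcript; a standard way to express it (notation assumed) is

\[
\min_{w \in \mathbb{R}^d} \; f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w),
\qquad
f_i(w) = \ell\big( h_w(x_i), y_i \big) + \lambda R(w),
\]

where each f_i is the datum function associated with the i-th training example.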

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased Estimate: Let j be a random index sampled uniformly at random from {1, …, n}. Then the stochastic gradient of the j-th datum function is an unbiased estimate of the full gradient.

EXE:
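In symbols, the unbiased-estimate statement (whose equation is missing from the transcript) is presumably

\[
\mathbb{E}_j\big[ \nabla f_j(w) \big] \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) \;=\; \nabla f(w).
\]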

Stochastic Gradient Descent

[Figure: SGD iterates converging towards the optimal point]
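The update rule itself is not spelled out in this transcript. Below is a minimal Python sketch of SGD applied to the ridge regression problem above, assuming a data matrix X of shape (n, d), labels y, regularization parameter lam and a constant step size alpha; it is an illustration, not the course's lab code.

import numpy as np

def sgd_ridge(X, y, lam, alpha, n_iters, seed=0):
    """SGD sketch for f(w) = (1/n) * sum_i [ 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2 ]."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(n_iters):
        j = rng.integers(n)                              # sample one data point uniformly
        grad_j = (X[j] @ w - y[j]) * X[j] + lam * w      # gradient of the j-th datum function
        w -= alpha * grad_j                              # constant step size update
    return w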

Detailed Outline today

● 13:30 – 14:00: Introduction to empirical risk minimization, classification, and SGD

● 14:00 – 15:00: Revision on probability

● 15:00 – 15:30: Tea Time! Break

● 15:30 – 17:00: Exercises and proof of convergence of SGD for ridge regression

Assumptions for Convergence

Strong Convexity

Expected Bounded Stochastic Gradients
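The formal statements are not in the transcript; one common way to write these two assumptions (the constants μ and B below are my notation) is:

\[
f(w') \;\ge\; f(w) + \langle \nabla f(w), \, w' - w \rangle + \frac{\mu}{2} \| w' - w \|^2
\quad \text{for all } w, w'
\qquad (\mu\text{-strong convexity}),
\]
\[
\mathbb{E}_j\big[ \| \nabla f_j(w) \|^2 \big] \;\le\; B^2
\qquad (\text{expected bounded stochastic gradients}).
\]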

Complexity / Convergence

Theorem

EXE: Do exercises on convergence of random sequences.
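The theorem itself is missing from the transcript. A typical convergence result for SGD under the two assumptions above, with a constant step size 0 < α ≤ 1/(2μ), reads as follows; the slide's exact constants may differ:

\[
\mathbb{E}\big[ \| w^t - w^* \|^2 \big]
\;\le\;
(1 - 2 \alpha \mu)^t \, \| w^0 - w^* \|^2 \;+\; \frac{\alpha B^2}{2 \mu},
\]

i.e. the iterates contract linearly towards a neighbourhood of the optimum whose size is proportional to the step size.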

Stochastic Gradient Descent with different step sizes

[Figures: SGD runs for α = 0.01, 0.1, 0.2, 0.5]
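To reproduce this kind of comparison with the sgd_ridge sketch above (a hypothetical helper defined earlier in these notes, not the course's code), one could sweep the step size, assuming X and y as before and an assumed lam = 0.1:

for alpha in (0.01, 0.1, 0.2, 0.5):
    w = sgd_ridge(X, y, lam=0.1, alpha=alpha, n_iters=10_000)
    print(alpha, np.mean((X @ w - y) ** 2))   # compare fit quality across step sizes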