
UVA CS 4501: Machine Learning

Lecture 11: Support Vector Machine (Basics)

Dr. Yanjun Qi

University of Virginia
Department of Computer Science

Where are we? → Five major sections of this course

• Regression (supervised)
• Classification (supervised)
• Unsupervised models
• Learning theory
• Graphical models

Today

• Support Vector Machine (SVM)
  – History of SVM
  – Large Margin Linear Classifier
  – Define Margin (M) in terms of model parameters
  – Optimization to learn model parameters (w, b)
  – Linearly Non-separable case
  – Optimization with dual form
  – Nonlinear decision boundary
  – Multiclass SVM

History of SVM

• SVM is inspired by statistical learning theory [3]
• SVM was first introduced in 1992 [1]
• SVM became popular because of its success in handwritten digit recognition (1994)
  – 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4. See Section 5.11 in [2] or the discussion in [3] for details.
• Regarded as an important example of "kernel methods", arguably the hottest area in machine learning 20 years ago

[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5:144-152, Pittsburgh, 1992.

[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.

[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.

Theoretically sound / Impactful

[Figure: examples of handwritten digit recognition]

Applications of SVMs

• Computer Vision
• Text Categorization
• Ranking (e.g., Google searches)
• Handwritten Character Recognition
• Time series analysis
• Bioinformatics
• ...

→ Lots of very successful applications!

A Dataset for binary classification

• Data / points / instances / examples / samples / records: [rows]
• Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
• Target / outcome / response / label / dependent variable: special column to be predicted [last column]

Output as binary class: only two possibilities
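In code, such a dataset is simply a feature matrix X (one row per sample) plus a label vector y taken from the last column. A minimal NumPy sketch; the file name and comma-separated layout are hypothetical:

```python
import numpy as np

# Hypothetical file: one sample per row, the last column is the binary label.
data = np.loadtxt("dataset.csv", delimiter=",")
X = data[:, :-1]   # features/attributes/covariates: all columns but the last
y = data[:, -1]    # target/label: the special last column

# Binary class output: only two possibilities, here encoded as -1 and +1.
assert set(np.unique(y)) <= {-1.0, 1.0}
```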


Linear Classifiers

[Figure sequence: a 2-D dataset where + denotes class +1 and - denotes class -1, shown with several different candidate linear decision boundaries f(x) → y_est]

How would you classify this data?

Any of these would be fine... but which is best?

(Credit: Prof. Moore)

Classifier Margin

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.

(Credit: Prof. Moore)

Maximum Margin

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM, i.e., a Linear SVM).

Support vectors are those data points that the margin pushes up against.

f(x, w, b) = sign(wTx + b)

[Figure: the maximum-margin boundary in the (x1, x2) plane, with the support vectors lying on the margin]

(Credit: Prof. Moore)
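The resulting classifier is just the sign of an affine function. A minimal sketch of that decision rule, with placeholder values standing in for learned parameters:

```python
import numpy as np

def predict(X, w, b):
    """Linear SVM decision rule: f(x, w, b) = sign(wT x + b)."""
    return np.sign(X @ w + b)

# Illustrative (not learned) parameters for a 2-D problem:
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 1.0],
              [0.0, 3.0]])
print(predict(X, w, b))   # -> [ 1. -1.]
```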

Max margin classifiers

• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin from both sets of points

From all the possible boundary lines, this leads to the largest margin on both sides.

(Credit: Prof. Moore)

Max margin classifiers

• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin (of width D on each side) from points on both sides

Why MAX margin?

• Intuitive, 'makes sense'
• Some theoretical support (using VC dimension)
• Works well in practice

(Credit: Prof. Moore)

Max margin classifiers

• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin from points on both sides

That is why they are also known as linear support vector machines (SVMs): these boundary points are the vectors supporting the boundary.

(Credit: Prof. Moore)

How to represent a Linear Decision Boundary?

Review: Affine Hyperplanes

• https://en.wikipedia.org/wiki/Hyperplane
• Any hyperplane can be given in coordinates as the solution set of a single linear (algebraic) equation of degree 1.
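For instance (a concrete example, not from the slides): in $\mathbb{R}^2$, taking $w = (2, 3)^T$ and $b = -6$, the single linear equation

$$2x_1 + 3x_2 - 6 = 0$$

defines the line through $(3, 0)$ and $(0, 2)$; in general $w^T x + b = 0$ defines a hyperplane with normal vector $w$ and offset $b$.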

Q: How does this connect to linear regression?

Max-margin & Decision Boundary

[Figure: Class -1 and Class 1 points in the (x1, x2) plane, separated by the hyperplane wTx + b = 0]

w is a p-dimensional vector; b is a scalar.

Max-margin & Decision Boundary

• The decision boundary should be as far away from the data of both classes as possible

[Figure: Class -1 and Class 1 with the boundary centered between them]

w is a p-dimensional vector; b is a scalar.

Specifying a max margin classifier

[Figure: three parallel planes; the class +1 plane wTx + b = +1, the boundary wTx + b = 0, and the class -1 plane wTx + b = -1; predict class +1 on one side and class -1 on the other]

Classify as +1 if wTx + b >= 1
Classify as -1 if wTx + b <= -1
Undefined if -1 < wTx + b < 1

Specifying a max margin classifier

Is the linear separation assumption realistic?

We will deal with this shortly, but let's assume it for now.

Today

• Supervised Classification
• Support Vector Machine (SVM)
  – History of SVM
  – Large Margin Linear Classifier
  – Define Margin (M) in terms of model parameters
  – Optimization to learn model parameters (w, b)
  – Linearly Non-separable case
  – Optimization with dual form
  – Nonlinear decision boundary
  – Multiclass SVM

Maximizing the margin

• Let's define the width of the margin by M
• How can we encode our goal of maximizing M in terms of our parameters (w and b)?
• Let's start with a few observations

(Concrete derivations of M: see the Extra slides.)

Finding the optimal parameters

[Figure: the margin of width M between the planes wTx + b = +1 and wTx + b = -1, with points x+ and x- on the two planes]

$$M = \frac{2}{\sqrt{w^T w}}$$

We can now search for the optimal parameters by finding a solution that:

1. Correctly classifies all points
2. Maximizes the margin (or equivalently minimizes wTw)

Several optimization methods can be used: gradient descent, or SMO (see the extra slides).
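A quick numeric sketch of this margin formula, with an illustrative weight vector:

```python
import numpy as np

w = np.array([3.0, 4.0])     # illustrative weights, |w| = 5
M = 2.0 / np.sqrt(w @ w)     # margin width M = 2 / sqrt(wT w)
print(M)                     # -> 0.4: shrinking |w| widens the margin
```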

Today

• Support Vector Machine (SVM)
  – History of SVM
  – Large Margin Linear Classifier
  – Define Margin (M) in terms of model parameters
  – Optimization to learn model parameters (w, b)
  – Linearly Non-separable case
  – Optimization with dual form
  – Nonlinear decision boundary
  – Practical Guide

Optimization Step, i.e., learning optimal parameters for SVM

[Figure: the margin M = 2 / sqrt(wTw) between the planes wTx + b = +1 and wTx + b = -1]

Min (wTw)/2 subject to the following constraints:

For all x in class +1:  wTx + b >= 1
For all x in class -1:  wTx + b <= -1

A total of n constraints if we have n training samples.
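The n constraints can be checked mechanically in one vectorized line, since the two per-class inequalities combine into y_i (wT x_i + b) >= 1. A small sketch; the candidate (w, b) and data are whatever you are testing:

```python
import numpy as np

def satisfies_margin_constraints(X, y, w, b):
    """True if every sample obeys y_i * (wT x_i + b) >= 1."""
    return bool(np.all(y * (X @ w + b) >= 1.0))
```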

Optimization Reformulation

The two per-class constraints can be combined into one per sample, since multiplying by the label yi in {-1, +1} flips the inequality for the negative class. Written compactly:

$$\operatorname*{argmin}_{w,b} \sum_{i=1}^{p} w_i^2 \qquad \text{subject to } \forall x_i \in D_{train} : y_i \left( w^T x_i + b \right) \ge 1$$

Quadratic objective with linear constraints, i.e., quadratic programming:
• Quadratic objective
• Linear constraints
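Because the objective is quadratic and the constraints linear, any off-the-shelf QP solver applies. A minimal sketch of the hard-margin primal using cvxopt; it assumes linearly separable data with labels in {-1, +1}, and is an illustration rather than the lecture's own implementation:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min (1/2) wT w  s.t.  y_i (wT x_i + b) >= 1 as a QP."""
    n, p = X.shape
    # QP variable z = [w_1, ..., w_p, b]; only w appears in the objective.
    P = np.zeros((p + 1, p + 1))
    P[:p, :p] = np.eye(p)
    q = np.zeros(p + 1)
    # y_i (wT x_i + b) >= 1  rewritten as  -y_i (wT x_i + b) <= -1
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol["x"]).ravel()
    return z[:p], z[p]   # w, b

# Tiny separable toy set (illustrative):
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
```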

What Next?

• Support Vector Machine (SVM)
  – History of SVM
  – Large Margin Linear Classifier
  – Define Margin (M) in terms of model parameters
  – Optimization to learn model parameters (w, b)
  – Linearly Non-separable case (soft SVM)
  – Optimization with dual form
  – Nonlinear decision boundary
  – Practical Guide

Support Vector Machine

Task: classification
Representation: kernel trick, K(xi, xj)
Score Function: margin + hinge loss
Search/Optimization: QP with dual form
Models, Parameters: dual weights

Soft-margin primal:

$$\operatorname*{argmin}_{w,b} \sum_{i=1}^{p} w_i^2 + C \sum_{i=1}^{n} \varepsilon_i \qquad \text{subject to } \forall x_i \in D_{train} : y_i \left( x_i \cdot w + b \right) \ge 1 - \varepsilon_i$$

Kernel:

$$K(x, z) := \Phi(x)^T \Phi(z)$$

Dual form:

$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \qquad \text{subject to } \sum_i \alpha_i y_i = 0, \; \alpha_i \ge 0 \; \forall i$$

with primal weights recovered as

$$w = \sum_i \alpha_i y_i x_i$$
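In practice, library SVMs solve this dual. A brief sketch with scikit-learn's SVC: its dual_coef_ attribute stores the products alpha_i * y_i for the support vectors, so the primal weights w = sum_i alpha_i y_i x_i drop out as one matrix product (toy data reused from the QP sketch above):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# dual_coef_ holds alpha_i * y_i for each support vector, so
# w = sum_i alpha_i y_i x_i is a single matrix product:
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b = clf.intercept_[0]
print(np.sign(X @ w + b))   # recovers the training labels
```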

References

• Big thanks to Prof. Ziv Bar-Joseph and Prof. Eric Xing @ CMU for allowing me to reuse some of their slides
• Elements of Statistical Learning, by Hastie, Tibshirani and Friedman
• Prof. Andrew Moore @ CMU's slides
• Tutorial slides from Dr. Tie-Yan Liu, MSR Asia
• A Practical Guide to Support Vector Classification, Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, 2003-2010
• Tutorial slides from Stanford "Convex Optimization I", Boyd & Vandenberghe

EXTRA

[Figure: illustration of a function's gradient field]

The gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction.
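This is the geometric fact behind Observation 1 below: for the affine function $f(x) = w^T x + b$ the gradient is constant,

$$\nabla f(x) = w,$$

so $w$ points in the direction in which $w^T x + b$ increases fastest, and is therefore perpendicular to every level set $w^T x + b = c$, including the planes with $c = -1, 0, +1$.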

How to define the width of the margin by M (EXTRA)

[Figure: the margin M between the planes wTx + b = +1, wTx + b = 0, and wTx + b = -1]

Classify as +1 if wTx + b >= 1
Classify as -1 if wTx + b <= -1
Undefined if -1 < wTx + b < 1

• Let's define the width of the margin by M
• How can we encode our goal of maximizing M in terms of our parameters (w and b)?
• Let's start with a few observations

Margin M

[Figure: geometric setup for deriving the margin M]

•  wT x+ + b = +1

•  wT x- + b = -1

•  M = | x+ - x- | = ?

Maximizing the margin: observation 1

[Figure: the planes wTx + b = +1, 0, -1 with the margin M, and the vector w drawn normal to them]

• Observation 1: the vector w is orthogonal to the +1 plane
• Why? Let u and v be two points on the +1 plane, so that wTu + b = 1 and wTv + b = 1. Subtracting, for the vector defined by u and v we have wT(u - v) = 0.
• Corollary: the vector w is orthogonal to the -1 plane.
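A two-line numeric check of this claim; u and v are illustrative points chosen to lie on the same plane wTx + b = 1:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -1.0
u = np.array([2.0, 0.0])   # wT u + b = 1 -> u is on the +1 plane
v = np.array([0.0, 2.0])   # wT v + b = 1 -> v is on the +1 plane
print(w @ (u - v))         # -> 0.0: w is orthogonal to the plane
```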

Review: Vector Product, Orthogonal, and Norm

For two vectors x and y, xTy is called the (inner) vector product.

The square root of the product of a vector with itself, sqrt(xTx), is called the 2-norm (|x|2); it can also be written as |x|.

x and y are called orthogonal if xTy = 0.
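The same three definitions in NumPy, with illustrative vectors:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([-4.0, 3.0])

inner = x @ y             # inner product xT y -> 0.0
norm = np.sqrt(x @ x)     # 2-norm |x| = sqrt(xT x) -> 5.0
print(inner == 0.0)       # True: x and y are orthogonal
```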


Maximizing the margin: observation 2

• Observation 1: the vector w is orthogonal to the +1 and -1 planes
• Observation 2: if x+ is a point on the +1 plane and x- is the closest point to x+ on the -1 plane, then x+ = λw + x-

Since w is orthogonal to both planes, we need to 'travel' some distance along w to get from x- to x+.

Putting it together

• wT x+ + b = +1
• wT x- + b = -1
• M = | x+ - x- | = ?
• x+ = λw + x-

Putting it together

[Figure: the margin M between the planes, with x+ and x- marked]

• wT x+ + b = +1
• wT x- + b = -1
• x+ = λw + x-
• | x+ - x- | = M

We can now define M in terms of w and b.

Putting it together

• wT x+ + b = +1
• wT x- + b = -1
• x+ = λw + x-
• | x+ - x- | = M

We can now define M in terms of w and b:

wT x+ + b = +1
=> wT (λw + x-) + b = +1
=> wT x- + b + λ wTw = +1
=> -1 + λ wTw = +1
=> λ = 2 / (wTw)

Putting it together

• wT x+ + b = +1
• wT x- + b = -1
• x+ = λw + x-
• | x+ - x- | = M
• λ = 2 / (wTw)

So:

$$M = |x^+ - x^-| = |\lambda w| = \lambda |w| = \lambda \sqrt{w^T w} = \frac{2 \sqrt{w^T w}}{w^T w} = \frac{2}{\sqrt{w^T w}}$$
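A final numeric check of the whole derivation; w, b, and x- are illustrative, with x- chosen to sit on the -1 plane:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), 0.0
lam = 2.0 / (w @ w)                # lambda = 2 / (wT w) -> 1.0
x_minus = np.array([-0.5, -0.5])   # wT x- + b = -1
x_plus = x_minus + lam * w         # x+ = lambda * w + x-

print(w @ x_plus + b)                     # -> 1.0: x+ lies on the +1 plane
print(np.linalg.norm(x_plus - x_minus))   # M = |x+ - x-| ~ 1.414
print(2.0 / np.sqrt(w @ w))               # matches 2 / sqrt(wT w) = sqrt(2)
```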