Multiclass Classification
Transcript of Multiclass Classification
Administrivia
‣ Problem Set 1 graded (on Gradescope)
‣ Programming Project 1 is released (due 9/20)
‣ Reading: Eisenstein 2.0-2.5, 4.1, 4.3-4.5
‣ Optional readings related to Project 1 were posted by the TA on Piazza
Entity Linking
‣ 4,500,000 classes (all articles in Wikipedia)
"Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005."
Candidate articles for the mention "Armstrong":
‣ "Lance Edward Armstrong is an American former professional road cyclist…"
‣ "Armstrong County is a county in Pennsylvania…"
?? (which article does the mention refer to?)
Binary Classification
‣ Binary classification: one weight vector defines positive and negative classes
[figure: positive (+) and negative (-) points separated by a single linear boundary]
Multiclass Classification
‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest
[figure: points from classes 1, 2, and 3; each classifier separates one class from the other two]
‣ How do we reconcile multiple positive predictions? Highest score?
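The lecture does not commit to a particular binary learner for one-vs-all, so the sketch below fills that in with simple perceptrons (an assumption, not from the slides); the only point being illustrated is the reconciliation rule: break ties by taking the highest raw score.

```python
import numpy as np

def train_one_vs_all(X, y, num_classes, epochs=20):
    """Train k binary perceptrons; classifier k treats class k as positive
    and every other class as negative."""
    W = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            for k in range(num_classes):
                target = 1 if yi == k else -1
                if target * W[k].dot(xi) <= 0:  # classifier k got this wrong
                    W[k] += target * xi
    return W

def predict(W, x):
    # Reconcile multiple (or zero) positive predictions: take the highest score
    return int(np.argmax(W.dot(x)))
```

Even when several classifiers fire, or none do, the argmax over raw scores always yields a single label.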
Multiclass Classification
‣ Not all classes may even be separable using this approach
[figure: classes 1 and 2 on either side, class 3 in between]
‣ Can separate 1 from 2+3 and 2 from 1+3 but not 3 from the others (with these features)
Multiclass Classification
‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes
[figure: separate pairwise classifiers, e.g. 1 vs. 3 and 1 vs. 2]
‣ Again, how to reconcile?
Multiclass Classification
[figure: the binary +/- example next to the three-class example]
‣ Binary classification: one weight vector defines both classes
‣ Multiclass classification: different weights and/or features per class
Multiclass Classification
‣ Formally: instead of two labels, we have an output space Y containing a number of possible classes
‣ Decision rule: argmax_{y ∈ Y} w^T f(x, y)
‣ Multiple feature vectors, one weight vector (features depend on the choice of label now! note: this isn't the gold label)
‣ Can also have one weight vector per class: argmax_{y ∈ Y} w_y^T f(x)
‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees
‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't
Block Feature Vectors
‣ Decision rule: argmax_{y ∈ Y} w^T f(x, y)
Example x: "too many drug trials, too few patients"; labels: Health, Sports, Science
‣ Base feature function: f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]
‣ Feature vector blocks for each label:
f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
‣ Each block feature conjoins a base feature with a label, e.g. I[contains drug & label = Health]
‣ Equivalent to having three weight vectors in this case
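The block construction above can be sketched directly: copy the base features into the block for the chosen label and leave the other blocks zero. The tokenization by whitespace is a simplifying assumption, not from the lecture.

```python
def base_features(text):
    """I[contains drug], I[contains patients], I[contains baseball]."""
    words = text.split()  # naive whitespace tokenization (assumption)
    return [1 if "drug" in words else 0,
            1 if "patients" in words else 0,
            1 if "baseball" in words else 0]

LABELS = ["Health", "Sports", "Science"]

def block_features(text, label):
    """f(x, y): base features placed in the block for label y, zeros elsewhere."""
    base = base_features(text)
    f = [0] * (len(base) * len(LABELS))
    offset = LABELS.index(label) * len(base)
    f[offset:offset + len(base)] = base
    return f

x = "too many drug trials, too few patients"
block_features(x, "Health")  # [1, 1, 0, 0, 0, 0, 0, 0, 0]
block_features(x, "Sports")  # [0, 0, 0, 1, 1, 0, 0, 0, 0]
```

A single weight vector w over all nine positions then behaves exactly like three per-class weight vectors.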
Making Decisions
Example x: "too many drug trials, too few patients"; labels: Health, Sports, Science
f(x) = [I[contains drug], I[contains patients], I[contains baseball]]
f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]
w^T f(x, y) = Health: +4.4, Sports: -5.9, Science: -0.6 (e.g. "word drug in Science article" = +1.1)
argmax = Health
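The decision on this slide can be reproduced directly; the feature vectors and weights below are exactly the ones shown above.

```python
w = [2.1, 2.3, -5.0, -2.1, -3.8, 0.0, 1.1, -1.7, -1.3]

def score(w, f):
    """w^T f(x, y) as a plain dot product."""
    return sum(wi * fi for wi, fi in zip(w, f))

feats = {"Health":  [1, 1, 0, 0, 0, 0, 0, 0, 0],
         "Sports":  [0, 0, 0, 1, 1, 0, 0, 0, 0],
         "Science": [0, 0, 0, 0, 0, 0, 1, 1, 0]}

scores = {y: score(w, f) for y, f in feats.items()}
# Health: +4.4, Sports: -5.9, Science: -0.6
prediction = max(scores, key=scores.get)  # 'Health'
```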
Another example: POS tagging (candidate tags for blocks: NNS, VBZ, NN, DT, …)
the router blocks the packets
‣ Classify blocks as one of 36 POS tags
‣ Example x: sentence with a word (in this case, blocks) highlighted
‣ Extract features with respect to this word:
f(x, y = VBZ) = [I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]]
(not saying that the is tagged as VBZ! saying that the follows the VBZ word)
‣ Next two lectures: sequence labeling!
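One common way to realize these conjoined indicator features is as strings naming the active features (a representation choice assumed here, not prescribed by the lecture):

```python
def pos_features(sentence, idx, tag):
    """Active indicator features for tagging sentence[idx] with `tag`.
    Each string names one I[... & tag = ...] feature that fires."""
    feats = [
        f"curr_word={sentence[idx]}&tag={tag}",
        f"curr_suffix={sentence[idx][-1]}&tag={tag}",  # last character as suffix
    ]
    if idx > 0:
        feats.append(f"prev_word={sentence[idx-1]}&tag={tag}")
    if idx + 1 < len(sentence):
        feats.append(f"next_word={sentence[idx+1]}&tag={tag}")
    return feats

sent = "the router blocks the packets".split()
pos_features(sent, 2, "VBZ")
# ['curr_word=blocks&tag=VBZ', 'curr_suffix=s&tag=VBZ',
#  'prev_word=router&tag=VBZ', 'next_word=the&tag=VBZ']
```

A feature hasher or dictionary then maps each string to an index in w, giving one block of weights per tag.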
Multiclass Logistic Regression
‣ Compare to binary: P(y = 1|x) = exp(w^T f(x)) / (1 + exp(w^T f(x)))
(the negative class implicitly had f(x, y = 0) = the zero vector)
‣ Multiclass: P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
(sum over output space to normalize)
‣ Softmax function
Multiclass Logistic Regression
Why? Interpret raw classifier scores as probabilities
P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
(sum over output space to normalize)
Example x: "too many drug trials, too few patients"
w^T f(x, y): Health: +1.8, Sports: +3.1, Science: -0.6
exp → 6.05, 22.2, 0.55 (unnormalized probabilities; probabilities must be >= 0)
normalize → 0.21, 0.77, 0.02 (probabilities must sum to 1)
Multiclass Logistic Regression
P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
(sum over output space to normalize)
‣ Training: maximize the log-likelihood (j indexes data points):
L(x, y) = Σ_{j=1}^n log P(y*_j | x_j) = Σ_{j=1}^n [ w^T f(x_j, y*_j) - log Σ_y exp(w^T f(x_j, y)) ]
‣ i.e. minimize negative log likelihood or cross-entropy loss
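The per-example term of this objective can be computed stably with a log-sum-exp; this is a minimal sketch, with the feature layout left abstract as a list of per-label vectors.

```python
import numpy as np

def log_likelihood(w, feats_per_label, gold):
    """log P(y* | x) = w^T f(x, y*) - log sum_y exp(w^T f(x, y)).
    np.logaddexp.reduce computes the log-sum-exp without overflow."""
    scores = np.array([w.dot(f) for f in feats_per_label])
    return scores[gold] - np.logaddexp.reduce(scores)
```

Summing this quantity over all examples gives the training objective that gradient ascent maximizes.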
Multiclass Logistic Regression
P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
(sum over output space to normalize)
Example x: "too many drug trials, too few patients"
w^T f(x, y): Health: +1.8, Sports: +3.1, Science: -0.6
exp → 6.05, 22.2, 0.55 (unnormalized probabilities; must be >= 0)
normalize → 0.21, 0.77, 0.02 (must sum to 1)
compare to the correct (gold) probabilities: 1.00, 0.00, 0.00
L(x, y) = Σ_{j=1}^n log P(y*_j | x_j);  L(x_j, y*_j) = w^T f(x_j, y*_j) - log Σ_y exp(w^T f(x_j, y))
Here log P(y* | x) = log(0.21) = -1.56
Q: max/min of log prob.?
Training
‣ Multiclass logistic regression
‣ Likelihood: L(x_j, y*_j) = w^T f(x_j, y*_j) - log Σ_y exp(w^T f(x_j, y))
∂/∂w_i L(x_j, y*_j) = f_i(x_j, y*_j) - Σ_y f_i(x_j, y) exp(w^T f(x_j, y)) / Σ_y exp(w^T f(x_j, y))
  = f_i(x_j, y*_j) - Σ_y f_i(x_j, y) P_w(y|x_j)
  = f_i(x_j, y*_j) - E_y[f_i(x_j, y)]
(gold feature value minus the model's expectation of the feature value)
where P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
Training
Example x: "too many drug trials, too few patients", y* = Health
f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
P_w(y|x) = [0.21, 0.77, 0.02]
∂/∂w_i L(x_j, y*_j) = f_i(x_j, y*_j) - Σ_y f_i(x_j, y) P_w(y|x_j)
gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21·[1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77·[0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02·[0, 0, 0, 0, 0, 0, 1, 1, 0]
  = [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]
update: [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]
new P_w(y|x) = [0.89, 0.10, 0.01]
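The worked update above can be checked end to end; the numbers are the slide's, and the only added assumption is a step size of 1, which the slide's arithmetic implies.

```python
import numpy as np

f_gold = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)  # y* = Health
feats = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],               # Health
                  [0, 0, 0, 1, 1, 0, 0, 0, 0],               # Sports
                  [0, 0, 0, 0, 0, 0, 1, 1, 0]], dtype=float) # Science
probs = np.array([0.21, 0.77, 0.02])                         # P_w(y|x)

# Gold feature values minus the model's expected feature values
grad = f_gold - probs.dot(feats)
# grad = [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

w = np.array([1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3])
w = w + grad                       # one gradient-ascent step, step size 1

new_scores = feats.dot(w)
new_probs = np.exp(new_scores) / np.exp(new_scores).sum()
# new_probs ≈ [0.89, 0.10, 0.01]: the gold label Health gained probability
```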
Multiclass Logistic Regression: Summary
‣ Model: P_w(y|x) = exp(w^T f(x, y)) / Σ_{y' ∈ Y} exp(w^T f(x, y'))
‣ Inference: argmax_y P_w(y|x)
‣ Learning: gradient ascent on the discriminative log-likelihood
f(x, y*) - E_y[f(x, y)] = f(x, y*) - Σ_y [P_w(y|x) f(x, y)]
("towards gold feature value, away from expectation of feature value")
Soft Margin SVM
Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
s.t. ∀j (2y_j - 1)(w^T x_j) >= 1 - ξ_j
     ∀j ξ_j >= 0
(slack variables: ξ_j > 0 iff the example is a support vector)
Image credit: Lang Van Tran
Multiclass SVM
‣ Correct prediction now has to beat every other class
Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
s.t. ∀j ∀y ∈ Y: w^T f(x_j, y*_j) >= w^T f(x_j, y) + ℓ(y, y*_j) - ξ_j
     ∀j ξ_j >= 0
‣ The 1 that was here is replaced by a loss function
‣ Score comparison is more explicit now
(slack variables: ξ_j > 0 iff the example is a support vector)
Training (loss-augmented)
‣ Are all decisions equally costly?
‣ We can define a loss function ℓ(y, y*)
Example x: "too many drug trials, too few patients", gold label: Health
Predicted Sports: bad error; predicted Science: not so bad
ℓ(Sports, Health) = 3
ℓ(Science, Health) = 1
Loss-Augmented Decoding
w^T f(x, y) + ℓ(y, y*): Health: 2.4 + 0, Sports: 1.3 + 3, Science: 1.8 + 1
Constraint: ∀j ∀y ∈ Y: w^T f(x_j, y*_j) >= w^T f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ Does gold beat every label + loss? No!
‣ Most violated constraint is Sports; what is ξ_j?
‣ ξ_j = 4.3 - 2.4 = 1.9
‣ Perceptron would make no update here
Loss-Augmented Decoding
Example x: "too many drug trials, too few patients"
w^T f(x, y): Health +2.4, Sports +1.3, Science +1.8 (plain argmax: Health)
Loss ℓ(y, y*): 0, 3, 1
Total: 2.4, 4.3, 2.8
ξ_j = max_{y ∈ Y} [w^T f(x_j, y) + ℓ(y, y*_j)] - w^T f(x_j, y*_j)
‣ Sports is the most violated constraint, slack = 4.3 - 2.4 = 1.9
‣ Perceptron would make no update; regular SVM would pick Science
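The comparison between plain and loss-augmented decoding on this slide can be written out directly using the scores and losses above.

```python
scores = {"Health": 2.4, "Sports": 1.3, "Science": 1.8}  # w^T f(x, y)
loss   = {"Health": 0.0, "Sports": 3.0, "Science": 1.0}  # l(y, y*), y* = Health
gold = "Health"

# Plain decoding: argmax of raw scores (what the perceptron uses).
# Gold already wins, so the perceptron makes no update here.
plain = max(scores, key=scores.get)                      # 'Health'

# Loss-augmented decoding: argmax_y [w^T f(x, y) + l(y, y*)]
augmented = {y: scores[y] + loss[y] for y in scores}
y_max = max(augmented, key=augmented.get)                # 'Sports'
slack = max(0.0, augmented[y_max] - scores[gold])        # 4.3 - 2.4 = 1.9
```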
Multiclass SVM
Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
s.t. ∀j ξ_j >= 0
     ∀j ∀y ∈ Y: w^T f(x_j, y*_j) >= w^T f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:
ξ_j = max_{y ∈ Y} [w^T f(x_j, y) + ℓ(y, y*_j)] - w^T f(x_j, y*_j)
‣ Plug in the gold y and you get 0, so slack is always nonnegative!
Computing the Subgradient
Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
s.t. ∀j ξ_j >= 0
     ∀j ∀y ∈ Y: w^T f(x_j, y*_j) >= w^T f(x_j, y) + ℓ(y, y*_j) - ξ_j
ξ_j = max_{y ∈ Y} [w^T f(x_j, y) + ℓ(y, y*_j)] - w^T f(x_j, y*_j)
‣ If ξ_j = 0, the example is not a support vector and the gradient is zero
‣ Otherwise, ∂ξ_j/∂w_i = f_i(x_j, y_max) - f_i(x_j, y*_j)
(the update looks backwards: we're minimizing here!)
‣ Perceptron-like, but we update away from the *loss-augmented* prediction
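A single subgradient step on the slack term can be sketched as follows; the λ‖w‖² regularizer's gradient is deliberately omitted here, and the step size alpha is an illustrative assumption.

```python
import numpy as np

def svm_subgradient_step(w, feats_per_label, losses, gold, alpha=0.1):
    """One step on the slack xi_j for a single example. The subgradient is
    f(x, y_max) - f(x, y*), where y_max is the loss-augmented argmax;
    if the gold constraint is already satisfied (xi_j = 0), no update."""
    augmented = np.array([w.dot(f) + l for f, l in zip(feats_per_label, losses)])
    y_max = int(np.argmax(augmented))
    if augmented[y_max] - w.dot(feats_per_label[gold]) <= 0:
        return w  # not a support vector: zero gradient
    # We are minimizing, so step opposite the subgradient
    return w - alpha * (feats_per_label[y_max] - feats_per_label[gold])
```

With the slide's numbers this moves weight toward the gold label and away from the loss-augmented prediction, exactly the perceptron-like behavior described above.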
Putting it Together
Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
s.t. ∀j ξ_j >= 0
     ∀j ∀y ∈ Y: w^T f(x_j, y*_j) >= w^T f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ (Unregularized) gradients:
‣ SVM: f(x, y*) - f(x, y_max) (loss-augmented max)
‣ Log reg: f(x, y*) - E_y[f(x, y)] = f(x, y*) - Σ_y [P_w(y|x) f(x, y)]
‣ SVM: max over y's to compute the gradient. LR: need to sum over y's
Softmax Margin
‣ Can we include a loss function in logistic regression?
P(y|x) = exp(w^T f(x, y) + ℓ(y, y*)) / Σ_{y'} exp(w^T f(x, y') + ℓ(y', y*))
‣ Likelihood is artificially higher for things with high loss, so training needs to work even harder to maximize the likelihood of the right thing!
[figure: distribution over labels with the right answer flanked by high-loss and low-loss wrong answers]
‣ Biased estimator for the original likelihood, but better loss. Gimpel and Smith (2010)
Entity Linking
"Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005."
Candidates: "Lance Edward Armstrong is an American former professional road cyclist…" / "Armstrong County is a county in Pennsylvania…"
‣ 4.5M classes, not enough data to learn features like "Tour de France <-> en/wiki/Lance_Armstrong"
‣ Instead, features f(x, y) look at the actual article associated with y
Entity Linking
Context x: the sentence mentioning Armstrong above; candidate articles y: Lance Edward Armstrong, Armstrong County
‣ tf-idf(doc, w) = freq of w in doc * log(4.5M / # Wiki articles w occurs in)
‣ the: occurs in every article, tf-idf = 0
‣ cyclist: occurs in 1% of articles, tf-idf = # occurrences * log10(100)
‣ tf-idf(doc) = vector of tf-idf(doc, w) for all words in the vocabulary (50,000)
‣ f(x, y) = [cos(tf-idf(x), tf-idf(y)), …other features]
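The tf-idf cosine feature can be sketched with sparse dict vectors; the document-frequency table passed in is hypothetical (in practice it would be precomputed over all of Wikipedia).

```python
import math
from collections import Counter

NUM_ARTICLES = 4_500_000  # number of Wikipedia articles / classes

def tf_idf(doc_words, doc_freq):
    """tf-idf(doc, w) = count of w in doc * log10(N / # articles containing w)."""
    counts = Counter(doc_words)
    return {w: c * math.log10(NUM_ARTICLES / doc_freq.get(w, 1.0))
            for w, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0
```

The feature vector f(x, y) then includes cosine(tf_idf(context), tf_idf(article)) for each candidate article y.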
Recap
‣ Four elements of a machine learning method:
‣ Model: probabilistic, max-margin, deep neural network
‣ Inference: just maxes and simple expectations so far, but will get harder
‣ Training: gradient descent?
‣ Objective:
Optimization
‣ Gradient descent
‣ Batch update for logistic regression
‣ Each update is based on a computation over the entire dataset
[figure: loss L(w) as a function of w, with minimum L_min]
Optimization
‣ Gradient descent
‣ Batch update for logistic regression
‣ Each update is based on a computation over the entire dataset
‣ Very simple to code up
[figure: contour plot of the loss over weights W_1, W_2; image credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS231n Lecture 7, April 24, 2018]
Optimization
‣ Stochastic gradient descent: w ← w + αg,  g = ∂L/∂w
‣ Approx. gradient is computed on a single instance
Q: What if the loss changes quickly in one direction and slowly in another direction?
(The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large)
[figure: contour plot of such a loss]
Optimization
‣ Stochastic gradient descent: w ← w + αg,  g = ∂L/∂w
‣ Approx. gradient is computed on a single instance
Q: What if the loss changes quickly in one direction and slowly in another direction?
‣ Gradient descent makes very slow progress along the shallow dimension and jitters along the steep direction
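The SGD update itself really is "very simple to code up"; this sketch follows the slide's ascent form w ← w + αg, with the per-instance gradient supplied as a function (the toy objective in the usage is an illustrative assumption).

```python
import numpy as np

def sgd(gradient_fn, data, w0, alpha=0.1, epochs=50):
    """Stochastic gradient ascent: for each single instance, take a step
    w <- w + alpha * g along that instance's gradient g."""
    w = np.array(w0, dtype=float)
    for _ in range(epochs):
        for example in data:
            w = w + alpha * gradient_fn(w, example)
    return w
```

Plugging in the per-example log-likelihood gradient (gold features minus expected features) trains multiclass logistic regression.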
Optimization
‣ Stochastic gradient descent: w ← w + αg,  g = ∂L/∂w
‣ Very simple to code up
‣ What if the loss function has local minima or saddle points?
"Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", Dauphin et al. (2014)
Optimization
‣ Stochastic gradient descent: w ← w + αg,  g = ∂L/∂w
‣ Very simple to code up
‣ "First-order" technique: only relies on having the gradient
[figure: first-order optimization of Loss vs. w1: (1) use the gradient to form a linear approximation, (2) step to minimize the approximation]
[figure: second-order optimization of Loss vs. w1: (1) use the gradient and Hessian to form a quadratic approximation, (2) step to the minimum of the approximation]
Optimization (extracurricular)
‣ Stochastic gradient descent: w ← w + αg,  g = ∂L/∂w
‣ Very simple to code up
‣ "First-order" technique: only relies on having the gradient
‣ Setting the step size is hard (decrease when held-out performance worsens?)
‣ Newton's method: w ← w + (∂²L/∂w²)^{-1} g
‣ Second-order technique; optimizes a quadratic instantly
‣ Inverse Hessian: n x n matrix, expensive!
‣ Quasi-Newton methods: L-BFGS, etc. approximate the inverse Hessian
AdaGrad (extracurricular)
Duchi et al. (2011)
‣ Optimized for problems with sparse features
‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently
‣ Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension ("per-parameter learning rates" or "adaptive learning rates")
Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
AdaGrad (extracurricular)
Duchi et al. (2011)
‣ Optimized for problems with sparse features
‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently
w_i ← w_i + α (1 / sqrt(ε + Σ_{τ=1}^t g²_{τ,i})) g_{t,i}
((smoothed) sum of squared gradients from all updates)
‣ Generally more robust than SGD, requires less tuning of the learning rate
‣ Other techniques for optimizing deep models: more later!
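The AdaGrad update above can be sketched as a small stateful optimizer; alpha and eps values are illustrative assumptions, and the step follows the slide's ascent form w ← w + α · g / sqrt(ε + Σ g²).

```python
import numpy as np

class AdaGrad:
    """Keep a running sum of squared gradients per parameter and divide each
    step by its (smoothed) square root, so frequently-updated parameters get
    smaller effective learning rates (Duchi et al., 2011)."""
    def __init__(self, dim, alpha=0.5, eps=1e-8):
        self.alpha = alpha
        self.eps = eps
        self.sum_sq = np.zeros(dim)

    def step(self, w, g):
        # Gradient *ascent*, matching w <- w + alpha * ... on the slide
        self.sum_sq += g * g
        return w + self.alpha * g / np.sqrt(self.eps + self.sum_sq)
```

For sparse features this is the point: a rare feature's weight still gets large steps when it finally fires, while frequent features settle down.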
Summary
‣ Design tradeoffs need to reflect interactions:
‣ Model and objective are coupled: probabilistic model <-> maximize likelihood
‣ …but not always: a linear model or neural network can be trained to minimize any differentiable loss function
‣ Inference governs what learning is possible: need to be able to compute expectations to use logistic regression