From Binary to Multiclass Classification · 2020. 1. 14.
CS6355: Structured Prediction
From Binary to Multiclass Classification
1
We have seen binary classification
• We have seen linear models
• Learning algorithms
– Perceptron
– SVM
– Logistic Regression
• Prediction is simple
– Given an example 𝐱, output = sgn(𝐰ᵀ𝐱)
– Output is a single bit
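As a quick refresher, the binary prediction rule is a one-liner (the weights and example below are hypothetical, invented for illustration):

```python
import numpy as np

def predict_binary(w, x):
    """Binary linear prediction: a single bit, sgn(w^T x)."""
    return 1 if w @ x >= 0 else -1

w = np.array([1.0, -2.0, 0.5])   # hypothetical learned weights
x = np.array([0.2, 0.1, 1.0])    # hypothetical example
print(predict_binary(w, x))      # w @ x = 0.2 - 0.2 + 0.5 = 0.5, so +1
```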
2
What if we have more than two labels?
3
Reading for next lecture:
Erin L. Allwein, Robert E. Schapire, Yoram Singer, Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, ICML 2000.
4
Multiclass classification
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
5
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
6
What is multiclass classification?
• An input can belong to one of K classes
• Training data: examples associated with a class label (a number from 1 to K)
• Prediction: Given a new input, predict the class label
• Each input belongs to exactly one class. Not more, not less.
– Otherwise, the problem is not multiclass classification
• If an input can be assigned multiple labels (think tags for emails rather than folders), it is called multi-label classification
7
Example applications: Images
– Input: hand-written character; Output: which character?
[Figure: several handwritten samples that all map to the letter A]
– Input: a photograph of an object; Output: which of a set of categories of objects is it?
• E.g.: the Caltech 256 dataset
[Figure: photos labeled car tire, car tire, duck, laptop]
8
Example applications: Language
• Input: a news article; Output: which section of the newspaper should it be in?
• Input: an email; Output: which folder should the email be placed into?
• Input: an audio command given to a car; Output: which of a set of actions should be executed?
9
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
10
Binary to multiclass
• Can we use an algorithm for training binary classifiers to construct a multiclass classifier?
– Answer: Decompose the prediction into multiple binary decisions
• How to decompose?
– One-vs-all
– All-vs-all
– Error correcting codes
11
General setting
• Input 𝐱 ∈ ℜⁿ
– The inputs are represented by their feature vectors
• Output 𝐲 ∈ {1, 2, ⋯, K}
– These classes represent domain-specific labels
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)}
– Need a learning algorithm that uses D to construct a function that maps an input 𝐱 to a label 𝐲
– Goal: find a predictor that does well on the training data and has low generalization error
• Prediction/Inference: Given an example 𝐱 and the learned function, compute the class label for 𝐱
12
1. One-vs-all classification
𝐱 ∈ ℜⁿ, 𝐲 ∈ {1, 2, ⋯, K}
• Assumption: Each class is individually separable from all the others
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)}
– Decompose into K binary classification tasks
– For class k, construct a binary classification task as:
• Positive examples: Elements of D with label k
• Negative examples: All other elements of D
– Train K binary classifiers w1, w2, …, wK using any learning algorithm we have seen
• Prediction: “Winner Takes All”: argmaxᵢ 𝐰ᵢᵀ𝐱
13
Question: What is the dimensionality of each wi?
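A minimal sketch of the one-vs-all recipe, using a perceptron as the underlying binary learner (any binary algorithm from the course would do; the data is a toy example invented for illustration):

```python
import numpy as np

def train_perceptron(X, y, epochs=50):
    """Binary perceptron through the origin; y has entries +1/-1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake: update
                w += yi * xi
    return w

def train_one_vs_all(X, y, K):
    """Decompose into K binary tasks: class k positive, the rest negative."""
    return [train_perceptron(X, np.where(y == k, 1, -1)) for k in range(K)]

def predict_winner_takes_all(ws, x):
    """Prediction: argmax_i w_i^T x."""
    return int(np.argmax([w @ x for w in ws]))

# Toy 2-D data: three classes pointing in three well-separated directions
X = np.array([[2.0, 0.1], [1.8, -0.1],
              [-1.0, 1.7], [-0.9, 1.8],
              [-1.0, -1.7], [-1.1, -1.6]])
y = np.array([0, 0, 1, 1, 2, 2])
ws = train_one_vs_all(X, y, K=3)
```

Note that each wᵢ lives in the same space as the input (n-dimensional, or n+1 if a bias feature is appended), which answers the dimensionality question.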
Visualizing One-vs-all
From the full dataset, construct three binary classifiers, one for each class
[Figure: three separators, wblueᵀx > 0 for blue inputs, wredᵀx > 0 for red inputs, wgreenᵀx > 0 for green inputs. Notation: wblueᵀx is the score for the blue label]
Winner Takes All will predict the right answer. Only the correct label will have a positive score
17
One-vs-all may not always work
Black points are not separable with a single binary classifier
[Figure: wblueᵀx > 0 for blue inputs, wredᵀx > 0 for red inputs, wgreenᵀx > 0 for green inputs, but no classifier claims the black points]
The decomposition will not work for these cases!
23
One-vs-all classification: Summary
• Easy to learn
– Use any binary classifier learning algorithm
• Problems
– No theoretical justification
– Calibration issues
• We are comparing scores produced by K classifiers trained independently. No reason for the scores to be in the same numerical range!
– Might not always work
• Yet, works fairly well in many cases, especially if the underlying binary classifiers are tuned and regularized
24
2. All-vs-all classification
Sometimes called one-vs-one
𝐱 ∈ ℜⁿ, 𝐲 ∈ {1, 2, ⋯, K}
• Assumption: Every pair of classes is separable
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)},
– For every pair of labels (j, k), create a binary classifier with:
• Positive examples: All examples with label j
• Negative examples: All examples with label k
– Train C(K, 2) = K(K−1)/2 classifiers to separate every pair of labels from each other
25
2. All-vs-all classification
Sometimes called one-vs-one
𝐱 ∈ ℜⁿ, 𝐲 ∈ {1, 2, ⋯, K}
• Assumption: Every pair of classes is separable
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)},
– Train C(K, 2) = K(K−1)/2 classifiers to separate every pair of labels from each other
• Prediction: More complex, each label gets K−1 votes
– How to combine the votes? Many methods
• Majority: Pick the label with maximum votes
• Organize a tournament between the labels
27
All-vs-all classification
• Every pair of labels is linearly separable here
– When a pair of labels is considered, all others are ignored
• Problems
1. O(K²) weight vectors to train and store
2. The size of the training set for a pair of labels could be very small, leading to overfitting of the binary classifiers
3. Prediction is often ad hoc and might be unstable
E.g.: What if two classes get the same number of votes? For a tournament, what is the sequence in which the labels compete?
28
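The all-vs-all decomposition with majority voting can be sketched as follows (the same toy data and perceptron stand-in as before; any binary learner would do):

```python
import numpy as np
from itertools import combinations

def train_perceptron(X, y, epochs=50):
    """Binary perceptron through the origin; y has entries +1/-1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

def train_all_vs_all(X, y, K):
    """One classifier per pair (j, k): K(K-1)/2 in total."""
    classifiers = {}
    for j, k in combinations(range(K), 2):
        mask = (y == j) | (y == k)              # ignore all other labels
        labels = np.where(y[mask] == j, 1, -1)  # j positive, k negative
        classifiers[(j, k)] = train_perceptron(X[mask], labels)
    return classifiers

def predict_majority(classifiers, x, K):
    """Each label can collect up to K-1 votes; pick the label with the most."""
    votes = np.zeros(K, dtype=int)
    for (j, k), w in classifiers.items():
        votes[j if w @ x > 0 else k] += 1
    return int(np.argmax(votes))

# Toy 2-D data: three classes pointing in three well-separated directions
X = np.array([[2.0, 0.1], [1.8, -0.1],
              [-1.0, 1.7], [-0.9, 1.8],
              [-1.0, -1.7], [-1.1, -1.6]])
y = np.array([0, 0, 1, 1, 2, 2])
classifiers = train_all_vs_all(X, y, K=3)
```

Note that `np.argmax` breaks ties by picking the smallest index, which is exactly the kind of ad hoc tie-breaking the problems above point out.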
3. Error correcting output codes (ECOC)
• Each binary classifier provides one bit of information
• With K labels, we only need log₂K bits to represent the label
– One-vs-all uses K bits (one per classifier)
– All-vs-all uses O(K²) bits
• Can we get by with O(log K) classifiers?
– Yes! Encode each label as a binary string
– Or alternatively, if we do train more than O(log K) classifiers, can we use the redundancy to improve classification accuracy?
29
Using log₂K classifiers
• Learning:
– Represent each label by a bit string (i.e., its code)
– Train one binary classifier for each bit
• Prediction:
– Use the predictions from all the classifiers to create a log₂K-bit string that uniquely decides the output
• What could go wrong here?
– Even if one of the classifiers makes a mistake, the final prediction is wrong!
30
8 classes, code length = 3
label#  Code
0       0 0 0
1       0 0 1
2       0 1 0
3       0 1 1
4       1 0 0
5       1 0 1
6       1 1 0
7       1 1 1
Example: For some example, if the three classifiers predict 0, 1 and 1, then the label is 3
Error correcting output coding
Answer: Use redundancy
• Assign a binary string to each label
– Could be random
– The length of the code word, L ≥ log₂K, is a parameter
• Train one binary classifier for each bit
– Effectively, split the data into random dichotomies
– We need only log₂K bits
• Additional bits act as an error correcting code
33
8 classes, code length = 5
#  Code
0  0 0 0 0 0
1  0 0 1 1 0
2  0 1 0 1 1
3  0 1 1 0 1
4  1 0 0 1 1
5  1 0 1 0 0
6  1 1 0 0 0
7  1 1 1 1 1
How to predict?
• Prediction
– Run all L binary classifiers on the example
– Gives us a predicted bit string of length L
– Output = label whose code word is “closest” to the prediction
– Closest defined using Hamming distance
• Longer code length is better: better error correction
• Example
– Suppose the binary classifiers here predict 11010
– The closest label to this is 6, with code word 11000
34
8 classes, code length = 5
#  Code
0  0 0 0 0 0
1  0 0 1 1 0
2  0 1 0 1 1
3  0 1 1 0 1
4  1 0 0 1 1
5  1 0 1 0 0
6  1 1 0 0 0
7  1 1 1 1 1
One-vs-all is a special case of this scheme. How?
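Hamming-distance decoding with the 8-class, length-5 code from the table can be sketched in a few lines (the L trained binary classifiers are omitted; we start from their predicted bits):

```python
import numpy as np

# 8 classes, code length = 5: the code table from the slide
CODE = np.array([[0, 0, 0, 0, 0],
                 [0, 0, 1, 1, 0],
                 [0, 1, 0, 1, 1],
                 [0, 1, 1, 0, 1],
                 [1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0],
                 [1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1]])

def decode(bits):
    """Output = the label whose code word is closest in Hamming distance."""
    distances = np.sum(CODE != np.asarray(bits), axis=1)
    return int(np.argmin(distances))

print(decode([1, 1, 0, 1, 0]))   # code word 11000 is one bit flip away: label 6
```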
Error correcting codes: Discussion
• Assumes that the columns are independent
– Otherwise, ineffective encoding
• Strong theoretical results that depend on code length
– If the minimal Hamming distance between two rows is d, then the prediction can correct up to (d−1)/2 errors in the binary predictions
• Code assignment could be random, or designed for the dataset/task
• One-vs-all and all-vs-all are special cases
– All-vs-all needs a ternary code (not binary)
36
Exercise: Convince yourself that this is correct
Decomposition methods: Summary
• General idea
– Decompose the multiclass problem into many binary problems
– We know how to train binary classifiers
– Prediction depends on the decomposition
• Constructs the multiclass label from the output of the binary classifiers
• Learning optimizes local correctness
– Each binary classifier does not need to be globally correct
• That is, the classifiers do not have to agree with each other
– The learning algorithm is not even aware of the prediction procedure!
• Poor decomposition gives poor performance
– Difficult local problems, can be “unnatural”
• E.g. for ECOC, why should the binary problems be separable?
38
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
39
Motivation
• Decomposition methods
– Do not account for how the final predictor will be used
– Do not optimize any global measure of correctness
• Goal: To train a multiclass classifier that is “global”
40
Recall: Margin for binary classifiers
The margin of a hyperplane for a dataset: the distance between the hyperplane and the data point nearest to it
[Figure: positive (+) and negative (−) points on either side of a hyperplane; the margin is measured with respect to this hyperplane]
41
Multiclass margin
Defined as the score difference between the highest scoring label and the second one
[Figure: bar chart of the scores wlabelᵀx for the labels Blue, Red, Green, Black; the gap between the two highest bars is the multiclass margin]
42
Multiclass SVM (Intuition)
• Recall: Binary SVM
– Maximize margin
– Equivalently, minimize the norm of the weights such that the closest points to the hyperplane have a score ±1
• Multiclass SVM
– Each label has a different weight vector (like one-vs-all)
– Maximize multiclass margin
– Equivalently, minimize the total norm of the weights such that the true label is scored at least 1 more than the second best one
44
Multiclass SVM in the separable case
Recall the hard binary SVM. For the multiclass case:

min over 𝐰₁, ⋯, 𝐰_K of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ
such that for every (𝐱ᵢ, 𝐲ᵢ) ∈ D and every label k ≠ 𝐲ᵢ:
  w_{yᵢ}ᵀ xᵢ − w_kᵀ xᵢ ≥ 1

• The constraints: the score for the true label is higher than the score for any other label by 1
• The objective: the size of the weights. Effectively, a regularizer
Problems with this? What if there is no set of weights that achieves this separation? That is, what if the data is not linearly separable?
45
Multiclass SVM: General case

min over 𝐰₁, ⋯, 𝐰_K and ξ of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ + C Σᵢ ξᵢ
such that for every (𝐱ᵢ, 𝐲ᵢ) ∈ D and every label k ≠ 𝐲ᵢ:
  w_{yᵢ}ᵀ xᵢ − w_kᵀ xᵢ ≥ 1 − ξᵢ,  ξᵢ ≥ 0

• The constraints: the score for the true label is higher than the score for any other label by 1 − ξᵢ
• ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ: the size of the weights. Effectively, a regularizer
• ξᵢ: slack variables. Not all examples need to satisfy the margin constraint
• C Σᵢ ξᵢ: total slack. Don’t allow too many examples to violate the margin constraint
• ξᵢ ≥ 0: slack variables can only be positive
52
Multiclass SVM: General case
Solving the constrained problem above is equivalent to solving

min over 𝐰₁, ⋯, 𝐰_K of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ + C Σ_{(𝐱ᵢ, 𝐲ᵢ) ∈ D} max(0, max_{k ≠ 𝐲ᵢ} w_kᵀ xᵢ − w_{yᵢ}ᵀ xᵢ + 1)

Why?
57
Multiclass SVM: General case

min over 𝐰₁, ⋯, 𝐰_K of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ + C Σ_{(𝐱ᵢ, 𝐲ᵢ) ∈ D} max(0, max_{k ≠ 𝐲ᵢ} w_kᵀ xᵢ − w_{yᵢ}ᵀ xᵢ + 1)

• First term: the size of the weights. Effectively, a regularizer
• Second term: the multiclass hinge loss
• C: the tradeoff hyperparameter
58
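The unconstrained objective can be evaluated directly. A sketch of the multiclass hinge loss and the full objective, with hypothetical weights W (one row per label):

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """max(0, max_{k != y} w_k^T x - w_y^T x + 1): zero only when the
    true label's score beats every other score by at least 1."""
    scores = W @ x                  # one score per label
    others = np.delete(scores, y)   # scores of all labels k != y
    return max(0.0, np.max(others) - scores[y] + 1.0)

def svm_objective(W, X, y, C=1.0):
    """(1/2) sum_j w_j^T w_j + C * sum_i hinge(W, x_i, y_i)."""
    regularizer = 0.5 * np.sum(W * W)
    total_loss = sum(multiclass_hinge(W, xi, yi) for xi, yi in zip(X, y))
    return regularizer + C * total_loss

W = np.array([[2.0, 0.0],
              [0.0, 2.0]])    # hypothetical weights: K = 2 labels, n = 2
x = np.array([1.0, 0.0])      # scores are (2, 0)
print(multiclass_hinge(W, x, 0))   # true label wins by 2 >= 1, so 0.0
print(multiclass_hinge(W, x, 1))   # violation: 2 - 0 + 1 = 3.0
```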
Multiclass SVM
• Generalizes the binary SVM algorithm
– If we have only two classes, this reduces to the binary SVM (up to scale)
• Comes with similar generalization guarantees as the binary SVM
• Can be trained using different optimization methods
– Stochastic sub-gradient descent can be generalized
• Try as an exercise
61
Multiclass SVM: Summary
• Training:
– Optimize the SVM objective
• Prediction:
– Winner takes all: argmaxᵢ wᵢᵀx
• With K labels and inputs in ℜⁿ, we have nK weights in all
– Same as one-vs-all
– But comes with guarantees!
62
Questions?
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
63
Let us examine one-vs-all again
• Training:
– Create K binary classifiers w1, w2, …, wK
– wi separates class i from all others
• Prediction: argmaxᵢ wᵢᵀx
• Observations:
1. At training time, we require wᵢᵀx to be positive for examples of class i.
2. Really, all we need is for wᵢᵀx to be larger than all the others. The requirement of being positive is more strict
64
Linear separability with multiple classes
Rewrite the inputs and weight vector
• Stack all weight vectors into an nK-dimensional vector w
• Define a feature vector for label i being associated to input x: φ(x, i) has x in the iᵗʰ block, zeros everywhere else
For examples with label i, we want wᵢᵀx > wⱼᵀx for all j
This is called the Kesler construction
65
Linear separability with multiple classes
For examples with label i, we want wᵢᵀx > wⱼᵀx for all j
Equivalent requirement: wᵀφ(x, i) > wᵀφ(x, j), where φ(x, i) has x in the iᵗʰ block and zeros everywhere else
Or: wᵀ(φ(x, i) − φ(x, j)) > 0
67
Linear separability with multiple classes
For examples with label i, we want wᵢᵀx > wⱼᵀx for all j. Or equivalently: wᵀ(φ(x, i) − φ(x, j)) > 0
That is, the following binary task in nK dimensions should be linearly separable.
For every example (x, i) in the dataset, and all other labels j:
• Positive examples: φ(x, i) − φ(x, j)
• Negative examples: φ(x, j) − φ(x, i)
69
Constraint Classification
• Training:
– Given a dataset {(x, y)}, create a binary classification task
• Positive examples: φ(x, y) − φ(x, y′)
• Negative examples: φ(x, y′) − φ(x, y)
for every example, for every y′ ≠ y
– Use your favorite algorithm to train a binary classifier
• Prediction: Given an nK-dimensional weight vector w and a new example x:
argmax_y wᵀφ(x, y)
70
Exercise: What does the perceptron update rule look like in terms of the φs? Interpret the update step
Note: The binary classification task only expresses preferences over label assignments
This approach extends to training a ranker; it can use partial preferences too. More on this later…
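A sketch of the construction: build the preference vectors φ(x, y) − φ(x, y′) and run an ordinary binary perceptron on them in the stacked nK-dimensional space (toy data invented for illustration):

```python
import numpy as np

def phi(x, label, K):
    """Kesler construction: x in the label-th block, zeros everywhere else."""
    n = len(x)
    v = np.zeros(n * K)
    v[label * n:(label + 1) * n] = x
    return v

def train_constraint_perceptron(X, y, K, epochs=50):
    """Binary perceptron on the preference vectors phi(x, y) - phi(x, y')."""
    w = np.zeros(X.shape[1] * K)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            for other in range(K):
                if other == yi:
                    continue
                diff = phi(xi, yi, K) - phi(xi, other, K)  # positive example
                if w @ diff <= 0:        # preference violated: update
                    w += diff
    return w

def predict(w, x, K):
    """argmax_y w^T phi(x, y)."""
    return int(np.argmax([w @ phi(x, label, K) for label in range(K)]))

# Toy 2-D data: three classes pointing in three well-separated directions
X = np.array([[2.0, 0.1], [1.8, -0.1],
              [-1.0, 1.7], [-0.9, 1.8],
              [-1.0, -1.7], [-1.1, -1.6]])
y = np.array([0, 0, 1, 1, 2, 2])
w = train_constraint_perceptron(X, y, K=3)
```

Written out block by block, the update w += φ(x, y) − φ(x, y′) adds x to the true label's block and subtracts it from the violating label's block: one way to start on the perceptron exercise above.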
A second look at the multiclass margin
Defined as the score difference between the highest scoring label and the second one
[Figure: bar chart of the scores for the labels Blue, Red, Green, Black; the gap between the two highest bars is the multiclass margin, written in terms of the Kesler construction. Here y is the label that has the highest score]
74
Discussion
• The number of weights for multiclass SVM and constraint classification is still the same as for one-vs-all, much less than the K(K−1)/2 of all-vs-all
• But both still account for all pairwise label preferences
– Multiclass SVM via the definition of the learning objective
– Constraint classification by constructing a binary classification problem
• Both come with theoretical guarantees for generalization
• Important idea that is applicable when we move to arbitrary structures
76
Questions?
Training multiclass classifiers: Wrap-up
• The label belongs to a set that has more than two elements
• Methods
– Decomposition into a collection of binary (local) decisions
• One-vs-all
• All-vs-all
• Error correcting codes
– Training a single (global) classifier
• Multiclass SVM
• Constraint classification
• Exercise: Which of these will work for this case?
77
Questions?
Next steps…
• Build up to structured prediction
– Multiclass is really a simple structure
• Different aspects of structured prediction
– Deciding the structure, training, inference
• Sequence models
78