From Binary to Multiclass Classification · 2020. 1. 14.
CS6355: Structured Prediction
From Binary to Multiclass Classification
1
We have seen binary classification
• We have seen linear models
• Learning algorithms
– Perceptron
– SVM
– Logistic Regression
• Prediction is simple
– Given an example 𝐱, output = sgn(𝐰ᵀ𝐱)
– Output is a single bit
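As a quick refresher, the binary prediction rule is a one-liner (the weights and example below are hypothetical, invented for illustration):

```python
import numpy as np

def predict_binary(w, x):
    """Binary linear prediction: a single bit, sgn(w^T x)."""
    return 1 if w @ x >= 0 else -1

w = np.array([1.0, -2.0, 0.5])   # hypothetical learned weights
x = np.array([0.2, 0.1, 1.0])    # hypothetical example
print(predict_binary(w, x))      # w @ x = 0.2 - 0.2 + 0.5 = 0.5, so +1
```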
2
What if we have more than two labels?
3
Reading for next lecture:
Erin L. Allwein, Robert E. Schapire, Yoram Singer, Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, ICML 2000.
4
Multiclass classification
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
5
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
6
What is multiclass classification?
• An input can belong to one of K classes
• Training data: examples associated with a class label (a number from 1 to K)
• Prediction: Given a new input, predict the class label
• Each input belongs to exactly one class. Not more, not less.
– Otherwise, the problem is not multiclass classification
• If an input can be assigned multiple labels (think tags for emails rather than folders), it is called multi-label classification
7
Example applications: Images
– Input: hand-written character; Output: which character?
[Figure: several handwritten samples that all map to the letter A]
– Input: a photograph of an object; Output: which of a set of categories of objects is it?
• E.g.: the Caltech 256 dataset
[Figure: photos labeled car tire, car tire, duck, laptop]
8
Example applications: Language
• Input: a news article; Output: which section of the newspaper should it be in?
• Input: an email; Output: which folder should the email be placed into?
• Input: an audio command given to a car; Output: which of a set of actions should be executed?
9
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
10
Binary to multiclass
• Can we use an algorithm for training binary classifiers to construct a multiclass classifier?
– Answer: Decompose the prediction into multiple binary decisions
• How to decompose?
– One-vs-all
– All-vs-all
– Error correcting codes
11
General setting
• Input 𝐱 ∈ ℜⁿ
– The inputs are represented by their feature vectors
• Output 𝐲 ∈ {1, 2, ⋯, K}
– These classes represent domain-specific labels
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)}
– Need a learning algorithm that uses D to construct a function that maps an input 𝐱 to a label 𝐲
– Goal: find a predictor that does well on the training data and has low generalization error
• Prediction/Inference: Given an example 𝐱 and the learned function, compute the class label for 𝐱
12
1. One-vs-all classification
𝐱 ∈ ℜⁿ, 𝐲 ∈ {1, 2, ⋯, K}
• Assumption: Each class is individually separable from all the others
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)}
– Decompose into K binary classification tasks
– For class k, construct a binary classification task as:
• Positive examples: Elements of D with label k
• Negative examples: All other elements of D
– Train K binary classifiers w1, w2, …, wK using any learning algorithm we have seen
• Prediction: “Winner Takes All”: argmaxᵢ 𝐰ᵢᵀ𝐱
13
Question: What is the dimensionality of each wi?
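A minimal sketch of the one-vs-all recipe, using a perceptron as the underlying binary learner (any binary algorithm from the course would do; the data is a toy example invented for illustration):

```python
import numpy as np

def train_perceptron(X, y, epochs=50):
    """Binary perceptron through the origin; y has entries +1/-1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake: update
                w += yi * xi
    return w

def train_one_vs_all(X, y, K):
    """Decompose into K binary tasks: class k positive, the rest negative."""
    return [train_perceptron(X, np.where(y == k, 1, -1)) for k in range(K)]

def predict_winner_takes_all(ws, x):
    """Prediction: argmax_i w_i^T x."""
    return int(np.argmax([w @ x for w in ws]))

# Toy 2-D data: three classes pointing in three well-separated directions
X = np.array([[2.0, 0.1], [1.8, -0.1],
              [-1.0, 1.7], [-0.9, 1.8],
              [-1.0, -1.7], [-1.1, -1.6]])
y = np.array([0, 0, 1, 1, 2, 2])
ws = train_one_vs_all(X, y, K=3)
```

Note that each wᵢ lives in the same space as the input (n-dimensional, or n+1 if a bias feature is appended), which answers the dimensionality question.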
Visualizing One-vs-all
From the full dataset, construct three binary classifiers, one for each class
[Figure: three separators, wblueᵀx > 0 for blue inputs, wredᵀx > 0 for red inputs, wgreenᵀx > 0 for green inputs. Notation: wblueᵀx is the score for the blue label]
Winner Takes All will predict the right answer. Only the correct label will have a positive score
17
One-vs-all may not always work
Black points are not separable with a single binary classifier
[Figure: wblueᵀx > 0 for blue inputs, wredᵀx > 0 for red inputs, wgreenᵀx > 0 for green inputs, but no classifier claims the black points]
The decomposition will not work for these cases!
23
One-vs-all classification: Summary
• Easy to learn
– Use any binary classifier learning algorithm
• Problems
– No theoretical justification
– Calibration issues
• We are comparing scores produced by K classifiers trained independently. No reason for the scores to be in the same numerical range!
– Might not always work
• Yet, works fairly well in many cases, especially if the underlying binary classifiers are tuned and regularized
24
2. All-vs-all classification
Sometimes called one-vs-one
𝐱 ∈ ℜⁿ, 𝐲 ∈ {1, 2, ⋯, K}
• Assumption: Every pair of classes is separable
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)},
– For every pair of labels (j, k), create a binary classifier with:
• Positive examples: All examples with label j
• Negative examples: All examples with label k
– Train C(K, 2) = K(K−1)/2 classifiers to separate every pair of labels from each other
25
2. All-vs-all classification
Sometimes called one-vs-one
𝐱 ∈ ℜⁿ, 𝐲 ∈ {1, 2, ⋯, K}
• Assumption: Every pair of classes is separable
• Learning: Given a dataset D = {(𝐱ᵢ, 𝐲ᵢ)},
– Train C(K, 2) = K(K−1)/2 classifiers to separate every pair of labels from each other
• Prediction: More complex, each label gets K−1 votes
– How to combine the votes? Many methods
• Majority: Pick the label with maximum votes
• Organize a tournament between the labels
27
All-vs-all classification
• Every pair of labels is linearly separable here
– When a pair of labels is considered, all others are ignored
• Problems
1. O(K²) weight vectors to train and store
2. The size of the training set for a pair of labels could be very small, leading to overfitting of the binary classifiers
3. Prediction is often ad hoc and might be unstable
E.g.: What if two classes get the same number of votes? For a tournament, what is the sequence in which the labels compete?
28
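The all-vs-all decomposition with majority voting can be sketched as follows (the same toy data and perceptron stand-in as before; any binary learner would do):

```python
import numpy as np
from itertools import combinations

def train_perceptron(X, y, epochs=50):
    """Binary perceptron through the origin; y has entries +1/-1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

def train_all_vs_all(X, y, K):
    """One classifier per pair (j, k): K(K-1)/2 in total."""
    classifiers = {}
    for j, k in combinations(range(K), 2):
        mask = (y == j) | (y == k)              # ignore all other labels
        labels = np.where(y[mask] == j, 1, -1)  # j positive, k negative
        classifiers[(j, k)] = train_perceptron(X[mask], labels)
    return classifiers

def predict_majority(classifiers, x, K):
    """Each label can collect up to K-1 votes; pick the label with the most."""
    votes = np.zeros(K, dtype=int)
    for (j, k), w in classifiers.items():
        votes[j if w @ x > 0 else k] += 1
    return int(np.argmax(votes))

# Toy 2-D data: three classes pointing in three well-separated directions
X = np.array([[2.0, 0.1], [1.8, -0.1],
              [-1.0, 1.7], [-0.9, 1.8],
              [-1.0, -1.7], [-1.1, -1.6]])
y = np.array([0, 0, 1, 1, 2, 2])
classifiers = train_all_vs_all(X, y, K=3)
```

Note that `np.argmax` breaks ties by picking the smallest index, which is exactly the kind of ad hoc tie-breaking the problems above point out.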
3. Error correcting output codes (ECOC)
• Each binary classifier provides one bit of information
• With K labels, we only need log₂K bits to represent the label
– One-vs-all uses K bits (one per classifier)
– All-vs-all uses O(K²) bits
• Can we get by with O(log K) classifiers?
– Yes! Encode each label as a binary string
– Or alternatively, if we do train more than O(log K) classifiers, can we use the redundancy to improve classification accuracy?
29
Using log₂K classifiers
• Learning:
– Represent each label by a bit string (i.e., its code)
– Train one binary classifier for each bit
• Prediction:
– Use the predictions from all the classifiers to create a log₂K-bit string that uniquely decides the output
• What could go wrong here?
– Even if one of the classifiers makes a mistake, the final prediction is wrong!
30
8 classes, code length = 3
label#  Code
0       0 0 0
1       0 0 1
2       0 1 0
3       0 1 1
4       1 0 0
5       1 0 1
6       1 1 0
7       1 1 1
Example: For some example, if the three classifiers predict 0, 1 and 1, then the label is 3
Error correcting output coding
Answer: Use redundancy
• Assign a binary string to each label
– Could be random
– The length of the code word, L ≥ log₂K, is a parameter
• Train one binary classifier for each bit
– Effectively, split the data into random dichotomies
– We need only log₂K bits
• Additional bits act as an error correcting code
33
8 classes, code length = 5
#  Code
0  0 0 0 0 0
1  0 0 1 1 0
2  0 1 0 1 1
3  0 1 1 0 1
4  1 0 0 1 1
5  1 0 1 0 0
6  1 1 0 0 0
7  1 1 1 1 1
How to predict?
• Prediction
– Run all L binary classifiers on the example
– Gives us a predicted bit string of length L
– Output = label whose code word is “closest” to the prediction
– Closest defined using Hamming distance
• Longer code length is better: better error correction
• Example
– Suppose the binary classifiers here predict 11010
– The closest label to this is 6, with code word 11000
34
8 classes, code length = 5
#  Code
0  0 0 0 0 0
1  0 0 1 1 0
2  0 1 0 1 1
3  0 1 1 0 1
4  1 0 0 1 1
5  1 0 1 0 0
6  1 1 0 0 0
7  1 1 1 1 1
One-vs-all is a special case of this scheme. How?
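Hamming-distance decoding with the 8-class, length-5 code from the table can be sketched in a few lines (the L trained binary classifiers are omitted; we start from their predicted bits):

```python
import numpy as np

# 8 classes, code length = 5: the code table from the slide
CODE = np.array([[0, 0, 0, 0, 0],
                 [0, 0, 1, 1, 0],
                 [0, 1, 0, 1, 1],
                 [0, 1, 1, 0, 1],
                 [1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0],
                 [1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1]])

def decode(bits):
    """Output = the label whose code word is closest in Hamming distance."""
    distances = np.sum(CODE != np.asarray(bits), axis=1)
    return int(np.argmin(distances))

print(decode([1, 1, 0, 1, 0]))   # code word 11000 is one bit flip away: label 6
```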
Error correcting codes: Discussion
• Assumes that the columns are independent
– Otherwise, ineffective encoding
• Strong theoretical results that depend on code length
– If the minimal Hamming distance between two rows is d, then the prediction can correct up to (d−1)/2 errors in the binary predictions
• Code assignment could be random, or designed for the dataset/task
• One-vs-all and all-vs-all are special cases
– All-vs-all needs a ternary code (not binary)
36
Exercise: Convince yourself that this is correct
Decomposition methods: Summary
• General idea
– Decompose the multiclass problem into many binary problems
– We know how to train binary classifiers
– Prediction depends on the decomposition
• Constructs the multiclass label from the output of the binary classifiers
• Learning optimizes local correctness
– Each binary classifier does not need to be globally correct
• That is, the classifiers do not have to agree with each other
– The learning algorithm is not even aware of the prediction procedure!
• Poor decomposition gives poor performance
– Difficult local problems, can be “unnatural”
• E.g. for ECOC, why should the binary problems be separable?
38
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
39
Motivation
• Decomposition methods
– Do not account for how the final predictor will be used
– Do not optimize any global measure of correctness
• Goal: To train a multiclass classifier that is “global”
40
Recall: Margin for binary classifiers
The margin of a hyperplane for a dataset: the distance between the hyperplane and the data point nearest to it
[Figure: positive (+) and negative (−) points on either side of a hyperplane; the margin is measured with respect to this hyperplane]
41
Multiclass margin
Defined as the score difference between the highest scoring label and the second one
[Figure: bar chart of the scores wlabelᵀx for the labels Blue, Red, Green, Black; the gap between the two highest bars is the multiclass margin]
42
Multiclass SVM (Intuition)
• Recall: Binary SVM
– Maximize margin
– Equivalently, minimize the norm of the weights such that the closest points to the hyperplane have a score ±1
• Multiclass SVM
– Each label has a different weight vector (like one-vs-all)
– Maximize multiclass margin
– Equivalently, minimize the total norm of the weights such that the true label is scored at least 1 more than the second best one
44
Multiclass SVM in the separable case
Recall the hard binary SVM. For the multiclass case:

min over 𝐰₁, ⋯, 𝐰_K of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ
such that for every (𝐱ᵢ, 𝐲ᵢ) ∈ D and every label k ≠ 𝐲ᵢ:
  w_{yᵢ}ᵀ xᵢ − w_kᵀ xᵢ ≥ 1

• The constraints: the score for the true label is higher than the score for any other label by 1
• The objective: the size of the weights. Effectively, a regularizer
Problems with this? What if there is no set of weights that achieves this separation? That is, what if the data is not linearly separable?
45
Multiclass SVM: General case

min over 𝐰₁, ⋯, 𝐰_K and ξ of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ + C Σᵢ ξᵢ
such that for every (𝐱ᵢ, 𝐲ᵢ) ∈ D and every label k ≠ 𝐲ᵢ:
  w_{yᵢ}ᵀ xᵢ − w_kᵀ xᵢ ≥ 1 − ξᵢ,  ξᵢ ≥ 0

• The constraints: the score for the true label is higher than the score for any other label by 1 − ξᵢ
• ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ: the size of the weights. Effectively, a regularizer
• ξᵢ: slack variables. Not all examples need to satisfy the margin constraint
• C Σᵢ ξᵢ: total slack. Don’t allow too many examples to violate the margin constraint
• ξᵢ ≥ 0: slack variables can only be positive
52
Multiclass SVM: General case
Solving the constrained problem above is equivalent to solving

min over 𝐰₁, ⋯, 𝐰_K of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ + C Σ_{(𝐱ᵢ, 𝐲ᵢ) ∈ D} max(0, max_{k ≠ 𝐲ᵢ} w_kᵀ xᵢ − w_{yᵢ}ᵀ xᵢ + 1)

Why?
57
Multiclass SVM: General case

min over 𝐰₁, ⋯, 𝐰_K of  ½ Σⱼ 𝐰ⱼᵀ𝐰ⱼ + C Σ_{(𝐱ᵢ, 𝐲ᵢ) ∈ D} max(0, max_{k ≠ 𝐲ᵢ} w_kᵀ xᵢ − w_{yᵢ}ᵀ xᵢ + 1)

• First term: the size of the weights. Effectively, a regularizer
• Second term: the multiclass hinge loss
• C: the tradeoff hyperparameter
58
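The unconstrained objective can be evaluated directly. A sketch of the multiclass hinge loss and the full objective, with hypothetical weights W (one row per label):

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """max(0, max_{k != y} w_k^T x - w_y^T x + 1): zero only when the
    true label's score beats every other score by at least 1."""
    scores = W @ x                  # one score per label
    others = np.delete(scores, y)   # scores of all labels k != y
    return max(0.0, np.max(others) - scores[y] + 1.0)

def svm_objective(W, X, y, C=1.0):
    """(1/2) sum_j w_j^T w_j + C * sum_i hinge(W, x_i, y_i)."""
    regularizer = 0.5 * np.sum(W * W)
    total_loss = sum(multiclass_hinge(W, xi, yi) for xi, yi in zip(X, y))
    return regularizer + C * total_loss

W = np.array([[2.0, 0.0],
              [0.0, 2.0]])    # hypothetical weights: K = 2 labels, n = 2
x = np.array([1.0, 0.0])      # scores are (2, 0)
print(multiclass_hinge(W, x, 0))   # true label wins by 2 >= 1, so 0.0
print(multiclass_hinge(W, x, 1))   # violation: 2 - 0 + 1 = 3.0
```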
Multiclass SVM
• Generalizes the binary SVM algorithm
– If we have only two classes, this reduces to the binary SVM (up to scale)
• Comes with similar generalization guarantees as the binary SVM
• Can be trained using different optimization methods
– Stochastic sub-gradient descent can be generalized
• Try as an exercise
61
Multiclass SVM: Summary
• Training:
– Optimize the SVM objective
• Prediction:
– Winner takes all: argmaxᵢ wᵢᵀx
• With K labels and inputs in ℜⁿ, we have nK weights in all
– Same as one-vs-all
– But comes with guarantees!
62
Questions?
Where are we?
• Introduction
• Combining binary classifiers
– One-vs-all
– All-vs-all
– Error correcting codes
• Training a single classifier
– Multiclass SVM
– Constraint classification
63
Let us examine one-vs-all again
• Training:
– Create K binary classifiers w1, w2, …, wK
– wi separates class i from all others
• Prediction: argmaxᵢ wᵢᵀx
• Observations:
1. At training time, we require wᵢᵀx to be positive for examples of class i.
2. Really, all we need is for wᵢᵀx to be larger than all the others. The requirement of being positive is more strict
64
Linear separability with multiple classes
Rewrite the inputs and weight vector
• Stack all weight vectors into an nK-dimensional vector w
• Define a feature vector for label i being associated to input x: φ(x, i) has x in the iᵗʰ block, zeros everywhere else
For examples with label i, we want wᵢᵀx > wⱼᵀx for all j
This is called the Kesler construction
65
Linear separability with multiple classes
For examples with label i, we want wᵢᵀx > wⱼᵀx for all j
Equivalent requirement: wᵀφ(x, i) > wᵀφ(x, j), where φ(x, i) has x in the iᵗʰ block and zeros everywhere else
Or: wᵀ(φ(x, i) − φ(x, j)) > 0
67
Linear separability with multiple classes
For examples with label i, we want wᵢᵀx > wⱼᵀx for all j. Or equivalently: wᵀ(φ(x, i) − φ(x, j)) > 0
That is, the following binary task in nK dimensions should be linearly separable.
For every example (x, i) in the dataset, and all other labels j:
• Positive examples: φ(x, i) − φ(x, j)
• Negative examples: φ(x, j) − φ(x, i)
69
Constraint Classification
• Training:
– Given a dataset {(x, y)}, create a binary classification task
• Positive examples: φ(x, y) − φ(x, y′)
• Negative examples: φ(x, y′) − φ(x, y)
for every example, for every y′ ≠ y
– Use your favorite algorithm to train a binary classifier
• Prediction: Given an nK-dimensional weight vector w and a new example x:
argmax_y wᵀφ(x, y)
70
Exercise: What does the perceptron update rule look like in terms of the φs? Interpret the update step
Note: The binary classification task only expresses preferences over label assignments
This approach extends to training a ranker; it can use partial preferences too. More on this later…
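A sketch of the construction: build the preference vectors φ(x, y) − φ(x, y′) and run an ordinary binary perceptron on them in the stacked nK-dimensional space (toy data invented for illustration):

```python
import numpy as np

def phi(x, label, K):
    """Kesler construction: x in the label-th block, zeros everywhere else."""
    n = len(x)
    v = np.zeros(n * K)
    v[label * n:(label + 1) * n] = x
    return v

def train_constraint_perceptron(X, y, K, epochs=50):
    """Binary perceptron on the preference vectors phi(x, y) - phi(x, y')."""
    w = np.zeros(X.shape[1] * K)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            for other in range(K):
                if other == yi:
                    continue
                diff = phi(xi, yi, K) - phi(xi, other, K)  # positive example
                if w @ diff <= 0:        # preference violated: update
                    w += diff
    return w

def predict(w, x, K):
    """argmax_y w^T phi(x, y)."""
    return int(np.argmax([w @ phi(x, label, K) for label in range(K)]))

# Toy 2-D data: three classes pointing in three well-separated directions
X = np.array([[2.0, 0.1], [1.8, -0.1],
              [-1.0, 1.7], [-0.9, 1.8],
              [-1.0, -1.7], [-1.1, -1.6]])
y = np.array([0, 0, 1, 1, 2, 2])
w = train_constraint_perceptron(X, y, K=3)
```

Written out block by block, the update w += φ(x, y) − φ(x, y′) adds x to the true label's block and subtracts it from the violating label's block: one way to start on the perceptron exercise above.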
A second look at the multiclass margin
Defined as the score difference between the highest scoring label and the second one
[Figure: bar chart of the scores for the labels Blue, Red, Green, Black; the gap between the two highest bars is the multiclass margin, written in terms of the Kesler construction. Here y is the label that has the highest score]
74
Discussion
• The number of weights for multiclass SVM and constraint classification is still the same as for one-vs-all, much less than the K(K−1)/2 of all-vs-all
• But both still account for all pairwise label preferences
– Multiclass SVM via the definition of the learning objective
– Constraint classification by constructing a binary classification problem
• Both come with theoretical guarantees for generalization
• Important idea that is applicable when we move to arbitrary structures
76
Questions?
Training multiclass classifiers: Wrap-up
• The label belongs to a set that has more than two elements
• Methods
– Decomposition into a collection of binary (local) decisions
• One-vs-all
• All-vs-all
• Error correcting codes
– Training a single (global) classifier
• Multiclass SVM
• Constraint classification
• Exercise: Which of these will work for this case?
77
Questions?
Next steps…
• Build up to structured prediction
– Multiclass is really a simple structure
• Different aspects of structured prediction
– Deciding the structure, training, inference
• Sequence models
78