Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks (slides, 2020-09-20)

Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum

Structured Prediction

• We are interested in learning a function f: X → Y
• X: input variables
• Y: output variables

• We can define f as: f(x) = argmin_y E(x, y)
• For a Gibbs distribution: P(y | x) = exp(-E(x, y)) / Z(x)

Structured Prediction Energy Networks (SPENs)

• If E is parameterized using a differentiable model such as a deep neural network, we can find a local minimum of E using gradient descent

• The energy network expresses the correlation among input and output variables.
• Traditionally, graphical models are used to represent the correlation among output variables.
• Inference is intractable for most expressive graphical models.
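As a toy illustration of gradient-descent inference (not the paper's network), consider a quadratic energy over a relaxed continuous output: following the negative energy gradient converges to the minimizer. All names and shapes here are illustrative.

```python
import numpy as np

def energy(y, W, b):
    """Toy quadratic energy E(y) = 0.5 * y^T W y + b^T y."""
    return 0.5 * y @ W @ y + b @ y

def grad_energy(y, W, b):
    """Gradient of the toy energy with respect to y."""
    return 0.5 * (W + W.T) @ y + b

def gradient_descent_inference(W, b, dim, steps=200, lr=0.1):
    """Find a local minimum of E over a relaxed continuous y."""
    y = np.zeros(dim)
    for _ in range(steps):
        y = y - lr * grad_energy(y, W, b)
    return y

# With W = I and b = -c, the minimizer is y* = c.
W = np.eye(2)
b = -np.array([1.0, 2.0])
y_star = gradient_descent_inference(W, b, dim=2)
```

For a deep energy network the same loop applies, with the gradient obtained by backpropagating E with respect to the (relaxed) output variables.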

Energy Models

[picture from Belanger (2016)]

[picture from Altinel (2018)]

Training SPENs

• Structural SVM (Belanger and McCallum, 2016)
• End-to-End (Belanger et al., 2017)
• Value-based training (Gygli et al., 2017)
• Inference Network (Lifu Tu and Kevin Gimpel, 2018)
• Rank-Based Training (Rooshenas et al., 2018)

Indirect Supervision

• Data annotation is expensive, especially for structured outputs.
• Domain knowledge can serve as the source of supervision.

• It can be written as a reward function R(x, y)
• R evaluates a pair of input and output configurations into a scalar value
• For a given x, we are looking for the best y that maximizes R(x, y)


Search-Guided Training

• We have a reward function that provides indirect supervision.
• We want to learn a smooth version of the reward function such that we can use gradient-descent inference at test time.
• We sample points (y0, y1, …, y5 in the animation) from the energy function using noisy gradient-descent inference.
• Then we project each sample to the domain of the reward function (the sample is a point in the simplex, but the domain of the reward function is often discrete, i.e., the vertices of the simplex).
• Then the search procedure uses the sample as input and returns an output structure by searching the reward function.
• We expect the two points to have the same ranking on the reward function and on the negative of the energy function.
• When we find a pair of points that violates the ranking constraints, we update the energy function towards reducing the violation.
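The steps above can be sketched end to end on a toy problem. The energy here is linear and the search operator is a greedy bit-flip pass; both are stand-ins for the paper's deep energy network and task-specific search procedure, and every name below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(y, y_true):
    """Task-loss reward: negative Hamming distance to the ground truth."""
    return -np.abs(y - y_true).sum()

def energy(y, w):
    """Toy linear energy E(x, y) = -w . y; the paper uses a deep network."""
    return -w @ y

def noisy_gd_sample(w, dim, steps=20, lr=0.5, noise=0.3):
    """Sample a relaxed point from the energy via noisy gradient descent."""
    y = rng.uniform(0, 1, dim)
    for _ in range(steps):
        grad = -w  # dE/dy for the linear toy energy
        y = np.clip(y - lr * (grad + noise * rng.normal(size=dim)), 0, 1)
    return y

def search(y_start, y_true):
    """Illustrative search operator: one greedy pass of bit flips that
    increase the reward (stands in for the paper's search procedure)."""
    y = y_start.copy()
    for i in range(len(y)):
        cand = y.copy()
        cand[i] = 1 - cand[i]
        if reward(cand, y_true) > reward(y, y_true):
            y = cand
    return y

def train(y_true, iters=300, lr=0.1, margin=1.0):
    dim = len(y_true)
    w = np.zeros(dim)
    for _ in range(iters):
        y_sample = noisy_gd_sample(w, dim)
        y_proj = (y_sample > 0.5).astype(float)  # project to a vertex
        y_better = search(y_proj, y_true)        # search the reward
        if reward(y_better, y_true) > reward(y_proj, y_true):
            # ranking violation: -E must rank y_better above y_proj
            violation = margin - (energy(y_proj, w) - energy(y_better, w))
            if violation > 0:
                w += lr * (y_better - y_proj)    # reduce the violation
    return w

y_true = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
w = train(y_true)
y_pred = (w > 0).astype(float)  # endpoint of gradient-descent inference
```

After training, gradient-descent inference on the learned energy recovers the high-reward structure, which is exactly the property the ranking updates enforce.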

Task-Loss as Reward Function for Multi-Label Classification

• The simplest form of indirect supervision is to use the task loss as the reward function:
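As a sketch, one plausible task-loss reward for multi-label classification is the F1 score between a predicted binary label vector and the ground truth (the specific loss is an assumption here; the slide does not fix one).

```python
def f1_reward(y_pred, y_true):
    """Task-loss reward: F1 between predicted and true binary labels."""
    tp = sum(p and t for p, t in zip(y_pred, y_true))          # true positives
    fp = sum(p and not t for p, t in zip(y_pred, y_true))      # false positives
    fn = sum(t and not p for p, t in zip(y_pred, y_true))      # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Any such reward only needs to score a full output configuration; it never needs to be differentiable, since the energy network is what gets trained.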

Domain Knowledge as Reward Function for Citation Field Extraction

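A domain-knowledge reward for citation field extraction could score a tagging against hand-written rules. The rules, names, and weights below are purely hypothetical illustrations, not the paper's actual constraints.

```python
def citation_reward(tokens, tags):
    """Hypothetical rule-based reward over a token/tag sequence."""
    score = 0.0
    # Rule: a four-digit number tagged as a date is rewarded.
    for tok, tag in zip(tokens, tags):
        if tok.isdigit() and len(tok) == 4 and tag == "date":
            score += 1.0
    # Rule: author fields tend to appear before the title.
    if "author" in tags and "title" in tags:
        if tags.index("author") < tags.index("title"):
            score += 1.0
    # Rule: a period usually ends a field, so penalize a tag that
    # continues across one.
    for i, tok in enumerate(tokens[:-1]):
        if tok.endswith(".") and tags[i] == tags[i + 1]:
            score -= 0.5
    return score
```

Such a reward needs no labeled citations at all; it only encodes regularities a domain expert already knows.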

Energy Model

[Figure: the energy network for citation field extraction. Token embeddings (e.g. for "Wei Li. Deep Learning for …") and the tag distribution (over author, title, …) feed a convolutional layer with multiple filters and different window sizes; max pooling and concatenation are followed by a multi-layer perceptron that outputs the scalar energy.]
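The architecture in the figure can be sketched with plain numpy. Shapes, filter counts, and window sizes below are illustrative and the weights are random; this only shows how the pieces (embeddings + tag distribution → multi-window convolution → max pooling → MLP → scalar energy) fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_net(token_emb, tag_dist, windows=(2, 3), n_filters=4):
    """Sketch of the citation-extraction energy network."""
    # Concatenate token embeddings with the per-token tag distribution.
    x = np.concatenate([token_emb, tag_dist], axis=1)  # (seq_len, feat)
    pooled = []
    for w in windows:
        filt = rng.normal(size=(n_filters, w * x.shape[1])) * 0.1
        # Slide a window of size w over the sequence.
        convs = np.stack([
            filt @ x[i:i + w].ravel() for i in range(len(x) - w + 1)
        ])                                  # (positions, n_filters)
        pooled.append(convs.max(axis=0))    # max pool over positions
    h = np.concatenate(pooled)              # concatenated pooled features
    # Multi-layer perceptron producing the scalar energy.
    W1 = rng.normal(size=(8, h.size)) * 0.1
    w2 = rng.normal(size=8) * 0.1
    return float(w2 @ np.tanh(W1 @ h))

emb = rng.normal(size=(5, 6))    # 5 tokens, 6-dim embeddings
tags = rng.uniform(size=(5, 3))  # distribution over 3 tags per token
e = energy_net(emb, tags)
```

Because the tag distribution enters as a continuous input, the energy is differentiable with respect to it, which is what makes gradient-descent inference possible.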

Performance on Citation Field Extraction

Semi-Supervised Setting

• Alternately use the output of search and the ground-truth labels for training.

Shape Parser

[Figure: given an input image I, the model predicts a program, e.g. an expression tree over primitives such as c(32,32,28), c(32,32,24), and t(32,32,20) combined by + and - operators; parsing the predicted program and rendering it with the graphics engine produces an output image O, which can be compared against the input.]
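Since the graphics engine renders a predicted program back to an image, one plausible reward compares the rendering to the input image, e.g. by intersection-over-union. The tiny rasterizer below handles only circle primitives and is a hypothetical stand-in for the real engine, which also supports squares and triangles.

```python
import numpy as np

def render(program, size=64):
    """Toy graphics engine: rasterize c(x, y, r) circles onto a grid,
    combining them with + (union) and - (subtraction) operators."""
    yy, xx = np.mgrid[:size, :size]
    canvas = np.zeros((size, size), dtype=bool)
    for op, (cx, cy, r) in program:
        disk = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
        canvas = (canvas | disk) if op == "+" else (canvas & ~disk)
    return canvas

def iou_reward(program, target):
    """Reward = IoU between the rendered program and the target image."""
    pred = render(program, size=target.shape[0])
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

# A target image produced by a known program.
target = render([("+", (32, 32, 20))])
```

A search operator can then edit the program (swap primitives, perturb parameters) and keep edits that raise this reward.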

Shape Parser Energy Model

[Figure: a CNN encodes the input image; together with the output distribution over program tokens (e.g. circle(16,16,12), triangle(32,48,16), +, circle(16,24,12), -), a convolutional layer and a multi-layer perceptron produce the scalar energy.]

Search Budget vs. Constraints

Performance on Shape Parser

Conclusion and Future Directions

• If a reward function exists that evaluates every structured output into a scalar value, we can use unlabeled data to train structured prediction energy networks.

• Domain knowledge or non-differentiable pipelines can be used to define the reward functions.
• The main ingredient for learning from the reward function is the search operator.
• Here we only use simple search operators, but more complex search functions derived from domain knowledge can be used for complicated problems.