Transcript of "Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks" (2020. 9. 20.)

Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks
Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum
Structured Prediction

• We are interested in learning a function F: X → Y
  • X: input variables
  • Y: output variables
• We can define F through an energy function E(x, y): F(x) = argmin_y E(x, y)
• For a Gibbs distribution: P(y | x) = exp(−E(x, y)) / Z(x)
StructuredPredictionEnergyNetworks(SPENs)
• Ifisparameterizedusingadifferentiablemodelsuchasadeepneuralnetwork:• WecanfindalocalminimumofEusinggradientdescent
• Theenergynetworksexpressthecorrelationamonginputandoutputvariables.• Traditionallygraphicalmodelsareusedforrepresentingthecorrelationamongoutputvariables.• Inference isintractable formostofexpressive graphicalmodels
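The gradient-descent inference mentioned above can be sketched as follows. This is a minimal illustration: the quadratic toy energy and all names here are stand-ins for a real deep energy network, not the paper's implementation.

```python
import numpy as np

def gd_inference(energy_grad, y0, steps=100, lr=0.1):
    """Minimize a differentiable energy over the relaxed output space [0, 1]^n
    by plain gradient descent, starting from y0."""
    y = y0.astype(float)
    for _ in range(steps):
        y = np.clip(y - lr * energy_grad(y), 0.0, 1.0)  # stay inside the box
    return y

# Toy energy E(y) = ||y - t||^2 with gradient 2*(y - t); its minimum is at t.
t = np.array([0.2, 0.8, 1.0])
y_hat = gd_inference(lambda y: 2 * (y - t), np.zeros(3))
```

With a deep network, `energy_grad` would instead be the network's gradient with respect to its output variables (e.g. via autodiff).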
Energy Models
[picture from Belanger (2016)]
[picture from Altinel (2018)]
Training SPENs

• Structural SVM (Belanger and McCallum, 2016)
• End-to-End (Belanger et al., 2017)
• Value-based training (Gygli et al., 2017)
• Inference Network (Lifu Tu and Kevin Gimpel, 2018)
• Rank-Based Training (Rooshenas et al., 2018)
Indirect Supervision

• Data annotation is expensive, especially for structured outputs.
• Domain knowledge as the source of supervision:
  • It can be written as a reward function R(x, y)
  • R(x, y) evaluates a pair of input and output configurations into a scalar value
  • For a given x, we are looking for the best y that maximizes R(x, y)
Search-Guided Training

[Figure: an animation builds up samples y0 … y5 on the energy surface, each paired with a search-improved neighbor on the reward function; a ranking violation is highlighted.]

• We have a reward function that provides indirect supervision.
• We want to learn a smooth version of the reward function such that we can use gradient-descent inference at test time.
• We sample a point from the energy function using noisy gradient-descent inference.
• Then we project the sample to the domain of the reward function (the sample is a point in the simplex, but the domain of the reward function is often discrete, i.e., the vertices of the simplex).
• Then the search procedure uses the sample as input and returns an output structure by searching the reward function.
• We expect that the two points have the same ranking on the reward function and on the negative of the energy function.
• When we find a pair of points that violates the ranking constraints (a ranking violation), we update the energy function towards reducing the violation.
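The steps above can be sketched end to end on a toy problem. Everything here is an illustrative assumption rather than the paper's setup: a bilinear energy stands in for the deep energy network, the "reward" is the simple separable rule y*(x) = 1[x > 0], the search operator is greedy bit-flipping, and the margin and learning rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.1, size=(n, n))  # parameters of the toy bilinear energy

def energy(x, y, W):
    # illustrative bilinear energy E(x, y) = -y . (W x); not the paper's network
    return -y @ (W @ x)

def reward(x, y):
    # indirect supervision: a hand-written rule plays the reward, y*(x) = 1[x > 0]
    ystar = (x > 0).astype(float)
    return -np.abs(y - ystar).sum()

def noisy_gd_sample(x, W, steps=20, lr=0.5, noise=0.3):
    # noisy gradient-descent inference on the relaxed output space [0, 1]^n
    y = rng.uniform(size=n)
    for _ in range(steps):
        grad = -(W @ x)  # dE/dy for the bilinear energy
        y = np.clip(y - lr * grad + noise * rng.normal(size=n), 0.0, 1.0)
    return y

def project(y):
    # project the relaxed sample to the discrete domain of the reward
    return (y > 0.5).astype(float)

def local_search(x, y):
    # greedy bit-flip search on the reward, starting from the projected sample
    best = y.copy()
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            cand = best.copy()
            cand[i] = 1 - cand[i]
            if reward(x, cand) > reward(x, best):
                best, improved = cand, True
    return best

margin, lr_w = 1.0, 0.1
for _ in range(300):
    x = rng.normal(size=n)
    y_s = project(noisy_gd_sample(x, W))  # sample from the energy function
    y_b = local_search(x, y_s)            # search the reward function
    if reward(x, y_b) > reward(x, y_s):   # search found a strictly better point
        violation = margin - (energy(x, y_s, W) - energy(x, y_b, W))
        if violation > 0:  # ranking violation: E does not prefer y_b by a margin
            # hinge-loss gradient for the bilinear energy:
            # push E(x, y_b) down and E(x, y_s) up
            W += lr_w * (np.outer(y_b, x) - np.outer(y_s, x))
```

After training, gradient-descent inference on the learned energy should recover outputs that the reward ranks highly, even though no labeled (x, y) pairs were used.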
Task-Loss as Reward Function for Multi-Label Classification

• The simplest form of indirect supervision is to use the task loss as the reward function.
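For instance, when labels are available, the reward can simply be a task metric. The example below uses per-example F1 on binary label vectors, a common multi-label metric; the choice of F1 here is an assumption for illustration.

```python
import numpy as np

def f1_reward(y_pred, y_true):
    """Task metric as reward: per-example F1 between a predicted binary
    label vector and the ground-truth labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / y_pred.sum()
    recall = tp / y_true.sum()
    return 2 * precision * recall / (precision + recall)
```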
Domain Knowledge as Reward Function for Citation Field Extraction
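A domain-knowledge reward for citation field extraction can score a candidate tagging against hand-written rules. The specific rules, tags, and weights below are illustrative assumptions, not the exact constraints used in the talk.

```python
import re

def citation_reward(tokens, tags):
    """Score a (tokens, tags) pair with simple domain-knowledge rules.
    Rules and weights are hypothetical examples."""
    score = 0.0
    for tok, tag in zip(tokens, tags):
        # a 4-digit number like 1998 or 2016 is most likely a year
        if re.fullmatch(r"(19|20)\d{2}", tok):
            score += 1.0 if tag == "year" else -1.0
        # tokens like "pp." strongly suggest a pages field
        if tok.lower() in {"pp.", "pages"}:
            score += 1.0 if tag == "pages" else -1.0
    # structural rule: author tokens usually precede the title
    if "author" in tags and "title" in tags:
        last_author = max(i for i, t in enumerate(tags) if t == "author")
        first_title = min(i for i, t in enumerate(tags) if t == "title")
        if last_author < first_title:
            score += 1.0
    return score
```

A search operator can then flip tags to climb this score, with no labeled citations needed.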
Energy Model

[Figure: the energy network for citation field extraction — token input embeddings (e.g. "Wei Li.", "Deep Learning", "for", …) and the tag distribution (author, title, …) feed a convolutional layer with multiple filters and different window sizes, followed by max pooling and concatenation, and a multi-layer perceptron that outputs the energy.]
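The architecture in the figure can be sketched in plain numpy. All dimensions, parameter scales, and the exact layer wiring below are assumptions made for illustration; the real model is a trained deep network.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T, F = 100, 8, 6, 4  # vocab size, embed dim, #tags, filters per window size
emb = rng.normal(scale=0.1, size=(V, D))  # token embedding table

def conv_energy(token_ids, tag_dist, params):
    """Sketch of the figure's energy network: concatenate token embeddings
    with per-token tag distributions, apply 1-D convolutions with several
    window sizes, max-pool each filter over positions, and score the
    concatenated features with a multi-layer perceptron."""
    X = np.concatenate([emb[token_ids], tag_dist], axis=1)  # (L, D + T)
    pooled = []
    for w, K in params["filters"].items():  # K has shape (F, w * (D + T))
        L = X.shape[0] - w + 1
        windows = np.stack([X[i:i + w].ravel() for i in range(L)])
        pooled.append((windows @ K.T).max(axis=0))  # max pooling per filter
    h = np.concatenate(pooled)                      # concatenation
    h = np.tanh(h @ params["W1"] + params["b1"])    # MLP hidden layer
    return float(h @ params["w2"])                  # scalar energy

params = {
    "filters": {2: rng.normal(scale=0.1, size=(F, 2 * (D + T))),
                3: rng.normal(scale=0.1, size=(F, 3 * (D + T)))},
    "W1": rng.normal(scale=0.1, size=(2 * F, 8)),
    "b1": np.zeros(8),
    "w2": rng.normal(scale=0.1, size=8),
}
```

Because the energy is differentiable in `tag_dist`, gradient-descent inference over the tag distribution is possible at test time.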
Performance on Citation Field Extraction
Semi-Supervised Setting

• Alternately use the output of search and the ground-truth labels for training.
Shape Parser

[Figure: an input image I is parsed into a program, e.g. c(32,32,28) − c(32,32,24) + t(32,32,20); a graphic engine executes the predicted program to render an output image O, which can be compared with the input.]

• Parsing: predict the program that generated the input image.
• The graphic engine renders the predicted program, so the match between the rendered output and the input image can serve as the reward, without labeled programs.
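The render-and-compare reward can be sketched concretely. The circle-only rasterizer, the program format, and the IoU metric below are illustrative assumptions; the actual shape parser handles more primitives and a full constructive-geometry grammar.

```python
import numpy as np

def render_circle(cx, cy, r, size=64):
    """Rasterize a filled circle on a size x size boolean canvas."""
    yy, xx = np.mgrid[0:size, 0:size]
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2

def execute(program, size=64):
    """Toy graphic engine: apply ('+', shape) / ('-', shape) ops left to right,
    where '+' unions the shape onto the canvas and '-' subtracts it."""
    canvas = np.zeros((size, size), dtype=bool)
    for op, (cx, cy, r) in program:
        mask = render_circle(cx, cy, r, size)
        canvas = (canvas | mask) if op == "+" else (canvas & ~mask)
    return canvas

def iou_reward(pred_program, target_image):
    """Reward = intersection-over-union between the rendered prediction
    and the target (input) image."""
    rendered = execute(pred_program)
    inter = np.logical_and(rendered, target_image).sum()
    union = np.logical_or(rendered, target_image).sum()
    return inter / union if union else 1.0
```

A search operator can mutate the predicted program (change a primitive, an argument, or an operator) and keep mutations that increase this reward.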
Shape Parser Energy Model

[Figure: the input image passes through a CNN; the output distribution over program tokens (e.g. circle(16,16,12), triangle(32,48,16), +, −) passes through a convolutional layer; the combined features feed a multi-layer perceptron that outputs the energy.]
Search Budget vs. Constraints

Performance on Shape Parser
ConclusionandFutureDirections
• Ifarewardfunctionexiststoevaluateeverystructuredoutputintoascalarvalue• Wecanuseunlabled datafortrainingstructuredpredictionenergynetworks
• Domainknowledgeornon-differentiablepipelinescanbeusedtodefinetherewardfunctions.• Themainingredientforlearningfromtherewardfunctionisthesearchoperator.• Hereweonlyusesimplesearchoperators,butmorecomplexsearchfunctionsderivedfromdomainknowledgecanbeusedforcomplicatedproblems.