SeqGAN - Lantao Yu

SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

Feb. 7, 2017 at AAAI

Lantao Yu†, Weinan Zhang†, Jun Wang‡, Yong Yu†

†Shanghai Jiao Tong University, ‡University College London

Email: [email protected]

Problem Definition
• Given a dataset of real-world structured sequences, train a $\theta$-parameterized generative model $G_\theta$ to produce sequences that mimic the real ones.
• In other words, we want $G_\theta$ to fit the unknown true data distribution $p_{\text{true}}(y_t \mid Y_{1:t-1})$, which is only revealed by the given dataset $\mathcal{D} = \{Y_{1:T}\}$.

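• Traditional objective: maximum likelihood estimation (MLE)

$$\max_\theta \frac{1}{|\mathcal{D}|} \sum_{Y_{1:T} \in \mathcal{D}} \sum_t \log\left[G_\theta(y_t \mid Y_{1:t-1})\right]$$

• Checks whether true data has high mass density under the learned model
• Suffers from so-called exposure bias in the inference stage

Training: update the model as follows, where $Y_{1:t-1}$ is the real prefix:

$$\max_\theta \mathbb{E}_{Y \sim p_{\text{true}}} \sum_t \log G_\theta(y_t \mid Y_{1:t-1})$$

Inference: when generating the next token $y_t$, sample from $G_\theta(y_t \mid Y_{1:t-1})$, where $Y_{1:t-1}$ is the guessed prefix.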
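The gap between training and inference can be made concrete with a toy model. The sketch below is only an illustration: the bigram table `G`, the `START` token, and both function names are assumptions introduced here, and a bigram model conditions on just the previous token rather than the whole prefix.

```python
import math
import random

# Hypothetical toy generator: G[prev][nxt] is the probability of token `nxt`
# given the previous token `prev` (a bigram stand-in for G_theta).
START = 0
G = {
    0: [0.1, 0.6, 0.3],
    1: [0.2, 0.2, 0.6],
    2: [0.5, 0.4, 0.1],
}

def mle_objective(dataset):
    """Average log-likelihood: (1/|D|) * sum over Y, t of log G(y_t | real prefix)."""
    total = 0.0
    for seq in dataset:
        prev = START
        for y in seq:
            total += math.log(G[prev][y])  # conditioned on the REAL prefix
            prev = y
    return total / len(dataset)

def sample(length, rng):
    """Inference: each token is conditioned on the model's OWN earlier samples."""
    seq, prev = [], START
    for _ in range(length):
        y = rng.choices([0, 1, 2], weights=G[prev])[0]
        seq.append(y)
        prev = y  # the guessed prefix feeds the next step
    return seq
```

During training the model never sees its own mistakes in the prefix, while `sample` feeds every sampled token back in, which is where exposure bias comes from.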

A promising method: Generative Adversarial Nets (GANs)

• Discriminator tries to correctly distinguish the true data and the fake model-generated data
• Generator tries to generate high-quality data to fool discriminator
• Ideally, when D cannot distinguish the true and generated data, G nicely fits the true underlying data distribution

[Figure: generator G, discriminator D, and data from the real world]

[Goodfellow I, Pouget-Abadie J, Mirza M, et al. 2014. Generative adversarial nets. In NIPS 2014.]

Generator Network in GANs

• Must be differentiable
• Popular implementation: multi-layer perceptron
• Linked with the discriminator and gets guidance from it

$$x = G(z; \theta^{(G)})$$

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Problem for Discrete Data
• On continuous data, there is a direct gradient

$$\nabla_{\theta^{(G)}} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right)$$

  • Guides the generator to (slightly) modify the output
• No direct gradient on discrete data
• Text generation example
  • “I caught a penguin in the park”
  • From Ian Goodfellow: “If you output the word ‘penguin’, you can't change that to "penguin + .001" on the next step, because there is no such word as "penguin + .001". You have to go all the way from "penguin" to "ostrich".”

[https://www.reddit.com/r/MachineLearning/comments/40ldq6/generative_adversarial_networks_for_text/]
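A small numerical check illustrates why the discriminator's signal cannot flow back through a discrete output. This is a sketch under stated assumptions: the logits and the per-token scores in `toy_discriminator_score` are made up for illustration.

```python
def argmax_token(logits):
    """Pick the discrete token with the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def toy_discriminator_score(token):
    """Hypothetical discriminator: a fixed score per discrete token."""
    return [0.2, 0.9, 0.4][token]

logits = [1.0, 0.5, -0.3]  # hypothetical generator outputs
eps = 1e-3

base = toy_discriminator_score(argmax_token(logits))
finite_diffs = []
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps  # a slight change to the generator output
    score = toy_discriminator_score(argmax_token(bumped))
    finite_diffs.append((score - base) / eps)

# Every entry is 0.0: a small change never flips the chosen token,
# so the discriminator score gives no direction for improvement.
print(finite_diffs)
```

The score changes only when a perturbation is large enough to flip the token, which is the jump from "penguin" to "ostrich" in the quote.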

SeqGAN

• Generator $G_\theta(y_t \mid Y_{1:t-1})$ is a reinforcement learning policy of generating a sequence
  • decides the next word to generate (action) given the previous ones as the state
• Discriminator $D_\phi(Y^n_{1:T})$ provides the reward (i.e. the probability of being true data) for the sequence

Sequence Generator
• Objective: to maximize the expected reward

$$J(\theta) = \mathbb{E}[R_T \mid s_0, \theta] = \sum_{y_1 \in \mathcal{Y}} G_\theta(y_1 \mid s_0) \cdot Q_{D_\phi}^{G_\theta}(s_0, y_1)$$

• State-action value function $Q_{D_\phi}^{G_\theta}(s, a)$ is the expected accumulative reward that
  • Starts from state s
  • Taking action a
  • And following policy G until the end
• Reward is only on completed sequence (no immediate reward)

$$Q_{D_\phi}^{G_\theta}(s = Y_{1:T-1}, a = y_T) = D_\phi(Y_{1:T})$$

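State-Action Value Setting
• Reward is only on completed sequence
• No immediate reward
• Then the last-step state-action value

$$Q_{D_\phi}^{G_\theta}(s = Y_{1:T-1}, a = y_T) = D_\phi(Y_{1:T})$$

• For intermediate state-action value
  • Use Monte Carlo search to estimate
  • Following a roll-out policy $G_\beta$

$$\{Y^1_{1:T}, \ldots, Y^N_{1:T}\} = \mathrm{MC}^{G_\beta}(Y_{1:t}; N)$$

$$Q_{D_\phi}^{G_\theta}(s = Y_{1:t-1}, a = y_t) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} D_\phi(Y^n_{1:T}), \; Y^n_{1:T} \in \mathrm{MC}^{G_\beta}(Y_{1:t}; N) & \text{for } t < T \\ D_\phi(Y_{1:t}) & \text{for } t = T \end{cases}$$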
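The Monte Carlo estimate of an intermediate state-action value could be sketched as below. Everything concrete here is an assumption for illustration: the binary vocabulary, the fixed length `T`, a uniform roll-out policy standing in for G_beta, and a toy `discriminator` standing in for D_phi.

```python
import random

VOCAB = [0, 1]  # hypothetical vocabulary
T = 4           # hypothetical fixed sequence length

def rollout_policy(prefix, rng):
    """Hypothetical roll-out policy G_beta: uniform over the vocabulary."""
    return rng.choice(VOCAB)

def discriminator(seq):
    """Hypothetical D_phi: fraction of 1-tokens, standing in for P(real)."""
    return sum(seq) / len(seq)

def q_value(prefix, n_rollouts, rng):
    """Estimate Q(s = Y_{1:t-1}, a = y_t), where prefix = Y_{1:t}."""
    if len(prefix) == T:
        # Last step: the reward is the discriminator output itself.
        return discriminator(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < T:  # complete the sequence with the roll-out policy
            seq.append(rollout_policy(seq, rng))
        total += discriminator(seq)
    return total / n_rollouts  # average reward over the N roll-outs
```

For example, `q_value([1, 1], 16, random.Random(0))` averages the toy reward over 16 completions of the prefix `[1, 1]`.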

Training Sequence Discriminator
• Objective: standard bi-classification

$$\min_\phi \; -\mathbb{E}_{Y \sim p_{\text{data}}}[\log D_\phi(Y)] - \mathbb{E}_{Y \sim G_\theta}[\log(1 - D_\phi(Y))]$$
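On a finite batch, the discriminator objective reduces to a binary cross-entropy. A minimal sketch, assuming the discriminator outputs for the real and generated batches are already available as lists of probabilities (`d_real` and `d_fake` are names introduced here):

```python
import math

def discriminator_loss(d_real, d_fake):
    """Empirical -E_real[log D(Y)] - E_fake[log(1 - D(Y))].

    d_real: D_phi outputs on sequences from the data.
    d_fake: D_phi outputs on sequences sampled from the generator.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

An undecided discriminator (all outputs 0.5) gives a loss of 2 log 2, and the loss falls as real sequences score closer to 1 and generated ones closer to 0.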

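Training Sequence Generator
• Policy gradient (REINFORCE)

[Richard Sutton et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 1999.]

$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{Y_{1:t-1} \sim G_\theta}\Big[\sum_{y_t \in \mathcal{Y}} \nabla_\theta G_\theta(y_t \mid Y_{1:t-1}) \cdot Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t)\Big] \\
&\simeq \frac{1}{T} \sum_{t=1}^{T} \sum_{y_t \in \mathcal{Y}} \nabla_\theta G_\theta(y_t \mid Y_{1:t-1}) \cdot Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t) \\
&= \frac{1}{T} \sum_{t=1}^{T} \sum_{y_t \in \mathcal{Y}} G_\theta(y_t \mid Y_{1:t-1}) \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \cdot Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t) \\
&= \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_t \sim G_\theta(y_t \mid Y_{1:t-1})}\big[\nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \cdot Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t)\big],
\end{aligned}$$

$$\theta \leftarrow \theta + \alpha_h \nabla_\theta J(\theta)$$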
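For a single time step, the last line of the derivation can be estimated by sampling. The sketch below assumes the generator is a plain softmax over a vector of logits (so the parameters are the logits themselves) and that the state-action values are given as a list `q_values`; both are simplifications introduced here.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient(logits, q_values, n_samples, rng):
    """Sample estimate of E_{y ~ G}[grad log G(y) * Q(y)] for one time step."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        y = rng.choices(range(len(logits)), weights=probs)[0]
        g = grad_log_prob(logits, y)
        for i in range(len(grad)):
            grad[i] += g[i] * q_values[y] / n_samples
    return grad

def update(logits, grad, lr):
    """Gradient ascent step: theta <- theta + alpha * grad."""
    return [x + lr * g for x, g in zip(logits, grad)]
```

With `q_values = [0.0, 1.0]` and equal logits, the estimate pushes the second logit up and the first one down, so the token with the higher reward becomes more likely after `update`.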

Overall Algorithm

Sequence Generator Model

• RNN with LSTM cells

[Figure: the network reads "Shanghai is incredibly" and produces "is incredibly ?" by softmax sampling over the vocabulary]

[Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.]

Sequence Discriminator Model

[Kim, Y. 2014. Convolutional neural networks for sentence classification. EMNLP 2014.]

Inconsistency of Evaluation and Use

• Given a generator $G_\theta$ with a certain generalization ability

Evaluation: $\mathbb{E}_{x \sim p_{\text{true}}(x)}[\log G_\theta(x)]$
• Checks whether true data has high mass density under the learned model
• Approximated by

$$\max_\theta \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} [\log G_\theta(x)]$$

Use: $\mathbb{E}_{x \sim G_\theta(x)}[\log p_{\text{true}}(x)]$
• Checks whether model-generated data is considered as real as possible
• More straightforward, but it is hard or impossible to directly calculate $p_{\text{true}}(x)$

Experiments on Synthetic Data
• Evaluation measure with Oracle
  • An oracle model (e.g. the randomly initialized LSTM)
  • Firstly, the oracle model produces some sequences as training data for the generative model
  • Secondly, the oracle model can be considered as the human observer to accurately evaluate the perceptual quality of the generative model

$$\mathrm{NLL}_{\text{oracle}} = -\mathbb{E}_{Y_{1:T} \sim G_\theta}\Big[\sum_{t=1}^{T} \log G_{\text{oracle}}(y_t \mid Y_{1:t-1})\Big]$$
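The oracle metric can be estimated by scoring generator samples with the oracle's conditional probabilities. In this sketch the oracle is a hypothetical bigram table (`ORACLE`) rather than the randomly initialized LSTM mentioned above, and `sequences` stands for samples already drawn from the generator.

```python
import math

# Hypothetical oracle: ORACLE[prev][nxt] = G_oracle(nxt | prev).
START = 0
ORACLE = {
    0: [0.7, 0.3],
    1: [0.4, 0.6],
}

def oracle_nll(sequences):
    """Sample estimate of -E_{Y ~ G_theta}[sum_t log G_oracle(y_t | Y_{1:t-1})]."""
    total = 0.0
    for seq in sequences:
        prev = START
        for y in seq:
            total -= math.log(ORACLE[prev][y])  # oracle scores the generated token
            prev = y
    return total / len(sequences)
```

Lower values mean the generator's samples look more likely to the oracle, so a generator that imitates the oracle well gets a low score.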


Experiments on Synthetic Data
• The training strategy really matters.

Experiments on Real-World Data
• Chinese poem generation
• Obama political speech text generation
• Midi music generation

Experiments on Real-World Data
• Chinese poem generation

南陌春风早，东邻去日斜。

紫陌追随日，青门相见时。

胡风不开花，四气多作雪。

山夜有雪寒，桂里逢客时。

此时人且饮，酒愁一节梦。

四面客归路，桂花开青竹。

Human Machine

Obama Speech Text Generation
• i stood here today i have one and most important thing that not on violence throughout the horizon is OTHERS american fire and OTHERS but we need you are a strong source

• for this business leadership will remember now i can’t afford to start with just the way our european support for the right thing to protect those american story from the world and

• i want to acknowledge you were going to be an outstanding job times for student medical education and warm the republicans who like my times if he said is that brought the

• when he was told of this extraordinary honor that he was the most trusted man in america

• but we also remember and celebrate the journalism that walter practiced a standard of honesty and integrity and responsibility to which so many of you have committed your careers. it's a standard that's a little bit harder to find today

• i am honored to be here to pay tribute to the life and times of the man who chronicled our time.

Human Machine

Summary

• We proposed a sequence generation method, called SeqGAN, to effectively train Generative Adversarial Nets for discrete structured sequence generation via policy gradient.

• We designed an experiment framework with an oracle evaluation metric to accurately evaluate the “perceptual quality” of model-generated sequences.

Thank You

Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
