Adversarial Learning
Transcript of Adversarial Learning
Adversarial Learning
- What is a GAN?
- Some mathematical background
- Algorithms for training a GAN
- The Wasserstein GAN
- Conditional GANs
How do we sample from a distribution?

$$Y \sim p(y)$$

- Parametric methods:
  - Some known parametric distribution:
$$p(y) = \frac{1}{z} \exp\left(-u(y)\right)$$
  - Markov chain Monte Carlo (MCMC) methods
  - Hastings-Metropolis sampling
  - Requires a known distribution
- Non-parametric methods:
  - Provided with samples $y_0, \cdots, y_{K-1}$
  - Infer the distribution from the samples
  - Generator (see the sketch below):
$$Y = h_\theta(Z)$$
    where $Z \sim N(0, I)$ is the source of randomness (a multivariate random vector in $\mathbb{R}^n$) and $Y$ is a random vector with the desired distribution (a multivariate density in $\mathbb{R}^n$).

https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html
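A minimal numerical sketch of this idea (my toy example, not from the slides): a fixed affine map stands in for the learned $h_\theta$, pushing Gaussian noise $Z \sim N(0, I)$ forward to samples with a different mean and covariance.

```python
# Toy sketch: Y = h_theta(Z) with h_theta an assumed affine map A z + b.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])              # stand-in for the generator h_theta
b = np.array([1.0, -1.0])

z = rng.standard_normal((100000, 2))    # Z ~ N(0, I), source of randomness
y = z @ A.T + b                         # Y = h_theta(Z), desired distribution
print(np.round(y.mean(axis=0), 2))      # approximately b
print(np.round(np.cov(y.T), 2))         # approximately A A^T
```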
Training a Generator

- Function of a GAN:
  - Generates samples $\hat{y}_k$ with the same distribution as $y_k$.
  - Can use drop-outs to generate randomness.
- Training algorithm:
  - Compare the distributions of $\hat{y}_k$ and $y_k$.
  - Feed back parameter corrections.

Figure: the generator $\hat{y}_k = h_\theta(z_k)$ maps an independent noise source $z_k$ to generated samples $\hat{y}_k$; the training algorithm compares them with the reference samples $y_k$ and feeds back corrections to the parameter $\theta$.
The Bayes Discriminator

What is the probability that an observation, $y$, is real?

- Use Bayes' rule:
$$P(\text{Class}=R \mid y) = \frac{p(y \mid \text{Class}=R)\, P(\text{Class}=R)}{p(y \mid \text{Class}=R)\, P(\text{Class}=R) + p(y \mid \text{Class}=F)\, P(\text{Class}=F)}$$
- Assuming $p(y \mid \text{Class}=F) = p_{\theta_g}(y)$, $p(y \mid \text{Class}=R) = p_r(y)$, and $P(\text{Class}=R) = P(\text{Class}=F) = \frac{1}{2}$, we have that
$$P(\text{Class}=R \mid y) = \frac{p_r(y)\,\tfrac{1}{2}}{p_r(y)\,\tfrac{1}{2} + p_{\theta_g}(y)\,\tfrac{1}{2}} = \frac{1}{1 + l_{\theta_g}(y)}$$
  - where $l_{\theta_g}(y) = \dfrac{p_{\theta_g}(y)}{p_r(y)}$ is the likelihood ratio.

Figure: generated samples $\hat{y}_k = h_{\theta_g}(z_k)$ from the independent noise source $z_k$ are labeled "F" (fake); the reference samples $y_k$ are labeled "R" (real).
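A quick numerical check of this formula (my toy example, not from the slides): with two known one-dimensional densities standing in for the reference and generated distributions, $1/(1 + l_{\theta_g}(y))$ agrees with Bayes' rule applied directly.

```python
# Toy sketch: verify P(Class = R | y) = 1/(1 + l(y)) for two known densities
# with equal priors P(R) = P(F) = 1/2.
import numpy as np
from scipy.stats import norm

p_r = norm(loc=0.0, scale=1.0).pdf     # "real" density p_r(y) (assumed)
p_g = norm(loc=1.0, scale=2.0).pdf     # "generated" density p_theta_g(y)

y = np.linspace(-4.0, 6.0, 5)
l = p_g(y) / p_r(y)                    # likelihood ratio l(y)
posterior = 1.0 / (1.0 + l)            # P(Class = R | y)

bayes = (0.5 * p_r(y)) / (0.5 * p_r(y) + 0.5 * p_g(y))
assert np.allclose(posterior, bayes)
print(np.round(posterior, 3))
```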
Implementing a Bayes Discriminator

- Bayesian discriminator:
$$f_{\theta_d}(y) \approx P(\text{Class}=R \mid y) = \frac{p_r(y)}{p_r(y) + p_{\theta_g}(y)} = \frac{1}{1 + l_{\theta_g}(y)}$$
  where
$$l_{\theta_g}(y) = \frac{p_{\theta_g}(y)}{p_r(y)}$$
  is the likelihood ratio.

Figure: the discriminator $\hat{p}_k = f_{\theta_d}(\hat{y}_k)$ is applied to the generated samples $\hat{y}_k = h_{\theta_g}(z_k)$ (outputs should be mostly 0s), and $p_k = f_{\theta_d}(y_k)$ is applied to the reference samples $y_k$ (outputs should be mostly 1s).
Bayes Discriminator and the Likelihood Ratio

- Bayesian discriminator:
$$f_{\theta_d}(y) \approx P(\text{Class}=R \mid y) = \frac{p_r(y)}{p_r(y) + p_{\theta_g}(y)} = \frac{1}{1 + l_{\theta_g}(y)}$$

Figure: plots of the generated distribution $p_{\theta_g}(y)$ and the reference distribution $p_r(y)$ versus $y$, together with the likelihood ratio $l_{\theta_g}(y)$; where $l_{\theta_g}(y) < 1$ the observation is classified as "real", and where $l_{\theta_g}(y) > 1$ it is classified as "fake".
Training a Bayes Discriminator

- Discriminator loss function:
$$\hat{D}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} \left[ -\log f_{\theta_d}(y_k) - \log\left(1 - f_{\theta_d}(\hat{y}_k)\right) \right]$$
- Optimal discriminator parameter:
$$\theta_d^* = \arg\min_{\theta_d} \hat{D}(\theta_g, \theta_d)$$
  - Results in the ML estimate of the Bayes classifier parameters.

Figure: generated samples $\hat{y}_k = h_{\theta_g}(z_k)$ pass through the discriminator $\hat{p}_k = f_{\theta_d}(\hat{y}_k)$ and are scored with CrossEntropy$(\hat{p}_k, 0)$ (0 = generated); reference samples $y_k$ pass through $p_k = f_{\theta_d}(y_k)$ and are scored with CrossEntropy$(p_k, 1)$ (1 = reference); the two terms sum to $\hat{D}(\theta_g, \theta_d)$.
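A short sketch of this loss in code (assuming a PyTorch setup; the function names are mine, not from the slides): the empirical discriminator loss is just the sum of two binary cross-entropy terms with targets 1 for reference samples and 0 for generated samples.

```python
# Sketch: D_hat = (1/K) sum_k [-log f(y_k) - log(1 - f(y^_k))]
import torch
import torch.nn.functional as F

def discriminator_loss(f, y_real, y_fake):
    p_real = f(y_real)                  # p_k = f_theta_d(y_k), target 1
    p_fake = f(y_fake)                  # p^_k = f_theta_d(y^_k), target 0
    return (F.binary_cross_entropy(p_real, torch.ones_like(p_real)) +
            F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
```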
Training a Generator

- Big idea: Maximize the probability that outputs of the generator are classified as being from the reference distribution.
  - $\hat{p}_k$ should be large.
  - $L(\hat{p}_k)$ should be small when $\hat{p}_k$ is large.
  - $L(\hat{p})$ should be a decreasing function of $\hat{p}$.
- Generator loss function:
$$\hat{G}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} L\left(f_{\theta_d}(\hat{y}_k)\right)$$
- Optimal generator parameter:
$$\theta_g^* = \arg\min_{\theta_g} \hat{G}(\theta_g, \theta_d)$$

Figure: the generated samples $\hat{y}_k = h_{\theta_g}(z_k)$ pass through the discriminator $\hat{p}_k = f_{\theta_d}(\hat{y}_k)$ and into the loss function $L(\hat{p}_k)$; the loss should encourage $\hat{p}_k$ to be large.
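The generic generator loss can be written the same way (again a PyTorch sketch with assumed names): any decreasing $L$ can be plugged in; the default below is the non-saturating choice $L(\hat{p}) = -\log \hat{p}$ discussed on a later slide.

```python
# Sketch: G_hat = (1/K) sum_k L(f_theta_d(h_theta_g(z_k)))
import torch

def generator_loss(f, h, z, L=lambda p: -torch.log(p)):
    y_fake = h(z)                       # y^_k = h_theta_g(z_k)
    p_fake = f(y_fake)                  # p^_k = f_theta_d(y^_k)
    return L(p_fake).mean()             # should encourage p^_k to be large
```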
Generative Adversarial Network (GAN)*

- Generator loss function:
$$\hat{G}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} L\left(f_{\theta_d}(\hat{y}_k)\right)$$
- Discriminator loss function:
$$\hat{D}(\theta_g, \theta_d) = \frac{1}{K} \sum_{k=0}^{K-1} \left[ -\log f_{\theta_d}(y_k) - \log\left(1 - f_{\theta_d}(\hat{y}_k)\right) \right]$$

Figure: the full GAN architecture: the generator $\hat{y}_k = h_{\theta_g}(z_k)$ feeds the discriminator $\hat{p}_k = f_{\theta_d}(\hat{y}_k)$ and the generator loss $L(\hat{p}_k)$; the reference samples $y_k$ feed the discriminator $p_k = f_{\theta_d}(y_k)$; the terms CrossEntropy$(\hat{p}_k, 0)$ (0 = generated) and CrossEntropy$(p_k, 1)$ (1 = reference) sum to $\hat{D}(\theta_g, \theta_d)$.

*Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, "Generative Adversarial Networks", Proc. of the International Conference on Neural Information Processing Systems (NIPS 2014).
GAN: Expected Loss Functions

- Generator loss function:
$$G(\theta_g, \theta_d) = E\left[ L\left(f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$$
- Discriminator loss function:
$$D(\theta_g, \theta_d) = E\left[ -\log f_{\theta_d}(Y) \right] + E\left[ -\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right) \right]$$
- By the weak and strong laws of large numbers, $\lim_{K\to\infty} \hat{G} = G$ and $\lim_{K\to\infty} \hat{D} = D$.

Figure: the same architecture drawn with random variables: $\hat{Y}_{\theta_g} = h_{\theta_g}(Z)$ and $\hat{P}_{\theta_g} = f_{\theta_d}(\hat{Y}_{\theta_g})$ on the generated branch, $P = f_{\theta_d}(Y)$ on the reference branch; the cross-entropy terms CrossEntropy$(\hat{P}, 0)$ and CrossEntropy$(P, 1)$ feed $D(\theta_g, \theta_d)$, and the loss $L(\hat{P}_{\theta_g})$ feeds $G(\theta_g, \theta_d)$.
Generator Loss Function Choices

- Option 0: Original loss function proposed in [1].
  - $L(p) = \log(1 - p)$, so $L(0) = 0$ and $L(1) = -\infty$
  - $G(\theta_g, \theta_d) = E\left[\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right)\right]$
  - Presented as the key theoretically grounded approach in the Goodfellow paper.
  - Consistent with zero-sum-game Nash equilibrium theory.
  - Almost no one uses it.
- Option 1: "Non-saturating" loss function, i.e., the "$-\log D$ trick".
  - $L(p) = -\log p$, so $L(0) = +\infty$ and $L(1) = 0$
  - $G(\theta_g, \theta_d) = E\left[-\log f_{\theta_d}(\hat{Y}_{\theta_g})\right]$
  - Mentioned as a trick in [1] to keep the training loss from "saturating".
  - This is what you get if you use the cross-entropy loss for the generator.
  - This is what is commonly done (see the gradient comparison sketch below).

[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, "Generative Adversarial Networks", Proc. of the International Conference on Neural Information Processing Systems (NIPS 2014).
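A small gradient comparison (my example, not from [1]) shows why Option 1 avoids saturation: when the discriminator confidently rejects a generated sample ($p$ near 0), the Option 0 gradient stays near $-1$, while the Option 1 gradient grows like $-1/p$ and keeps driving updates.

```python
# Sketch: compare dL/dp for L0(p) = log(1 - p) and L1(p) = -log(p).
import torch

for p0 in (0.01, 0.5, 0.99):
    p = torch.tensor(p0, requires_grad=True)
    torch.log(1.0 - p).backward()        # Option 0 (original)
    g0 = p.grad.item()

    p = torch.tensor(p0, requires_grad=True)
    (-torch.log(p)).backward()           # Option 1 (non-saturating)
    g1 = p.grad.item()
    print(f"p = {p0:.2f}   dL0/dp = {g0:+8.2f}   dL1/dp = {g1:+8.2f}")
```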
GAN Architecture (Non-Saturating)

- Generator loss function:
$$G(\theta_g, \theta_d) = E\left[-\log f_{\theta_d}(\hat{Y}_{\theta_g})\right]$$
- Discriminator loss function:
$$D(\theta_g, \theta_d) = E\left[-\log f_{\theta_d}(Y)\right] + E\left[-\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g})\right)\right]$$
- By the weak and strong laws of large numbers, $\lim_{K\to\infty} \hat{G} = G$ and $\lim_{K\to\infty} \hat{D} = D$.

Figure: the same architecture as before, with the generator loss implemented as CrossEntropy$(\hat{P}, 1)$ and the discriminator loss as CrossEntropy$(\hat{P}, 0)$ (0 = generated) plus CrossEntropy$(P, 1)$ (1 = reference).
GAN Equilibrium Conditions (Non-Saturating)

- We would like to find the solution to:
$$\theta_g^* = \arg\min_{\theta_g} G(\theta_g, \theta_d^*) = \arg\min_{\theta_g} E\left[-\log f_{\theta_d^*}(\hat{Y}_{\theta_g})\right]$$
$$\theta_d^* = \arg\min_{\theta_d} D(\theta_g^*, \theta_d) = \arg\min_{\theta_d} E\left[-\log f_{\theta_d}(Y)\right] + E\left[-\log\left(1 - f_{\theta_d}(\hat{Y}_{\theta_g^*})\right)\right]$$
  - This is known as a Nash equilibrium.
  - We would like it to converge to $l^* = 1$ (generated = reference distributions).
- How do we solve this?
- Will it converge?
Nash Equilibrium with Two Agents*

- Agent $G$:
  - Controls parameter $\theta_g$
  - Goal is to minimize $G(\theta_g, \theta_d^*)$
- Agent $D$:
  - Controls parameter $\theta_d$
  - Goal is to minimize $D(\theta_g^*, \theta_d)$
- Each agent tries to minimize its own meter:
$$G(\theta_g^*, \theta_d^*) = \min_{\theta_g} G(\theta_g, \theta_d^*)$$
$$D(\theta_g^*, \theta_d^*) = \min_{\theta_d} D(\theta_g^*, \theta_d)$$

Figure: the $G$ agent turns the knob $\theta_g$ while watching the meter $G(\theta_g, \theta_d)$; the $D$ agent turns the knob $\theta_d$ while watching the meter $D(\theta_g, \theta_d)$.

*Graphics and art reproduced from the "stick figure" Wiki page.
Zero-Sum Game: Special Nash Equilibrium*

- Special case when $G(\theta_g, \theta_d) = -D(\theta_g, \theta_d)$
- Agent $G$:
  - Goal is to minimize $G(\theta_g, \theta_d^*)$, i.e., to maximize $D(\theta_g, \theta_d^*)$
- Agent $D$:
  - Goal is to minimize $D(\theta_g^*, \theta_d)$
- This gives the adversarial relationship:
$$D(\theta_g^*, \theta_d^*) = \max_{\theta_g} D(\theta_g, \theta_d^*)$$
$$D(\theta_g^*, \theta_d^*) = \min_{\theta_d} D(\theta_g^*, \theta_d)$$

Figure: the $G$ agent maximizes the meter $D(\theta_g, \theta_d)$ with the knob $\theta_g$; the $D$ agent minimizes the same meter with the knob $\theta_d$.

*Graphics and art reproduced from the "stick figure" Wiki page.
Computing the GAN Equilibrium

- Reparameterizing the equations
- Alternating minimization → mode collapse
- Generator loss gradient descent
- Practical convergence issues
Reparameterize Loss Functions

- Goal: Replace $\theta_g$ and $\theta_d$ with $l$ and $f$.
- Generator parameter $l$:
  - Generated samples are
$$\hat{Y} \sim p_r(y)\, l(y) = p_{\theta_g}(y)$$
    where $l(y) = \dfrac{p_{\theta_g}(y)}{p_r(y)}$
- Discriminator parameter $f$:
  - Discriminator is
$$f(y) = P(\text{Class}=R \mid y)$$
- Important facts:
  - $E[l(Y)] = 1$
  - $\Omega_g = \left\{ l : \mathbb{R}^n \to [0, \infty) \text{ such that } E[l(Y)] = 1 \right\}$
  - $\Omega_d = \left\{ f : \mathbb{R}^n \to [0, 1] \right\}$
  - For any function $h(y)$, $E[h(\hat{Y})] = E[h(Y)\, l(Y)]$
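The last fact is a change-of-measure identity, easy to verify numerically (my toy example, not from the slides):

```python
# Sketch: check E[h(Y_hat)] = E[h(Y) l(Y)] for discrete distributions,
# where l(y) = p_g(y)/p_r(y), Y ~ p_r, and Y_hat ~ p_g.
import numpy as np

y = np.array([0.0, 1.0, 2.0])
p_r = np.array([0.5, 0.3, 0.2])          # reference distribution (assumed)
p_g = np.array([0.2, 0.3, 0.5])          # generated distribution (assumed)
l = p_g / p_r                            # likelihood ratio

h = lambda v: v ** 2                     # any test function h(y)
lhs = np.sum(p_g * h(y))                 # E[h(Y_hat)]
rhs = np.sum(p_r * h(y) * l)             # E[h(Y) l(Y)]
assert np.isclose(np.sum(p_r * l), 1.0)  # E[l(Y)] = 1
assert np.isclose(lhs, rhs)
```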
GAN Equilibrium Conditions

- We would like to find the solution to:
$$l^* = \arg\min_{l \in \Omega_g} G(l, f^*)$$
$$f^* = \arg\min_{f \in \Omega_d} D(l^*, f)$$
  - This is known as a Nash equilibrium.
  - We would like it to converge to $l^* = 1$ (generated = reference distributions).
- How do we do this?
- Will it converge?
Reparameterized Loss Functions

- Generator loss function:
$$G(l, f) = E\left[-\log f(\hat{Y})\right] = E\left[-l(Y) \log f(Y)\right]$$
- Discriminator loss function:
$$D(l, f) = E\left[-\log f(Y)\right] + E\left[-\log\left(1 - f(\hat{Y})\right)\right] = E\left[-\log f(Y)\right] + E\left[-l(Y) \log\left(1 - f(Y)\right)\right]$$
- Nash equilibrium:
$$l^* = \arg\min_{l \in \Omega_g} G(l, f^*), \qquad f^* = \arg\min_{f \in \Omega_d} D(l^*, f)$$
Method 1: Alternating Minimization

- Algorithm:
  Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} D(l^*, f)$
    $l^* \leftarrow \arg\min_{l \in \Omega_g} G(l, f^*)$
  }
- Discriminator update:
$$f^*(y) \leftarrow \frac{1}{1 + l(y)}$$
- Generator update:
$$l^*(y) \leftarrow \delta(y - y') \quad \text{where } y' = \arg\max_y f^*(y)$$
- Doesn't work!
- Problem:
  - This is called "mode collapse".
  - It only generates the sample that the discriminator likes best.
  - Intuition (cartoon on the slide): "We come from France." "I like cheese steaks." "Too good to be true." "Too creepy to be real."
Method 2: Generator Loss Gradient Descent (GLGD)

- Algorithm ("GLGD" is my term):
  Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} D(l, f)$
    $l \leftarrow l - \alpha\, P_\Omega\!\left[\nabla_l G(l, f^*)\right]$   (gradient descent step, with projection onto the valid parameter space)
  }
- Discriminator update:
$$f^*(y) \leftarrow \frac{1}{1 + l(y)}$$
- Generator update:
  - Take a step in the negative direction of the generator loss gradient.
  - $P_\Omega$ projects into the allowed parameter space (this is not an issue in practice):
$$P_\Omega\, g(y) = g(y) - \frac{E[g(Y)]}{E[g^2(Y)]}\, g^2(y)$$
- Questions/Comments:
  - Can be applied with a wide variety of generator/discriminator loss functions.
  - Does this converge?
  - If so, then what (if anything) is being minimized?

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
GLGD Convergence for Non-Saturating GAN

- For the non-saturating GAN, when $f^* = \arg\min_{f \in \Omega_d} D(l, f)$, it can be shown that*
$$\nabla_l\, G(l, f^*) = \nabla_l\, C(l)$$
  where
$$C(l) = E\left[\left(1 + l(Y)\right)\log\left(1 + l(Y)\right)\right]$$
- Conclusions (a numerical check follows below):
  - GLGD is really a gradient descent algorithm for the cost function $C(l)$.
  - $(1 + x)\log(1 + x)$ is a strictly convex function of $x$.
  - Therefore, we know that $C(l)$ has a unique global minimum at $l(Y) = 1$.
  - However, convergence tends to be slow.
- Algorithm:
  Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} D(l, f)$
    $l \leftarrow l - \alpha\, P_\Omega\!\left[\nabla_l G(l, f^*)\right]$
  }

*This is an equivalent expression to Theorem 2.5 of [1].
[1] Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
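The constrained minimum at $l = 1$ can be sanity-checked numerically (my toy example, not from [1]): over a discrete reference distribution, no feasible $l$ with $E[l(Y)] = 1$ achieves a lower $C(l)$ than the constant $l = 1$, where $C(1) = 2\log 2$.

```python
# Sketch: C(l) = E[(1 + l(Y)) log(1 + l(Y))], minimized over {l >= 0 : E[l] = 1}.
import numpy as np

rng = np.random.default_rng(0)
p_r = np.array([0.5, 0.3, 0.2])          # reference distribution (assumed)

def C(l):
    return np.sum(p_r * (1.0 + l) * np.log(1.0 + l))

c_star = C(np.ones(3))                   # l = 1 gives C = 2 log 2
for _ in range(1000):
    l = rng.uniform(0.0, 5.0, size=3)
    l /= np.sum(p_r * l)                 # rescale so that E[l(Y)] = 1
    assert C(l) >= c_star - 1e-12
print(f"C(1) = {c_star:.4f},  2 log 2 = {2.0 * np.log(2.0):.4f}")
```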
More Details on GLGD

- Arjovsky and Bottou showed that for the non-saturating GAN*
$$\nabla_l\, G(l, f^*) = \nabla_l \left[ KL\!\left(p_l \,\|\, p_r\right) - 2\, JSD\!\left(p_l \,\|\, p_r\right) \right]$$
- So, from the previous identities, we have that:
$$\nabla_l\, G(l, f^*) = \nabla_l \left[ 2\, KL\!\left( \tfrac{p_l + p_r}{2} \,\Big\|\, p_r \right) \right] = \nabla_l\, E\left[\left(1 + l(Y)\right)\log\left(1 + l(Y)\right)\right]$$
- Conclusions:
  - GLGD is really a gradient descent algorithm for the cost function
$$C(l) = 2\, KL\!\left( \tfrac{p_l + p_r}{2} \,\Big\|\, p_r \right)$$
  - $C(l)$ has a unique global minimum at $p_l = p_r$.
  - However, convergence tends to be slow.
- Algorithm:
  Repeat {
    $f^* \leftarrow \arg\min_{f \in \Omega_d} D(l, f)$
    $l \leftarrow l - \alpha\, P_\Omega\!\left[\nabla_l G(l, f^*)\right]$
  }

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
Convergence of GANs

- Generator and discriminator at convergence:
  - Reference and generated distributions are the same ⇒ $l^*(y) = 1$
  - The discriminator cannot distinguish the distributions ⇒ $f^*(y) = \frac{1}{1 + l^*(y)} = \frac{1}{2}$
- At convergence the generated and reference distributions are identical:
  - The likelihood ratio is $l(y) = 1$;
  - The generated (fake) and reference (real) distributions are identical;
  - The discriminator assigns a 50/50 probability to either case because they are indistinguishable;
  - Then both the generator and discriminator cross-entropy losses are $-\log\frac{1}{2} \approx 0.693$.
- In practice, things don't usually work out this well…
Method 2: Practical Algorithm

- Algorithm (a PyTorch sketch follows below):
  Repeat {
    For $N_d$ iterations {
      $B \leftarrow \text{GetMiniBatch}()$
      $\theta_d \leftarrow \theta_d - \beta\, \nabla_{\theta_d} D(\theta_g, \theta_d; B)$
    }
    $B \leftarrow \text{GetMiniBatch}()$
    $\theta_g \leftarrow \theta_g - \alpha\, \nabla_{\theta_g} G(\theta_g, \theta_d; B)$
  }
- What you would like to see:
  Figure: loss versus iteration number, with both the generator loss and ½ the discriminator loss settling at $-\log\frac{1}{2} \approx 0.693$.
- Looks good, but…
  - Could result from a discriminator with insufficient capacity.
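A compact PyTorch sketch of this loop (the networks, optimizers, and data source are my stand-ins, not from the slides):

```python
# Sketch: non-saturating GAN training, N_d discriminator steps per generator step.
import torch
import torch.nn as nn

n, m, K, N_d = 16, 8, 32, 5
h = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, n))  # generator
f = nn.Sequential(nn.Linear(n, 64), nn.ReLU(),
                  nn.Linear(64, 1), nn.Sigmoid())                 # discriminator
opt_d = torch.optim.Adam(f.parameters(), lr=1e-4)                 # step size beta
opt_g = torch.optim.Adam(h.parameters(), lr=1e-4)                 # step size alpha
bce = nn.BCELoss()

def get_minibatch():                                 # stand-in reference data
    return torch.randn(K, n) * 2.0 + 1.0

for it in range(1000):
    for _ in range(N_d):                             # discriminator updates
        y, y_fake = get_minibatch(), h(torch.randn(K, m)).detach()
        p, p_fake = f(y), f(y_fake)
        d_loss = (bce(p, torch.ones_like(p)) +
                  bce(p_fake, torch.zeros_like(p_fake)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    p_fake = f(h(torch.randn(K, m)))                 # generator update
    g_loss = bce(p_fake, torch.ones_like(p_fake))    # -log f: non-saturating
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```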
Failure Mode: Mode Collapse

- Algorithm:
  Repeat {
    For $i = 0$ to $N_d - 1$ {
      $\theta_d \leftarrow \theta_d - \beta\, \nabla_{\theta_d} D(\theta_g, \theta_d)$
    }
    $\theta_g \leftarrow \theta_g - \alpha\, \nabla_{\theta_g} G(\theta_g, \theta_d)$
  }
- Sometimes you get mode collapse:
  Figure: loss versus iteration number, with the generator loss climbing while ½ the discriminator loss drops below $\approx 0.693$; the discriminator dominates.
- Might be caused by:
  - Overfitting by the discriminator
  - Insufficient number of discriminator updates
  - Insufficient generator capacity
Concept: Wasserstein GAN*

- Problem with GAN training using the "$-\log D$ trick":
  - Slow and sometimes unstable convergence
  - Problems with vanishing gradients
- Conjecture:
  - The problem is caused by the discriminator function.
  - The Bayes classifier is too sensitive and too nonlinear.
  - Non-overlapping distributions create vanishing gradients that slow convergence.
- Idea: base the discriminator on the Wasserstein distance (i.e., the earth mover's distance).

Figure: reproduced from the paper.

*Martin Arjovsky, Soumith Chintala and Leon Bottou, "Wasserstein Generative Adversarial Networks", ICML 2017.
Wasserstein Fundamentals

- Based on the Kantorovich-Rubinstein duality (Villani, 2009)*:
$$W\!\left(p_r \,\|\, p_\theta\right) = \sup_{\|f\|_L \leq 1} E\left[f(Y)\right] - E\left[f(\hat{Y}_\theta)\right]$$
  - where $Y \sim p_r$, $\hat{Y}_\theta \sim p_\theta$, and $\|f\|_L$ is the Lipschitz constant of $f$
  - $\|f\|_L \leq 1$ is referred to as the 1-Lipschitz condition

*Villani, Cedric. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
Wasserstein GAN*

- The fundamental result of Arjovsky and Bottou:
  - Define $\theta_g \in \Omega_g$ as usual, but define $f \in \Omega_d^W$ so that
$$\Omega_d^W = \left\{ f : \mathbb{R}^n \to \mathbb{R} \ \text{ s.t. } \|f\|_L \leq 1 \right\}$$
  - Then define the Wasserstein generator and discriminator loss functions as
$$G_W(\theta_g, f) = E\left[-f(\hat{Y}_{\theta_g})\right]$$
$$D_W(\theta_g, f) = E\left[f(\hat{Y}_{\theta_g})\right] - E\left[f(Y)\right]$$
- Key result from the Arjovsky paper*:
$$\nabla_{\theta_g} G_W(\theta_g, f^*) = \nabla_{\theta_g} W\!\left(p_{\theta_g} \,\|\, p_r\right)$$
  - where
$$f^* = \arg\min_{f \in \Omega_d^W} D_W(\theta_g, f)$$

*Martin Arjovsky, Soumith Chintala, and Leon Bottou, "Wasserstein Generative Adversarial Networks", ICML 2017.
Method 3: Wasserstein Algorithm

- Algorithm:
  Repeat {
    $f^* = \arg\min_{f \in \Omega_d^W} D_W(\theta_g, f)$
    $\theta_g \leftarrow \theta_g - \alpha\, \nabla_{\theta_g} G_W(\theta_g, f^*)$
  }
- Discriminator update:
  - How do we solve the problem of minimizing the discriminator loss with the Lipschitz constraint?
  - Answer: We clip the discriminator weights during training.
  - Observation: Isn't this just regularization of the discriminator DNN?
- Generator update:
  - Take a step in the negative direction of the generator loss gradient.

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
Method 3: Wasserstein Practical Algorithm

- Algorithm (annotations from the slide; see the sketch below):
  - Make sure to get new batches.
  - Iterate the discriminator to approximate convergence.
  - Minimize the discriminator loss.
  - Clip the discriminator weights to approximate the Lipschitz constraint.
  - Take a gradient step of the generator loss.
- Observations:
  - Some people seem to feel the Wasserstein GAN has better convergence.
  - However, is this because of the Wasserstein metric?
  - Or is it because of the other algorithmic improvements?

*Martin Arjovsky and Leon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks", ICLR 2017.
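A PyTorch sketch of this practical loop (networks, data source, and hyperparameter values are my stand-ins; the weight-clipping bound and RMSprop choice follow the spirit of the Arjovsky et al. paper but are assumptions here):

```python
# Sketch: WGAN-style training with weight clipping for the Lipschitz constraint.
import torch
import torch.nn as nn

n, m, K, N_d, c = 16, 8, 32, 5, 0.01
h = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, n))  # generator
f = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, 1))  # critic, no sigmoid
opt_d = torch.optim.RMSprop(f.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(h.parameters(), lr=5e-5)

def get_minibatch():                                 # stand-in reference data
    return torch.randn(K, n) * 2.0 + 1.0

for it in range(1000):
    for _ in range(N_d):                             # iterate toward convergence
        y, y_fake = get_minibatch(), h(torch.randn(K, m)).detach()  # new batches
        d_loss = f(y_fake).mean() - f(y).mean()      # D_W = E[f(Y^)] - E[f(Y)]
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        with torch.no_grad():                        # approximate ||f||_L <= 1
            for w in f.parameters():
                w.clamp_(-c, c)

    g_loss = -f(h(torch.randn(K, m))).mean()         # G_W = E[-f(Y^)]
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```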
Conditional Generative Adversarial Network

- Generates samples from the conditional distribution of $Y$ given $X$.
- The discriminator takes $(y_k, x_k)$ input pairs for $k = 0, \cdots, K-1$.

Figure: the generator $\hat{y}_k = h_{\theta_g}(x_k, z_k)$ combines the conditioning input $x_k$ with the noise $z_k$ to produce generated samples $\hat{y}_k$; the discriminator scores the pairs as $\hat{p}_k = f_{\theta_d}(\hat{y}_k, x_k)$ with CrossEntropy$(\hat{p}_k, 0)$ (0 = generated) and $p_k = f_{\theta_d}(y_k, x_k)$ with CrossEntropy$(p_k, 1)$ (1 = reference), where the pairs $(x_k, y_k)$ are drawn from the reference distribution; the generator loss is CrossEntropy$(\hat{p}_k, 1)$.
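A minimal sketch of conditional generator and discriminator modules (hypothetical architectures, not from the slides), where the conditioning input $x$ enters both networks by concatenation:

```python
# Sketch: conditional GAN modules; y^ = h(x, z), p = f(y, x).
import torch
import torch.nn as nn

n, m, d = 16, 8, 4                       # y dim, z dim, x dim (assumed)

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + m, 64), nn.ReLU(),
                                 nn.Linear(64, n))
    def forward(self, x, z):             # y^_k = h_theta_g(x_k, z_k)
        return self.net(torch.cat([x, z], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n + d, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())
    def forward(self, y, x):             # p_k = f_theta_d(y_k, x_k)
        return self.net(torch.cat([y, x], dim=1))

h, f = CondGenerator(), CondDiscriminator()
x, z = torch.randn(32, d), torch.randn(32, m)
print(f(h(x, z), x).shape)               # torch.Size([32, 1])
```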