Generative Adversarial Network - 國立臺灣大學miulab/s107-adl/doc/190430_GAN.pdf · Generative...
Transcript of Generative Adversarial Network - 國立臺灣大學miulab/s107-adl/doc/190430_GAN.pdf · Generative...
Generative Adversarial Network and its Applications to Speech Processing
and Natural Language Processing
Hung-yi Lee
Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
GAN
ACGAN
BGAN
DCGAN
EBGAN
fGAN
GoGAN
CGAN
……
Outline
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Speech Processing
Part III: Applications to Natural Language Processing
Outline of Part 1
Generation by GAN
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Anime Face Generation
Draw
Generator
Examples
Basic Idea of GAN
Generator
It is a neural network (NN), or a function.
Generator
0.1−3⋮2.40.9
imagevector
Generator
3−3⋮2.40.9
Generator
0.12.1⋮5.40.9
Generator
0.1−3⋮2.43.5
high dimensional vector
Powered by: http://mattya.github.io/chainer-DCGAN/
Each dimension of input vector represents some characteristics.
Longer hair
blue hair Open mouth
Discri-minator
scalarimage
Basic Idea of GAN It is a neural network (NN), or a function.
Larger value means real, smaller value means fake.
Discri-minator
Discri-minator
Discri-minator1.0 1.0
0.1 Discri-minator
0.1
• Initialize generator and discriminator
• In each training iteration:
DG
sample
generated objects
G
Algorithm
D
Update
vector
vector
vector
vector
0000
1111
randomly sampled
Database
Step 1: Fix generator G, and update discriminator D
Discriminator learns to assign high scores to real objects and low scores to generated objects.
Fix
• Initialize generator and discriminator
• In each training iteration:
DG
Algorithm
Step 2: Fix discriminator D, and update generator G
Discri-minator
NNGenerator
vector
0.13
hidden layer
update fix
Gradient Ascent
large network
Generator learns to “fool” the discriminator
• Initialize generator and discriminator
• In each training iteration:
DG
Learning D
Sample some real objects:
Generate some fake objects:
G
Algorithm
D
Update
Learning G
G Dimage
1111
imageimage
image1
update fix
0000
vector
vector
vector
vector
vector
vector
vector
vector
fix
The faces generated by machine.
The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
0.00.0
G
0.90.9
G
0.10.1
G
0.20.2
G
0.30.3
G
0.40.4
G
0.50.5
G
0.60.6
G
0.70.7
G
0.80.8
G
Amazing Results! [Tero Karras, et al., ICLR, 2018]
Amazing Results! [Andrew Brock, et al., arXiv, 2018]
(Variational) Auto-encoder
As close as possible
NNEncoder
NNDecoder
cod
e
NNDecoder
cod
e
Randomly generate a vector as code Image ?
= Generator
= Generator
Auto-encoder v.s. GAN
As close as possibleNNDecoder
cod
e
Genera-tor
cod
e
= Generator
Discri-minator
If discriminator does not simply memorize the images,
Generator learns the patterns of faces.
Just copy an image
1
Auto-encoder
GAN
Fuzzy …
FID[Martin Heusel, et al., NIPS, 2017]: Smaller is better
[Mario Lucic, et al. arXiv, 2017]
Outline of Part 1
Generation
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Generator
• A generator G is a network. The network defines a probability distribution 𝑃𝐺
generator G𝑧 𝑥 = 𝐺 𝑧
Normal Distribution
𝑃𝐺(𝑥) 𝑃𝑑𝑎𝑡𝑎 𝑥
as close as possible
How to compute the divergence?
𝐺∗ = 𝑎𝑟𝑔min𝐺
𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎Divergence between distributions 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎
𝑥: an image (a high-dimensional vector)
Discriminator𝐺∗ = 𝑎𝑟𝑔min
𝐺𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎
Although we do not know the distributions of 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎, we can sample from them.
sample
G
vector
vector
vector
vector
sample from normal
Database
Sampling from 𝑷𝑮
Sampling from 𝑷𝒅𝒂𝒕𝒂
Discriminator 𝐺∗ = 𝑎𝑟𝑔min𝐺
𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎
Discriminator
: data sampled from 𝑃𝑑𝑎𝑡𝑎: data sampled from 𝑃𝐺
train
𝑉 𝐺,𝐷 = 𝐸𝑥∼𝑃𝑑𝑎𝑡𝑎 𝑙𝑜𝑔𝐷 𝑥 + 𝐸𝑥∼𝑃𝐺 𝑙𝑜𝑔 1 − 𝐷 𝑥
Example Objective Function for D
(G is fixed)
𝐷∗ = 𝑎𝑟𝑔max𝐷
𝑉 𝐷, 𝐺Training:
Using the example objective function is exactly the same as training a binary classifier.
[Goodfellow, et al., NIPS, 2014]
The maximum objective value is related to JS divergence.
Discriminator 𝐺∗ = 𝑎𝑟𝑔min𝐺
𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎
Discriminator
: data sampled from 𝑃𝑑𝑎𝑡𝑎: data sampled from 𝑃𝐺
train
hard to discriminatesmall divergence
Discriminatortrain
easy to discriminatelarge divergence
𝐷∗ = 𝑎𝑟𝑔max𝐷
𝑉 𝐷, 𝐺
Training:
Small max𝐷
𝑉 𝐷, 𝐺
𝐺∗ = 𝑎𝑟𝑔min𝐺
𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎max𝐷
𝑉 𝐺,𝐷
The maximum objective value is related to JS divergence.
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D
Step 2: Fix discriminator D, and update generator G
𝐷∗ = 𝑎𝑟𝑔max𝐷
𝑉 𝐷, 𝐺
[Goodfellow, et al., NIPS, 2014]
Using the divergence you like ☺[Sebastian Nowozin, et al., NIPS, 2016]
Can we use other divergence?
Outline of Part 1
Generation
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
More tips and tricks:https://github.com/soumith/ganhacks
GAN is hard to train ……
• There is a saying ……
(I found this joke from 陳柏文’s facebook.)
JS divergence is not suitable
• In most cases, 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 are not overlapped.
• 1. The nature of data
• 2. Sampling
Both 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 are low-dim manifold in high-dim space.
𝑃𝑑𝑎𝑡𝑎𝑃𝐺
The overlap can be ignored.
Even though 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺have overlap.
If you do not have enough sampling ……
𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1
𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎= 𝑙𝑜𝑔2
𝑃𝑑𝑎𝑡𝑎𝑃𝐺100
……
𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎= 𝑙𝑜𝑔2
𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎= 0
What is the problem of JS divergence?
……
JS divergence is log2 if two distributions do not overlap.
Intuition: If two distributions do not overlap, binary classifier achieves 100% accuracy
Equally bad
Same objective value is obtained. Same divergence
Wasserstein distance
• Considering one distribution P as a pile of earth, and another distribution Q as the target
• The average distance the earth mover has to move the earth.
𝑃 𝑄
d
𝑊 𝑃,𝑄 = 𝑑
Wasserstein distance
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
𝑃
𝑄
Using the “moving plan” with the smallest average distance to define the Wasserstein distance.
There are many possible “moving plans”.
Smaller distance?
Larger distance?
𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1
𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎= 𝑙𝑜𝑔2
𝑃𝑑𝑎𝑡𝑎𝑃𝐺100
……
𝐽𝑆 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎= 𝑙𝑜𝑔2
𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎= 0
What is the problem of JS divergence?
𝑊 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎= 𝑑0
𝑊 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎= 𝑑1
𝑊 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎= 0
𝑑0 𝑑1
……
……
Better!
WGAN
𝑉 𝐺,𝐷= max
𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧𝐸𝑥~𝑃𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸𝑥~𝑃𝐺 𝐷 𝑥
Evaluate wasserstein distance between 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺
[Martin Arjovsky, et al., arXiv, 2017]
How to fulfill this constraint?D has to be smooth enough.
real
−∞
generated
D
∞Without the constraint, the training of D will not converge.
Keeping the D smooth forces D(x) become ∞ and −∞
• Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017]
• Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017]
• Spectral Normalization → Keep gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018]
Force the parameters w between c and -c
After parameter update, if w > c, w = c; if w < -c, w = -c
Keep the gradient close to 1
𝑉 𝐺,𝐷 = max𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧
𝐸𝑥~𝑃𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸𝑥~𝑃𝐺 𝐷 𝑥
real
samples
Keep the gradient close to 1
[Kodali, et al., arXiv, 2017][Wei, et al., ICLR, 2018]
Energy-based GAN (EBGAN)
• Using an autoencoder as discriminator D
Discriminator
0 for the best images
Generator is the same.
-0.1
EN DE
Autoencoder
X -1 -0.1
[Junbo Zhao, et al., arXiv, 2016]
➢Using the negative reconstruction error of auto-encoder to determine the goodness
➢Benefit: The auto-encoder can be pre-train by real images without generator.
Tip: Improve Quality during Testing
generator G
Normal Distribution
generator G
Smaller Variance
本技巧由柯達方提供
Some samples are poor.
The output would be more stable, but sacrifice the diversity.
This tip is also used in [Andrew Brock, et al., arXiv, 2018]
Mode Collapse
: real data
: generated data
Training with too many iterations ……
Mode Dropping
BEGAN on CelebA
Generator at iteration t
Generator at iteration t+1
Generator at iteration t+2
Generator switches mode during training
Tip: Ensemble
Generator1
Generator2
……
……
Train a set of generators: 𝐺1, 𝐺2, ⋯ , 𝐺𝑁
Random pick a generator 𝐺𝑖, and then use 𝐺𝑖 to generate the image
To generate an image
Objective Evaluation
Off-the-shelf Image Classifier
𝑥 𝑃 𝑦|𝑥
Concentrated distribution means higher visual quality
CNN𝑥1 𝑃 𝑦1|𝑥1
Uniform distribution means higher variety
CNN𝑥2 𝑃 𝑦2|𝑥2
CNN𝑥3 𝑃 𝑦3|𝑥3
…
𝑃 𝑦 =1
𝑁
𝑛
𝑃 𝑦𝑛|𝑥𝑛
[Tim Salimans, et al., NIPS, 2016]
𝑥: image
𝑦: class (output of CNN)
e.g. Inception net, VGG, etc.
class 1
class 2
class 3
Objective Evaluation
=
𝑥
𝑦
𝑃 𝑦|𝑥 𝑙𝑜𝑔𝑃 𝑦|𝑥
−
𝑦
𝑃 𝑦 𝑙𝑜𝑔𝑃 𝑦
Negative entropy of P(y|x)
Entropy of P(y)
Inception Score
𝑃 𝑦 =1
𝑁
𝑛
𝑃 𝑦𝑛|𝑥𝑛𝑃 𝑦|𝑥
class 1
class 2
class 3
[Tim Salimans, et al., NIPS 2016]
Outline of Part 1
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
• Original Generator
• Conditional Generator
G
condition c
𝑧 𝑥 = 𝐺 𝑧
Normal Distribution G
𝑃𝐺 𝑥 → 𝑃𝑑𝑎𝑡𝑎 𝑥
𝑧
Normal Distribution 𝑥 = 𝐺 𝑐, 𝑧
𝑃𝐺 𝑥|𝑐 → 𝑃𝑑𝑎𝑡𝑎 𝑥|𝑐
[Mehdi Mirza, et al., arXiv, 2014]
NNGenerator
“Girl with red hair and red eyes”
“Girl with yellow ribbon”
e.g. Text-to-Image
Target of NN output
Text-to-Image
• Traditional supervised approach
NN Image
Text: “train”
a dog is running
a bird is flying
A blurry image!
c1: a dog is running
as close as possible
Conditional GAN
D (original)
scalar𝑥
G𝑧Normal distribution
x = G(c,z)c: train
x is real image or not
Image
Real images:
Generated images:
1
0
Generator will learn to generate realistic images ….
But completely ignore the input conditions.
[Scott Reed, et al, ICML, 2016]
Conditional GAN
D (better)
scalar𝑐
𝑥
True text-image pairs:
G𝑧Normal distribution
x = G(c,z)c: train
Image
x is realistic or not + c and x are matched or not
(train , )
(train , )(cat , )
[Scott Reed, et al, ICML, 2016]
1
00
x is realistic or not + c and x are matched or not
Conditional GAN - Discriminator
[Takeru Miyato, et al., ICLR, 2018]
[Han Zhang, et al., arXiv, 2017]
[Augustus Odena et al., ICML, 2017]
condition c
object x
NetworkNetwork
Network
score
Network
Network
(almost every paper)
condition c
object x
c and x are matched or not
x is realistic or not
Conditional GAN
paired data
blue eyesred hairshort hair
Collecting anime faces and the description of its characteristics
red hair,green eyes
blue hair,red eyes
The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
Conditional GAN - Image-to-image
G𝑧
x = G(c,z)𝑐
[Phillip Isola, et al., CVPR, 2017]
Image translation, or pix2pix
as close as possible
Conditional GAN - Image-to-image
• Traditional supervised approach
NN Image
It is blurry.
Testing:
input L1
e.g. L1
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Image-to-image
Testing:
input L1 GAN
G𝑧
Image D scalar
GAN + L1
L1
[Phillip Isola, et al., CVPR, 2017]
Conditional GAN - Video Generation
Generator
Discriminator
Last frame is real or generated
Discriminator thinks it is real
[Michael Mathieu, et al., arXiv, 2015]
https://github.com/dyelax/Adversarial_Video_Generation
Conditional GAN - Sound-to-image
Gc: sound Image
"a dog barking sound"
Training Data Collection
video
Conditional GAN - Sound-to-image• Audio-to-image
https://wjohn1483.github.io/audio_to_scene/index.html
The images are generated by Chia-Hung Wan and Shun-Po Chuang.
Louder
Conditional GAN - Image-to-label
Multi-label Image Classifier = Conditional Generator
Input condition
Generated output
Conditional GAN - Image-to-labelF1 MS-COCO NUS-WIDE
VGG-16 56.0 33.9
+ GAN 60.4 41.2
Inception 62.4 53.5
+GAN 63.8 55.8
Resnet-101 62.8 53.1
+GAN 64.0 55.4
Resnet-152 63.3 52.1
+GAN 63.9 54.1
Att-RNN 62.1 54.7
RLSD 62.0 46.9
The classifiers can have different architectures.
The classifiers are trained as conditional GAN.
[Tsai, et al., submitted to ICASSP 2019]
Conditional GAN - Image-to-labelF1 MS-COCO NUS-WIDE
VGG-16 56.0 33.9
+ GAN 60.4 41.2
Inception 62.4 53.5
+GAN 63.8 55.8
Resnet-101 62.8 53.1
+GAN 64.0 55.4
Resnet-152 63.3 52.1
+GAN 63.9 54.1
Att-RNN 62.1 54.7
RLSD 62.0 46.9
The classifiers can have different architectures.
The classifiers are trained as conditional GAN.
Conditional GAN outperforms other models designed for multi-label.
Domain Adversarial Training
• Training and testing data are in different domains
Training data:
Testing data:
Generator
Generator
The same distribution
feature
feature
Take digit classification as example
blue points
red points
Domain Adversarial Training
feature extractor (Generator)
Discriminator(Domain classifier)
image Which domain?
Always output zero vectors
Domain Classifier Fails
Domain Adversarial Training
feature extractor (Generator)
Discriminator(Domain classifier)
image
Label predictor
Which digits?
Not only cheat the domain classifier, but satisfying label predictor at the same time
More speech-related applications in Part II.
Successfully applied on image classification [Ganin et al, ICML, 2015][Ajakan et al. JMLR, 2016 ]
Which domain?
Outline of Part 1
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Unsupervised Conditional GAN
GObject in Domain X Object in Domain Y
Transform an object from one domain to another without paired data
Domain X Domain Y
photos
Condition Generated Object
Vincent van Gogh’s paintings
Not Paired
More Applications in Parts II and III
Use image style transfer as example here
Unsupervised Conditional Generation• Approach 1: Direct Transformation
• Approach 2: Projection to Common Space
?𝐺𝑋→𝑌
Domain X Domain Y
For texture or color change
𝐸𝑁𝑋 𝐷𝐸𝑌
Encoder of domain X
Decoder of domain Y
Larger change, only keep the semantics
Domain YDomain X FaceAttribute
?
Direct Transformation
𝐺𝑋→𝑌
Domain X
Domain Y
𝐷𝑌
Domain Y
Domain X
scalar
Input image belongs to domain Y or not
Become similar to domain Y
Direct Transformation
𝐺𝑋→𝑌
Domain X
Domain Y
𝐷𝑌
Domain Y
Domain X
scalar
Input image belongs to domain Y or not
Become similar to domain Y
Not what we want!
ignore input
Direct Transformation
𝐺𝑋→𝑌
Domain X
Domain Y
𝐷𝑌
Domain X
scalar
Input image belongs to domain Y or not
Become similar to domain Y
Not what we want!
ignore input
[Tomer Galanti, et al. ICLR, 2018]
The issue can be avoided by network design.
Simpler generator makes the input and output more closely related.
Direct Transformation
𝐺𝑋→𝑌
Domain X
Domain Y
𝐷𝑌
Domain X
scalar
Input image belongs to domain Y or not
Become similar to domain Y
EncoderNetwork
EncoderNetwork
pre-trained
as close as possible
Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]
Direct Transformation – Cycle GAN
𝐺𝑋→𝑌
𝐷𝑌
Domain Y
scalar
Input image belongs to domain Y or not
𝐺Y→X
as close as possible
Lack of information for reconstruction
[Jun-Yan Zhu, et al., ICCV, 2017]
Cycle consistency
Direct Transformation – Cycle GAN
𝐺𝑋→𝑌 𝐺Y→X
as close as possible
𝐺Y→X 𝐺𝑋→𝑌
as close as possible
𝐷𝑌𝐷𝑋scalar: belongs to domain Y or not
scalar: belongs to domain X or not
Cycle GAN
Dual GAN
Disco GAN
[Jun-Yan Zhu, et al., ICCV, 2017]
[Zili Yi, et al., ICCV, 2017]
[Taeksoo Kim, et
al., ICML, 2017]
For multiple domains, considering starGAN
[Yunjey Choi, arXiv, 2017]
Issue of Cycle Consistency
• CycleGAN: a Master of Steganography[Casey Chu, et al., NIPS workshop, 2017]
𝐺Y→X𝐺𝑋→𝑌
The information is hidden.
Unsupervised Conditional Generation• Approach 1: Direct Transformation
• Approach 2: Projection to Common Space
?𝐺𝑋→𝑌
Domain X Domain Y
For texture or color change
𝐸𝑁𝑋 𝐷𝐸𝑌
Encoder of domain X
Decoder of domain Y
Larger change, only keep the semantics
Domain YDomain X FaceAttribute
Domain X Domain Y
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋image
image
image
imageFaceAttribute
Projection to Common Space
Target
Domain X Domain Y
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋image
image
image
image
Minimizing reconstruction error
Projection to Common SpaceTraining
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋image
image
image
image
Minimizing reconstruction error
Because we train two auto-encoders separately …
The images with the same attribute may not project to the same position in the latent space.
𝐷𝑋
𝐷𝑌
Discriminator of X domain
Discriminator of Y domain
Minimizing reconstruction error
Projection to Common SpaceTraining
Sharing the parameters of encoders and decoders
Projection to Common SpaceTraining
𝐸𝑁𝑋
𝐸𝑁𝑌
𝐷𝐸𝑋
𝐷𝐸𝑌
Couple GAN[Ming-Yu Liu, et al., NIPS, 2016]
UNIT[Ming-Yu Liu, et al., NIPS, 2017]
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋image
image
image
image
Minimizing reconstruction error
The domain discriminator forces the output of 𝐸𝑁𝑋 and 𝐸𝑁𝑌 have the same distribution.
From 𝐸𝑁𝑋 or 𝐸𝑁𝑌
𝐷𝑋
𝐷𝑌
Discriminator of X domain
Discriminator of Y domain
Projection to Common SpaceTraining
DomainDiscriminator
𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator
[Guillaume Lample, et al., NIPS, 2017]
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋image
image
image
image
𝐷𝑋
𝐷𝑌
Discriminator of X domain
Discriminator of Y domain
Projection to Common SpaceTraining
Cycle Consistency:
Used in ComboGAN [Asha Anoosheh, et al., arXiv, 017]
Minimizing reconstruction error
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋image
image
image
image
𝐷𝑋
𝐷𝑌
Discriminator of X domain
Discriminator of Y domain
Projection to Common SpaceTraining
Semantic Consistency:
Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]
To the same latent space
Outline of Part 1
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Basic Components
EnvActorReward
Function
VideoGame
Go
Get 20 scores when killing a monster
The rule of GO
You cannot control
• Input of neural network: the observation of machine represented as a vector or a matrix
• Output neural network : each action corresponds to a neuron in output layer
……
NN as actor
pixelsfire
right
left
Score of an action
0.7
0.2
0.1
Take the action based on the probability.
Neural network as Actor
Actor, Environment, Reward
𝜏 = 𝑠1, 𝑎1, 𝑠2, 𝑎2, ⋯ , 𝑠𝑇 , 𝑎𝑇
Trajectory
Actor
𝑠1
𝑎1
Env
𝑠2
Env
𝑠1
𝑎1
Actor
𝑠2
𝑎2
Env
𝑠3
𝑎2
……
“right” “fire”
Reward Function → Discriminator
Reinforcement Learning v.s. GAN
Actor
𝑠1
𝑎1
Env
𝑠2
Env
𝑠1
𝑎1
Actor
𝑠2
𝑎2
Env
𝑠3
𝑎2
……
𝑅 𝜏 =
𝑡=1
𝑇
𝑟𝑡
Reward
𝑟1
Reward
𝑟2
“Black box”You cannot use backpropagation.
Actor → Generator Fixed
updatedupdated
Imitation Learning
We have demonstration of the expert.
Actor
𝑠1
𝑎1
Env
𝑠2
Env
𝑠1
𝑎1
Actor
𝑠2
𝑎2
Env
𝑠3
𝑎2
……
reward function is not available
Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁Each Ƹ𝜏 is a trajectory of the expert.
Self driving: record human drivers
Robot: grab the arm of robot
Inverse Reinforcement Learning
Reward Function
Environment
Optimal Actor
Inverse Reinforcement Learning
➢Using the reward function to find the optimal actor.
➢Modeling reward can be easier. Simple reward function can lead to complex policy.
Reinforcement Learning
Expert
demonstration of the expert
Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁
Framework of IRL
Expert ො𝜋
Actor 𝜋
ObtainReward Function R
Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁
𝜏1, 𝜏2, ⋯ , 𝜏𝑁
Find an actor based on reward function R
By Reinforcement learning
𝑛=1
𝑁
𝑅 Ƹ𝜏𝑛 >
𝑛=1
𝑁
𝑅 𝜏
Reward function → Discriminator
Actor → Generator
Reward Function R
The expert is always the best.
𝜏1, 𝜏2, ⋯ , 𝜏𝑁
GAN
IRL
G
D
High score for real, low score for generated
Find a G whose output obtains large score from D
Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁Expert
Actor
Reward Function
Larger reward for Ƹ𝜏𝑛,Lower reward for 𝜏
Find a Actor obtains large reward
Concluding Remarks
Generation
Conditional Generation
Unsupervised Conditional Generation
Relation to Reinforcement Learning
Reference
• Generation • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks, NIPS, 2014
• Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”, NIPS, 2016
• Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017
• Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved Training of Wasserstein GANs, NIPS, 2017
• Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial Network, arXiv, 2016
• Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017
• Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen Improved Techniques for Training GANs, NIPS, 2016
• Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS, 2017
Reference
• Generation • Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira, “On Convergence
and Stability of GANs”, arXiv, 2017
• Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, Liqiang Wang, Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect, ICLR, 2018
• Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, Spectral Normalization for Generative Adversarial Networks, ICLR, 2018
• Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing of GANs for Improved Quality, Stability, and Variation, ICLR, 2018
• Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for High Fidelity Natural Image Synthesis, arXiv, 2018
Reference
• Conditional Generation• Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt
Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML, 2016
• Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, CVPR, 2017
• Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video prediction beyond mean square error, arXiv, 2015
• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv, 2014
• Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator, ICLR, 2018
• Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, XiaoleiHuang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv, 2017
• Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image Synthesis With Auxiliary Classifier GANs, ICML, 2017
Reference
• Conditional Generation• Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by
Backpropagation, ICML, 2015
• Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016
• Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A Novel Framework for Multi-class Classification, submitted to ICASSP 2019
Reference
• Unsupervised Conditional Generation• Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-
Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017
• Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual Learning for Image-to-Image Translation, ICCV, 2017
• Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018
• Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image Generation, ICLR, 2017
• Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN: Unrestrained Scalability for Image Domain Translation, arXiv, 2017
• Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings, arXiv, 2017
Reference
• Unsupervised Conditional Generation• Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic
Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by Sliding Attributes, NIPS, 2017
• Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim, Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, ICML, 2017
• Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS, 2016
• Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image Translation Networks, NIPS, 2017
• Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation, arXiv, 2017
Outline
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Natural Language Processing
Part III: Applications to Speech Processing
Unsupervised Conditional Generation
Image Style Transfer
Text Style Transfer
It is good.
It’s a good day.I love you.
It is bad.
It’s a bad day.
I don’t love you.
positive negative
photos Vincent van Gogh’s paintings
Not Paired
Not Paired
Cycle GAN
𝐺𝑋→𝑌 𝐺Y→X
as close as possible
𝐺Y→X 𝐺𝑋→𝑌
as close as possible
𝐷𝑌𝐷𝑋scalar: belongs to domain Y or not
scalar: belongs to domain X or not
Cycle GAN
𝐺𝑋→𝑌 𝐺Y→X
as close as possible
𝐺Y→X 𝐺𝑋→𝑌
as close as possible
𝐷𝑌𝐷𝑋negative sentence? positive sentence?
It is bad. It is good. It is bad.
I love you. I hate you. I love you.positive
positive
positivenegative
negative negative
Discrete Issue
𝐺𝑋→𝑌
𝐷𝑌 positive sentence?
It is bad. It is good.positivenegative
large network
hidden layer
update
fix
Backpropagation
with discrete output
Seq2seq model
Three Categories of Solutions
Gumbel-softmax
• [Matt J. Kusner, et al, arXiv, 2016]
Continuous Input for Discriminator
• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017]
“Reinforcement Learning”
• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
Cycle GAN
𝐺𝑋→𝑌 𝐺Y→X
as close as possible
𝐺Y→X 𝐺𝑋→𝑌
as close as possible
𝐷𝑌𝐷𝑋negative sentence? positive sentence?
It is bad. It is good. It is bad.
I love you. I hate you. I love you.positive
positive
positivenegative
negative negative
Word embedding[Lee, et al., ICASSP, 2018]
Discrete?
Cycle GAN
• Negative sentence to positive sentence:it's a crappy day → it's a great dayi wish you could be here → you could be hereit's not a good idea → it's good ideai miss you → i love youi don't love you → i love youi can't do that → i can do thati feel so sad → i happyit's a bad day → it's a good dayit's a dummy day → it's a great daysorry for doing such a horrible thing → thanks for doing a great thingmy doggy is sick → my doggy is my doggymy little doggy is sick → my little doggy is my little doggy
Thinks Yau-Shian Wang for providing the results.
✘ Negative sentence to positive sentence:
Cycle GAN
感謝張瓊之同學提供實驗結果
胃疼 , 沒睡醒 , 各種不舒服 -> 生日快樂 , 睡醒 , 超級舒服
我都想去上班了, 真夠賤的! -> 我都想去睡了, 真帥的 !
暈死了, 吃燒烤、竟然遇到個變態狂 -> 哈哈好 ~ , 吃燒烤 ~ 竟然遇到帥狂
我肚子痛的厲害 -> 我生日快樂厲害
感冒了, 難受的說不出話來了 ! -> 感冒了, 開心的說不出話來 !
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋 𝐷𝑋
𝐷𝑌
Discriminator of X domain
Discriminator of Y domain
Projection to Common Space
PositiveSentence
PositiveSentence
NegativeSentence
NegativeSentence
Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017]
From 𝐸𝑁𝑋 or 𝐸𝑁𝑌Domain
Discriminator
𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator
[Zhao, et al., ICML 2018]
[Fu, et al., AAAI, 2018]
Unsupervised Conditional Generation
Image Style Transfer
Text Style Transfer
photos Vincent van Gogh’s paintings
Not Paired
Not Paired
summarydocument
This is unsupervised abstractive summarization.
Abstractive Summarization
• Now machine can do abstractive summary by seq2seq (write summaries in its own words)
summary 1
summary 2
summary 3
Training Data
summary
seq2seq(in its own words)
Supervised: We need lots of labelled training data.
Unsupervised Abstractive Summarization• Now machine can do abstractive summary by
seq2seq (write summaries in its own words)
summary 1
summary 2
summary 3
seq2seq document
Domain X Domain Y
G
Seq2seq
documentword
sequence
D
Human written summaries Real or not
Discriminator
Unsupervised Abstractive Summarization
Summary?
G
Seq2seq
documentword
sequence
D
Human written summaries Real or not
Discriminator
R
Seq2seq
document
Unsupervised Abstractive Summarization
minimize the reconstruction error
Unsupervised Abstractive Summarization
G RSummary?
Seq2seq Seq2seq
document documentword
sequence
Only need a lot of documents to train the model
This is a seq2seq2seq auto-encoder.
Using a sequence of words as latent representation.
not readable …
Unsupervised Abstractive Summarization
G R
Seq2seq Seq2seq
word sequence
D
Human written summaries Real or not
DiscriminatorLet Discriminator considers
my output as real
document document
Summary?
Readable
REINFORCE algorithm to deal with the discrete issue
Experimental results
ROUGE-1 ROUGE-2 ROUGE-L
Supervised 33.2 14.2 30.5
Trivial 21.9 7.7 20.5
Unsupervised(matched data)
28.1 10.0 25.4
Unsupervised(no matched data)
27.2 9.1 24.1
English Gigaword (Document title as summary)
• Matched data: using the title of English Gigaword to train Discriminator
• No matched data: using the title of CNN/Diary Mail to train Discriminator
Semi-supervised Learning
25
26
27
28
29
30
31
32
33
34
0 10k 500k
RO
UG
E-1
Number of document-summary pairs used
WGAN Reinforce Supervised
Using matched data
3.8M pairs are used.Approaches to deal with the discrete issue.
unsupervised
semi-supervised
Unsupervised Abstractive Summarization• Document:澳大利亞今天與13個國家簽署了反興奮劑雙邊協議,旨在加強體育競賽之外的藥品檢查並共享研究成果 ……
• Summary:
• Human:澳大利亞與13國簽署反興奮劑協議
• Unsupervised:澳大利亞加強體育競賽之外的藥品檢查
• Document:中華民國奧林匹克委員會今天接到一九九二年冬季奧運會邀請函,由於主席張豐緒目前正在中南美洲進行友好訪問,因此尚未決定是否派隊赴賽 ……
• Summary:
• Human:一九九二年冬季奧運會函邀我參加
• Unsupervised:奧委會接獲冬季奧運會邀請函
感謝王耀賢同學提供實驗結果
Unsupervised Abstractive Summarization• Document:據此間媒體27日報道,印度尼西亞蘇門答臘島的兩個省近日來連降暴雨,洪水泛濫導致塌方,到26日為止至少已有60人喪生,100多人失蹤 ……
• Summary:
• Human:印尼水災造成60人死亡
• Unsupervised:印尼門洪水泛濫導致塌雨
• Document:安徽省合肥市最近為領導幹部下基層做了新規定:一律輕車簡從,不準搞迎來送往、不準搞層層陪同 ……
• Summary:
• Human:合肥規定領導幹部下基層活動從簡
• Unsupervised:合肥領導幹部下基層做搞迎來送往規定:一律簡
感謝王耀賢同學提供實驗結果
Outline
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Natural Language Processing
Part III: Applications to Speech Processing
Unsupervised Conditional Generation
Image Style Transfer
Speech Style Transfer
photos Vincent van Gogh’s paintings
Not Paired
Not Paired
This is unsupervised voice conversion.
Speaker A Speaker B
Voice Conversion
Voice Conversion
• The same sentence has different impact when it is said by different people.
Do you want to study a PhD?
Go away! Student
Student
Do you want to study a PhD?新垣結衣
(Aragaki Yui)
In the past
With GAN
Speaker A Speaker B
How are you? How are you?
Good morning Good morning
Speaker A Speaker B
天氣真好 How are you?
再見囉 Good morning
Speakers A and B are talking about completely different things.
Cycle GAN
𝐺𝑋→𝑌 𝐺Y→X
as close as possible
𝐺Y→X 𝐺𝑋→𝑌
as close as possible
𝐷𝑌𝐷𝑋scalar: belongs to domain Y or not
scalar: belongs to domain X or not
Cycle GANfor Voice Conversion
𝐺𝑋→𝑌 𝐺Y→X
as close as possible
𝐺Y→X 𝐺𝑋→𝑌
as close as possible
𝐷𝑌𝐷𝑋scalar: belongs to domain Y or not
scalar: belongs to domain X or not
X: Speaker A, Y: Speaker B[Takuhiro Kaneko, et. al, arXiv, 2017][Fuming Fang, et. al, ICASSP, 2018][Yang Gao, et. al, ICASSP, 2018]
spectrogram spectrogram
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋
Projection to Common Space
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋
Projection to Common Space
• All the speakers share the same encoder.
• The model can deal with the speakers never seen during training.
Encoder
Encoder
𝐸𝑁𝑋
𝐸𝑁𝑌 𝐷𝐸𝑌
𝐷𝐸𝑋
Projection to Common Space
All the speakers also share the same decoder.
Encoder
Encoder Decoder
Decoder
Which speakers?Discriminator
We hope that encoder can extract the phonetic information while removing the speaker information.
The encoder fools the discriminator.
Use a vector (one-hot) to represent speaker identity.
Encoder
Decoder
reconstructed
Training
Testing
Decoder
A is reading the sentence of B
Projection to Common Space
A How are you?
A
Which speakers?
Discriminator
B Hello
AHello
AEncoder
phonetic information
EncoderHow are you?
Which speakers?
Discriminator
Does it contain phonetic information?
Different colors: different words
Different colors: different speakers
“Audio” Word to Vector
Encoder
Decoder
reconstructed
Training
Testing
Decoder
A is reading the sentence of B
Issues
A How are you?
A
Which speakers?
Discriminator
B Hello
AHello
AEncoder
Different Speakers
The Same Speakers
Low Quality
2nd Stage Training
Discriminatorreal or generated?
which speaker?
Cheat discriminator
Speaker
Classifier
Help speaker classifier
DecoderB Hello
A
AEncoder
Extra Criterion for Training
Hello
Different SpeakersNo learning target???
• Subjective evaluations(20 speakers in VCTK)
Experimental Results[Chou et al., INTERSPEECH, 2018]
“Two stages” is better
“One stage” is better
Indistinguishable
“Projection” is better
“Cycle GAN” is better
Indistinguishable
Demo
Target:
Source:
Source to Target:
https://jjery2243542.github.io/voice_conversion_demo/
Thanks Ju-chieh Chou for providing the results.
Decoder
A is reading the sentence of B
B Hello
AHello
AEncoder
Source SpeakerTarget
Speaker
Source Speaker
Target Speaker
Me
https://jjery2243542.github.io/voice_conversion_demo/
(Never seen during training!)
Me
Me
Source to Target
Thanks Ju-chieh Chou for providing the results.
Me
Unsupervised Conditional Generation
AudioNot Paired
This is unsupervised speech recognition.
Text
Supervised Speech Recognition
https://devopedia.org/images/article/102/9180.1532710057.png
(I believe you have seen similar figures before.)
• Supervised learning needs lots of annotated speech.• However, most of the languages are low resourced.
Speech Recognition in the Future
http://www.parenting.com/article/teach-baby-to-talk
Learning human language with very little supervision
Unsupervised Speech Recognition
• Machine learns to recognize speech from unparallel speech and text.
audio collection(without text annotation)
text documents(not parallel to audio)
This idea was too crazy to be realized in the past.
However, it becomes possible with GAN recently.
[Liu, et al., INTERSPEECH, 2018]
[Chen, et al., arXiv, 2018]
Acoustic Token Discovery
Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09]
[Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]
Acoustic tokens can be discovered from audio collection without text annotation.
Acoustic Token Discovery
Token 1
Token 1
Token 1Token 2
Token 3
Token 3
Token 3
Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09]
[Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]
Acoustic tokens can be discovered from audio collection without text annotation.
Token 2
Token 4
Acoustic Token Discovery
Phonetic-level acoustic tokens are obtained by segmental sequence-to-sequence autoencoder.
[Wang, et al., ICASSP, 2018]
Unsupervised Speech Recognition
AY L AH V Y UW
G UH D B AY
HH AW AA R Y UW
T AY W AA N
AY M F AY NCycleGAN
“AY”=
Phone-level Acoustic Pattern Discovery
p1
p1 p3 p2
p1 p4 p3 p5 p5
p1 p5 p4 p3
p1 p2 p3 p4
Phoneme sequences from Text
[Liu, et al., INTERSPEECH,
2018]
[Chen, et al., arXiv, 2018]
The progress of supervised learning
Acc
ura
cy
Unsupervised learning today (2019) is as good as supervised learning 30 years ago.
The image is modified from: Phone recognition on the TIMIT database Lopes, C. and Perdigão, F., 2011. Speech Technologies, Vol 1, pp. 285--302.
Concluding Remarks
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Natural Language Processing
Part III: Applications to Speech Processing
To Learn More …
https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf
You can learn more from the YouTube Channel
(in Mandarin)