Generative Adversarial Network - 國立臺灣大學miulab/s107-adl/doc/190430_GAN.pdf · Generative...

Generative Adversarial Network and its Applications to Speech Processing

and Natural Language Processing

Hung-yi Lee

Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017

All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo

GAN

ACGAN

BGAN

DCGAN

EBGAN

fGAN

GoGAN

CGAN

……

Outline

Part I: General Introduction of Generative Adversarial Network (GAN)

Part II: Applications to Speech Processing

Part III: Applications to Natural Language Processing

Outline of Part 1

Generation by GAN

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Outline of Part 1

Generation by GAN

• Image Generation as Example

• Theory behind GAN

• Issues and Possible Solutions




Anime Face Generation

Draw

Generator

Examples

Basic Idea of GAN

Generator

It is a neural network (NN), or a function.

Generator

0.1−3⋮2.40.9

imagevector

Generator

3−3⋮2.40.9

Generator

0.12.1⋮5.40.9

Generator

0.1−3⋮2.43.5

high dimensional vector

Powered by: http://mattya.github.io/chainer-DCGAN/

Each dimension of input vector represents some characteristics.

Longer hair

blue hair Open mouth

Discri-minator

scalarimage

Basic Idea of GAN It is a neural network (NN), or a function.

Larger value means real, smaller value means fake.

Discri-minator

Discri-minator

Discri-minator1.0 1.0

0.1 Discri-minator

0.1

• Initialize generator and discriminator

• In each training iteration:

DG

sample

generated objects

G

Algorithm

D

Update

vector

vector

vector

vector

0000

1111

randomly sampled

Database

Step 1: Fix generator G, and update discriminator D

Discriminator learns to assign high scores to real objects and low scores to generated objects.

Fix



DG

Algorithm

Step 2: Fix discriminator D, and update generator G

Discri-minator

NNGenerator

vector

0.13

hidden layer

update fix

Gradient Ascent

large network

Generator learns to “fool” the discriminator



DG

Learning D

Sample some real objects:

Generate some fake objects:

G

Algorithm

D

Update

Learning G

G Dimage

1111

imageimage

image1

update fix

0000

vector

vector

vector

vector

vector

vector

vector

vector

fix

The faces generated by machine.

The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.

0.00.0

G

0.90.9

G

0.10.1

G

0.20.2

G

0.30.3

G

0.40.4

G

0.50.5

G

0.60.6

G

0.70.7

G

0.80.8

G

Amazing Results! [Tero Karras, et al., ICLR, 2018]

Amazing Results! [Andrew Brock, et al., arXiv, 2018]

(Variational) Auto-encoder

As close as possible

NNEncoder

NNDecoder

cod

e

NNDecoder

cod

e

Randomly generate a vector as code Image ?

= Generator

= Generator

Auto-encoder v.s. GAN

As close as possibleNNDecoder

cod

e

Genera-tor

cod

e

= Generator

Discri-minator

If discriminator does not simply memorize the images,

Generator learns the patterns of faces.

Just copy an image

1

Auto-encoder

GAN

Fuzzy …

FID[Martin Heusel, et al., NIPS, 2017]: Smaller is better

[Mario Lucic, et al. arXiv, 2017]

Outline of Part 1

Generation







Generator

• A generator G is a network. The network defines a probability distribution 𝑃𝐺

generator G𝑧 𝑥 = 𝐺 𝑧

Normal Distribution

𝑃𝐺(𝑥) 𝑃𝑑𝑎𝑡𝑎 𝑥

as close as possible

How to compute the divergence?

𝐺∗ = 𝑎𝑟𝑔min𝐺

𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎Divergence between distributions 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎

𝑥: an image (a high-dimensional vector)

Discriminator𝐺∗ = 𝑎𝑟𝑔min

𝐺𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎

Although we do not know the distributions of 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎, we can sample from them.

sample

G

vector

vector

vector

vector

sample from normal

Database

Sampling from 𝑷𝑮

Sampling from 𝑷𝒅𝒂𝒕𝒂

Discriminator 𝐺∗ = 𝑎𝑟𝑔min𝐺

𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎

Discriminator

: data sampled from 𝑃𝑑𝑎𝑡𝑎: data sampled from 𝑃𝐺

train

𝑉 𝐺,𝐷 = 𝐸𝑥∼𝑃𝑑𝑎𝑡𝑎 𝑙𝑜𝑔𝐷 𝑥 + 𝐸𝑥∼𝑃𝐺 𝑙𝑜𝑔 1 − 𝐷 𝑥

Example Objective Function for D

(G is fixed)

𝐷∗ = 𝑎𝑟𝑔max𝐷

𝑉 𝐷, 𝐺Training:

Using the example objective function is exactly the same as training a binary classifier.

[Goodfellow, et al., NIPS, 2014]

The maximum objective value is related to JS divergence.

Discriminator 𝐺∗ = 𝑎𝑟𝑔min𝐺

𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎

Discriminator

: data sampled from 𝑃𝑑𝑎𝑡𝑎: data sampled from 𝑃𝐺

train

hard to discriminatesmall divergence

Discriminatortrain

easy to discriminatelarge divergence


𝑉 𝐷, 𝐺

Training:

Small max𝐷

𝑉 𝐷, 𝐺

𝐺∗ = 𝑎𝑟𝑔min𝐺

𝐷𝑖𝑣 𝑃𝐺 , 𝑃𝑑𝑎𝑡𝑎max𝐷

𝑉 𝐺,𝐷

The maximum objective value is related to JS divergence.



Step 1: Fix generator G, and update discriminator D

Step 2: Fix discriminator D, and update generator G


𝑉 𝐷, 𝐺

[Goodfellow, et al., NIPS, 2014]

Using the divergence you like ☺[Sebastian Nowozin, et al., NIPS, 2016]

Can we use other divergence?

Outline of Part 1

Generation







More tips and tricks:https://github.com/soumith/ganhacks

GAN is hard to train ……

• There is a saying ……

(I found this joke from 陳柏文’s facebook.)

JS divergence is not suitable

• In most cases, 𝑃𝐺 and 𝑃𝑑𝑎𝑡𝑎 are not overlapped.

• 1. The nature of data

• 2. Sampling

Both 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺 are low-dim manifold in high-dim space.

𝑃𝑑𝑎𝑡𝑎𝑃𝐺

The overlap can be ignored.

Even though 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺have overlap.

If you do not have enough sampling ……

𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1

𝐽𝑆 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎= 𝑙𝑜𝑔2

𝑃𝑑𝑎𝑡𝑎𝑃𝐺100

……


𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎= 0

What is the problem of JS divergence?

……

JS divergence is log2 if two distributions do not overlap.

Intuition: If two distributions do not overlap, binary classifier achieves 100% accuracy

Equally bad

Same objective value is obtained. Same divergence

Wasserstein distance

• Considering one distribution P as a pile of earth, and another distribution Q as the target

• The average distance the earth mover has to move the earth.

𝑃 𝑄

d

𝑊 𝑃,𝑄 = 𝑑

Wasserstein distance

Source of image: https://vincentherrmann.github.io/blog/wasserstein/

𝑃

𝑄

Using the “moving plan” with the smallest average distance to define the Wasserstein distance.

There are many possible “moving plans”.

Smaller distance?

Larger distance?

𝑃𝑑𝑎𝑡𝑎𝑃𝐺0 𝑃𝑑𝑎𝑡𝑎𝑃𝐺1


𝑃𝑑𝑎𝑡𝑎𝑃𝐺100

……


𝐽𝑆 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎= 0

What is the problem of JS divergence?

𝑊 𝑃𝐺0 , 𝑃𝑑𝑎𝑡𝑎= 𝑑0

𝑊 𝑃𝐺1 , 𝑃𝑑𝑎𝑡𝑎= 𝑑1

𝑊 𝑃𝐺100 , 𝑃𝑑𝑎𝑡𝑎= 0

𝑑0 𝑑1

……

……

Better!

WGAN

𝑉 𝐺,𝐷= max

𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧𝐸𝑥~𝑃𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸𝑥~𝑃𝐺 𝐷 𝑥

Evaluate wasserstein distance between 𝑃𝑑𝑎𝑡𝑎 and 𝑃𝐺

[Martin Arjovsky, et al., arXiv, 2017]

How to fulfill this constraint?D has to be smooth enough.

real

−∞

generated

D

∞Without the constraint, the training of D will not converge.

Keeping the D smooth forces D(x) become ∞ and −∞

• Original WGAN → Weight Clipping [Martin Arjovsky, et al., arXiv, 2017]

• Improved WGAN → Gradient Penalty [Ishaan Gulrajani, NIPS, 2017]

• Spectral Normalization → Keep gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018]

Force the parameters w between c and -c

After parameter update, if w > c, w = c; if w < -c, w = -c

Keep the gradient close to 1

𝑉 𝐺,𝐷 = max𝐷∈1−𝐿𝑖𝑝𝑠𝑐ℎ𝑖𝑡𝑧

𝐸𝑥~𝑃𝑑𝑎𝑡𝑎 𝐷 𝑥 − 𝐸𝑥~𝑃𝐺 𝐷 𝑥

real

samples

Keep the gradient close to 1

[Kodali, et al., arXiv, 2017][Wei, et al., ICLR, 2018]

Energy-based GAN (EBGAN)

• Using an autoencoder as discriminator D

Discriminator

0 for the best images

Generator is the same.

-0.1

EN DE

Autoencoder

X -1 -0.1

[Junbo Zhao, et al., arXiv, 2016]

➢Using the negative reconstruction error of auto-encoder to determine the goodness

➢Benefit: The auto-encoder can be pre-train by real images without generator.

Tip: Improve Quality during Testing

generator G

Normal Distribution

generator G

Smaller Variance

本技巧由柯達方提供

Some samples are poor.

The output would be more stable, but sacrifice the diversity.

This tip is also used in [Andrew Brock, et al., arXiv, 2018]

Mode Collapse

: real data

: generated data

Training with too many iterations ……

Mode Dropping

BEGAN on CelebA

Generator at iteration t

Generator at iteration t+1

Generator at iteration t+2

Generator switches mode during training

Tip: Ensemble

Generator1

Generator2

……

……

Train a set of generators: 𝐺1, 𝐺2, ⋯ , 𝐺𝑁

Random pick a generator 𝐺𝑖, and then use 𝐺𝑖 to generate the image

To generate an image

Objective Evaluation

Off-the-shelf Image Classifier

𝑥 𝑃 𝑦|𝑥

Concentrated distribution means higher visual quality

CNN𝑥1 𝑃 𝑦1|𝑥1

Uniform distribution means higher variety



…

𝑃 𝑦 =1

𝑁

𝑛

𝑃 𝑦𝑛|𝑥𝑛

[Tim Salimans, et al., NIPS, 2016]

𝑥: image

𝑦: class (output of CNN)

e.g. Inception net, VGG, etc.

class 1

class 2

class 3

Outline of Part 1

Generation




• Original Generator

• Conditional Generator

G

condition c

𝑧 𝑥 = 𝐺 𝑧

Normal Distribution G

𝑃𝐺 𝑥 → 𝑃𝑑𝑎𝑡𝑎 𝑥

𝑧

Normal Distribution 𝑥 = 𝐺 𝑐, 𝑧

𝑃𝐺 𝑥|𝑐 → 𝑃𝑑𝑎𝑡𝑎 𝑥|𝑐

[Mehdi Mirza, et al., arXiv, 2014]

NNGenerator

“Girl with red hair and red eyes”

“Girl with yellow ribbon”

e.g. Text-to-Image

Target of NN output

Text-to-Image

• Traditional supervised approach

NN Image

Text: “train”

a dog is running

a bird is flying

A blurry image!

c1: a dog is running


Conditional GAN

D (original)

scalar𝑥

G𝑧Normal distribution

x = G(c,z)c: train

x is real image or not

Image

Real images:

Generated images:

1

0

Generator will learn to generate realistic images ….

But completely ignore the input conditions.

[Scott Reed, et al, ICML, 2016]

Conditional GAN

D (better)

scalar𝑐

𝑥

True text-image pairs:

G𝑧Normal distribution

x = G(c,z)c: train

Image

x is realistic or not + c and x are matched or not

(train , )

(train , )(cat , )

[Scott Reed, et al, ICML, 2016]

1

00

x is realistic or not + c and x are matched or not

Conditional GAN - Discriminator

[Takeru Miyato, et al., ICLR, 2018]

[Han Zhang, et al., arXiv, 2017]

[Augustus Odena et al., ICML, 2017]

condition c

object x

NetworkNetwork

Network

score

Network

Network

(almost every paper)

condition c

object x

c and x are matched or not

x is realistic or not

Conditional GAN

paired data

blue eyesred hairshort hair

Collecting anime faces and the description of its characteristics

red hair,green eyes

blue hair,red eyes

The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.

Conditional GAN - Image-to-image

G𝑧

x = G(c,z)𝑐

[Phillip Isola, et al., CVPR, 2017]

Image translation, or pix2pix



• Traditional supervised approach

NN Image

It is blurry.

Testing:

input L1

e.g. L1



Testing:

input L1 GAN

G𝑧

Image D scalar

GAN + L1

L1


Conditional GAN - Video Generation

Generator

Discriminator

Last frame is real or generated

Discriminator thinks it is real

[Michael Mathieu, et al., arXiv, 2015]

https://github.com/dyelax/Adversarial_Video_Generation

Conditional GAN - Sound-to-image

Gc: sound Image

"a dog barking sound"

Training Data Collection

video

Conditional GAN - Sound-to-image• Audio-to-image

https://wjohn1483.github.io/audio_to_scene/index.html

The images are generated by Chia-Hung Wan and Shun-Po Chuang.

Louder

Conditional GAN - Image-to-label

Multi-label Image Classifier = Conditional Generator

Input condition

Generated output

Conditional GAN - Image-to-labelF1 MS-COCO NUS-WIDE

VGG-16 56.0 33.9

+ GAN 60.4 41.2

Inception 62.4 53.5

+GAN 63.8 55.8

Resnet-101 62.8 53.1

+GAN 64.0 55.4

Resnet-152 63.3 52.1

+GAN 63.9 54.1

Att-RNN 62.1 54.7

RLSD 62.0 46.9

The classifiers can have different architectures.

The classifiers are trained as conditional GAN.

[Tsai, et al., submitted to ICASSP 2019]

Conditional GAN - Image-to-labelF1 MS-COCO NUS-WIDE

VGG-16 56.0 33.9

+ GAN 60.4 41.2

Inception 62.4 53.5

+GAN 63.8 55.8

Resnet-101 62.8 53.1

+GAN 64.0 55.4

Resnet-152 63.3 52.1

+GAN 63.9 54.1

Att-RNN 62.1 54.7

RLSD 62.0 46.9

The classifiers can have different architectures.

The classifiers are trained as conditional GAN.

Conditional GAN outperforms other models designed for multi-label.

Domain Adversarial Training

• Training and testing data are in different domains

Training data:

Testing data:

Generator

Generator

The same distribution

feature

feature

Take digit classification as example

blue points

red points


feature extractor (Generator)

Discriminator(Domain classifier)

image Which domain?

Always output zero vectors

Domain Classifier Fails


feature extractor (Generator)

Discriminator(Domain classifier)

image

Label predictor

Which digits?

Not only cheat the domain classifier, but satisfying label predictor at the same time

More speech-related applications in Part II.

Successfully applied on image classification [Ganin et al, ICML, 2015][Ajakan et al. JMLR, 2016 ]

Which domain?

Outline of Part 1

Generation




Unsupervised Conditional GAN

GObject in Domain X Object in Domain Y

Transform an object from one domain to another without paired data

Domain X Domain Y

photos

Condition Generated Object

Vincent van Gogh’s paintings

Not Paired

More Applications in Parts II and III

Use image style transfer as example here

Unsupervised Conditional Generation• Approach 1: Direct Transformation

• Approach 2: Projection to Common Space

?𝐺𝑋→𝑌

Domain X Domain Y

For texture or color change

𝐸𝑁𝑋 𝐷𝐸𝑌

Encoder of domain X

Decoder of domain Y

Larger change, only keep the semantics

Domain YDomain X FaceAttribute

?

Direct Transformation

𝐺𝑋→𝑌

Domain X

Domain Y

𝐷𝑌

Domain Y

Domain X

scalar

Input image belongs to domain Y or not

Become similar to domain Y


𝐺𝑋→𝑌

Domain X

Domain Y

𝐷𝑌

Domain Y

Domain X

scalar



Not what we want!

ignore input


𝐺𝑋→𝑌

Domain X

Domain Y

𝐷𝑌

Domain X

scalar



Not what we want!

ignore input

[Tomer Galanti, et al. ICLR, 2018]

The issue can be avoided by network design.

Simpler generator makes the input and output more closely related.


𝐺𝑋→𝑌

Domain X

Domain Y

𝐷𝑌

Domain X

scalar



EncoderNetwork

EncoderNetwork

pre-trained


Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]

Direct Transformation – Cycle GAN

𝐺𝑋→𝑌

𝐷𝑌

Domain Y

scalar


𝐺Y→X


Lack of information for reconstruction

[Jun-Yan Zhu, et al., ICCV, 2017]

Cycle consistency

Direct Transformation – Cycle GAN

𝐺𝑋→𝑌 𝐺Y→X


𝐺Y→X 𝐺𝑋→𝑌


𝐷𝑌𝐷𝑋scalar: belongs to domain Y or not

scalar: belongs to domain X or not

Cycle GAN

Dual GAN

Disco GAN

[Jun-Yan Zhu, et al., ICCV, 2017]

[Zili Yi, et al., ICCV, 2017]

[Taeksoo Kim, et

al., ICML, 2017]

For multiple domains, considering starGAN

[Yunjey Choi, arXiv, 2017]

Issue of Cycle Consistency

• CycleGAN: a Master of Steganography[Casey Chu, et al., NIPS workshop, 2017]

𝐺Y→X𝐺𝑋→𝑌

The information is hidden.

Unsupervised Conditional Generation• Approach 1: Direct Transformation

• Approach 2: Projection to Common Space

?𝐺𝑋→𝑌

Domain X Domain Y

For texture or color change

𝐸𝑁𝑋 𝐷𝐸𝑌

Encoder of domain X

Decoder of domain Y

Larger change, only keep the semantics

Domain YDomain X FaceAttribute

Domain X Domain Y

𝐸𝑁𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

𝐷𝐸𝑋image

image

image

imageFaceAttribute

Projection to Common Space

Target

Domain X Domain Y

𝐸𝑁𝑋


𝐷𝐸𝑋image

image

image

image

Minimizing reconstruction error

Projection to Common SpaceTraining

𝐸𝑁𝑋


𝐷𝐸𝑋image

image

image

image


Because we train two auto-encoders separately …

The images with the same attribute may not project to the same position in the latent space.

𝐷𝑋

𝐷𝑌

Discriminator of X domain

Discriminator of Y domain



Sharing the parameters of encoders and decoders


𝐸𝑁𝑋

𝐸𝑁𝑌

𝐷𝐸𝑋

𝐷𝐸𝑌

Couple GAN[Ming-Yu Liu, et al., NIPS, 2016]

UNIT[Ming-Yu Liu, et al., NIPS, 2017]

𝐸𝑁𝑋


𝐷𝐸𝑋image

image

image

image


The domain discriminator forces the output of 𝐸𝑁𝑋 and 𝐸𝑁𝑌 have the same distribution.

From 𝐸𝑁𝑋 or 𝐸𝑁𝑌

𝐷𝑋

𝐷𝑌




DomainDiscriminator

𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator

[Guillaume Lample, et al., NIPS, 2017]

𝐸𝑁𝑋


𝐷𝐸𝑋image

image

image

image

𝐷𝑋

𝐷𝑌




Cycle Consistency:

Used in ComboGAN [Asha Anoosheh, et al., arXiv, 017]


𝐸𝑁𝑋


𝐷𝐸𝑋image

image

image

image

𝐷𝑋

𝐷𝑌




Semantic Consistency:

Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]

To the same latent space

Outline of Part 1

Generation




Basic Components

EnvActorReward

Function

VideoGame

Go

Get 20 scores when killing a monster

The rule of GO

You cannot control

• Input of neural network: the observation of machine represented as a vector or a matrix

• Output neural network : each action corresponds to a neuron in output layer

……

NN as actor

pixelsfire

right

left

Score of an action

0.7

0.2

0.1

Take the action based on the probability.

Neural network as Actor

Actor, Environment, Reward

𝜏 = 𝑠1, 𝑎1, 𝑠2, 𝑎2, ⋯ , 𝑠𝑇 , 𝑎𝑇

Trajectory

Actor

𝑠1

𝑎1

Env

𝑠2

Env

𝑠1

𝑎1

Actor

𝑠2

𝑎2

Env

𝑠3

𝑎2

……

“right” “fire”

Reward Function → Discriminator

Reinforcement Learning v.s. GAN

Actor

𝑠1

𝑎1

Env

𝑠2

Env

𝑠1

𝑎1

Actor

𝑠2

𝑎2

Env

𝑠3

𝑎2

……

𝑅 𝜏 =

𝑡=1

𝑇

𝑟𝑡

Reward

𝑟1

Reward

𝑟2

“Black box”You cannot use backpropagation.

Actor → Generator Fixed

updatedupdated

Imitation Learning

We have demonstration of the expert.

Actor

𝑠1

𝑎1

Env

𝑠2

Env

𝑠1

𝑎1

Actor

𝑠2

𝑎2

Env

𝑠3

𝑎2

……

reward function is not available

Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁Each Ƹ𝜏 is a trajectory of the expert.

Self driving: record human drivers

Robot: grab the arm of robot

Inverse Reinforcement Learning

Reward Function

Environment

Optimal Actor

Inverse Reinforcement Learning

➢Using the reward function to find the optimal actor.

➢Modeling reward can be easier. Simple reward function can lead to complex policy.

Reinforcement Learning

Expert

demonstration of the expert

Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁

Framework of IRL

Expert ො𝜋

Actor 𝜋

ObtainReward Function R

Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁

𝜏1, 𝜏2, ⋯ , 𝜏𝑁

Find an actor based on reward function R

By Reinforcement learning

𝑛=1

𝑁

𝑅 Ƹ𝜏𝑛 >

𝑛=1

𝑁

𝑅 𝜏

Reward function → Discriminator

Actor → Generator

Reward Function R

The expert is always the best.

𝜏1, 𝜏2, ⋯ , 𝜏𝑁

GAN

IRL

G

D

High score for real, low score for generated

Find a G whose output obtains large score from D

Ƹ𝜏1, Ƹ𝜏2, ⋯ , Ƹ𝜏𝑁Expert

Actor

Reward Function

Larger reward for Ƹ𝜏𝑛,Lower reward for 𝜏

Find a Actor obtains large reward

Concluding Remarks

Generation




Reference

• Generation • Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-

Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks, NIPS, 2014

• Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”, NIPS, 2016

• Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017

• Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved Training of Wasserstein GANs, NIPS, 2017

• Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial Network, arXiv, 2016

• Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017

• Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen Improved Techniques for Training GANs, NIPS, 2016

• Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS, 2017

Reference

• Generation • Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira, “On Convergence

and Stability of GANs”, arXiv, 2017

• Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, Liqiang Wang, Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect, ICLR, 2018

• Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, Spectral Normalization for Generative Adversarial Networks, ICLR, 2018

• Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing of GANs for Improved Quality, Stability, and Variation, ICLR, 2018

• Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for High Fidelity Natural Image Synthesis, arXiv, 2018

Reference

• Conditional Generation• Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt

Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML, 2016

• Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, CVPR, 2017

• Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video prediction beyond mean square error, arXiv, 2015

• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv, 2014

• Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator, ICLR, 2018

• Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, XiaoleiHuang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv, 2017

• Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image Synthesis With Auxiliary Classifier GANs, ICML, 2017

Reference

• Conditional Generation• Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by

Backpropagation, ICML, 2015

• Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016

• Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A Novel Framework for Multi-class Classification, submitted to ICASSP 2019

Reference

• Unsupervised Conditional Generation• Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-

Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017

• Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual Learning for Image-to-Image Translation, ICCV, 2017

• Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018

• Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image Generation, ICLR, 2017

• Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN: Unrestrained Scalability for Image Domain Translation, arXiv, 2017

• Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings, arXiv, 2017

Reference

• Unsupervised Conditional Generation• Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic

Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by Sliding Attributes, NIPS, 2017

• Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim, Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, ICML, 2017

• Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS, 2016

• Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image Translation Networks, NIPS, 2017

• Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation, arXiv, 2017

Outline


Part II: Applications to Natural Language Processing

Part III: Applications to Speech Processing


Image Style Transfer

Text Style Transfer

It is good.

It’s a good day.I love you.

It is bad.

It’s a bad day.

I don’t love you.

positive negative

photos Vincent van Gogh’s paintings

Not Paired

Not Paired

Cycle GAN







Cycle GAN





𝐷𝑌𝐷𝑋negative sentence? positive sentence?

It is bad. It is good. It is bad.

I love you. I hate you. I love you.positive

positive

positivenegative

negative negative

Discrete Issue

𝐺𝑋→𝑌

𝐷𝑌 positive sentence?

It is bad. It is good.positivenegative

large network

hidden layer

update

fix

Backpropagation

with discrete output

Seq2seq model

Three Categories of Solutions

Gumbel-softmax

• [Matt J. Kusner, et al, arXiv, 2016]

Continuous Input for Discriminator

• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017]

“Reinforcement Learning”

• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]

Cycle GAN





𝐷𝑌𝐷𝑋negative sentence? positive sentence?

It is bad. It is good. It is bad.

I love you. I hate you. I love you.positive

positive

positivenegative

negative negative

Word embedding[Lee, et al., ICASSP, 2018]

Discrete?

Cycle GAN

• Negative sentence to positive sentence:it's a crappy day → it's a great dayi wish you could be here → you could be hereit's not a good idea → it's good ideai miss you → i love youi don't love you → i love youi can't do that → i can do thati feel so sad → i happyit's a bad day → it's a good dayit's a dummy day → it's a great daysorry for doing such a horrible thing → thanks for doing a great thingmy doggy is sick → my doggy is my doggymy little doggy is sick → my little doggy is my little doggy

Thinks Yau-Shian Wang for providing the results.

✘ Negative sentence to positive sentence:

Cycle GAN

感謝張瓊之同學提供實驗結果

胃疼 , 沒睡醒 , 各種不舒服 -> 生日快樂 , 睡醒 , 超級舒服

我都想去上班了, 真夠賤的! -> 我都想去睡了, 真帥的 !

暈死了, 吃燒烤、竟然遇到個變態狂 -> 哈哈好 ~ , 吃燒烤 ~ 竟然遇到帥狂

我肚子痛的厲害 -> 我生日快樂厲害

感冒了, 難受的說不出話來了 ! -> 感冒了, 開心的說不出話來 !

𝐸𝑁𝑋


𝐷𝐸𝑋 𝐷𝑋

𝐷𝑌




PositiveSentence

PositiveSentence

NegativeSentence

NegativeSentence

Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017]

From 𝐸𝑁𝑋 or 𝐸𝑁𝑌Domain

Discriminator

𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator

[Zhao, et al., ICML 2018]

[Fu, et al., AAAI, 2018]



Text Style Transfer


Not Paired

Not Paired

summarydocument

This is unsupervised abstractive summarization.

Abstractive Summarization

• Now machine can do abstractive summary by seq2seq (write summaries in its own words)

summary 1

summary 2

summary 3

Training Data

summary

seq2seq(in its own words)

Supervised: We need lots of labelled training data.

Unsupervised Abstractive Summarization• Now machine can do abstractive summary by

seq2seq (write summaries in its own words)

summary 1

summary 2

summary 3

seq2seq document

Domain X Domain Y

G

Seq2seq

documentword

sequence

D

Human written summaries Real or not

Discriminator

Unsupervised Abstractive Summarization

Summary?

G

Seq2seq

documentword

sequence

D


Discriminator

R

Seq2seq

document


minimize the reconstruction error


G RSummary?

Seq2seq Seq2seq

document documentword

sequence

Only need a lot of documents to train the model

This is a seq2seq2seq auto-encoder.

Using a sequence of words as latent representation.

not readable …


G R

Seq2seq Seq2seq

word sequence

D


DiscriminatorLet Discriminator considers

my output as real

document document

Summary?

Readable

REINFORCE algorithm to deal with the discrete issue

Experimental results

ROUGE-1 ROUGE-2 ROUGE-L

Supervised 33.2 14.2 30.5

Trivial 21.9 7.7 20.5

Unsupervised(matched data)

28.1 10.0 25.4

Unsupervised(no matched data)

27.2 9.1 24.1

English Gigaword (Document title as summary)

• Matched data: using the title of English Gigaword to train Discriminator

• No matched data: using the title of CNN/Diary Mail to train Discriminator

Semi-supervised Learning

25

26

27

28

29

30

31

32

33

34

0 10k 500k

RO

UG

E-1

Number of document-summary pairs used

WGAN Reinforce Supervised

Using matched data

3.8M pairs are used.Approaches to deal with the discrete issue.

unsupervised

semi-supervised

Unsupervised Abstractive Summarization• Document:澳大利亞今天與13個國家簽署了反興奮劑雙邊協議,旨在加強體育競賽之外的藥品檢查並共享研究成果 ……

• Summary:

• Human:澳大利亞與13國簽署反興奮劑協議

• Unsupervised:澳大利亞加強體育競賽之外的藥品檢查

• Document:中華民國奧林匹克委員會今天接到一九九二年冬季奧運會邀請函,由於主席張豐緒目前正在中南美洲進行友好訪問,因此尚未決定是否派隊赴賽 ……

• Summary:

• Human:一九九二年冬季奧運會函邀我參加

• Unsupervised:奧委會接獲冬季奧運會邀請函

感謝王耀賢同學提供實驗結果

Unsupervised Abstractive Summarization• Document:據此間媒體27日報道,印度尼西亞蘇門答臘島的兩個省近日來連降暴雨,洪水泛濫導致塌方,到26日為止至少已有60人喪生,100多人失蹤 ……

• Summary:

• Human:印尼水災造成60人死亡

• Unsupervised:印尼門洪水泛濫導致塌雨

• Document:安徽省合肥市最近為領導幹部下基層做了新規定:一律輕車簡從,不準搞迎來送往、不準搞層層陪同 ……

• Summary:

• Human:合肥規定領導幹部下基層活動從簡

• Unsupervised:合肥領導幹部下基層做搞迎來送往規定:一律簡

感謝王耀賢同學提供實驗結果

Outline






Speech Style Transfer


Not Paired

Not Paired

This is unsupervised voice conversion.

Speaker A Speaker B

Voice Conversion

Voice Conversion

• The same sentence has different impact when it is said by different people.

Do you want to study a PhD?

Go away! Student

Student

Do you want to study a PhD?新垣結衣

(Aragaki Yui)

In the past

With GAN

Speaker A Speaker B

How are you? How are you?

Good morning Good morning

Speaker A Speaker B

天氣真好 How are you?

再見囉 Good morning

Speakers A and B are talking about completely different things.

Cycle GAN







Cycle GANfor Voice Conversion







X: Speaker A, Y: Speaker B[Takuhiro Kaneko, et. al, arXiv, 2017][Fuming Fang, et. al, ICASSP, 2018][Yang Gao, et. al, ICASSP, 2018]

spectrogram spectrogram

𝐸𝑁𝑋


𝐷𝐸𝑋


𝐸𝑁𝑋


𝐷𝐸𝑋


• All the speakers share the same encoder.

• The model can deal with the speakers never seen during training.

Encoder

Encoder

𝐸𝑁𝑋


𝐷𝐸𝑋


All the speakers also share the same decoder.

Encoder

Encoder Decoder

Decoder

Which speakers?Discriminator

We hope that encoder can extract the phonetic information while removing the speaker information.

The encoder fools the discriminator.

Use a vector (one-hot) to represent speaker identity.

Encoder

Decoder

reconstructed

Training

Testing

Decoder

A is reading the sentence of B


A How are you?

A

Which speakers?

Discriminator

B Hello

AHello

AEncoder

phonetic information

EncoderHow are you?

Which speakers?

Discriminator

Does it contain phonetic information?

Different colors: different words

Different colors: different speakers

“Audio” Word to Vector

Encoder

Decoder

reconstructed

Training

Testing

Decoder


Issues

A How are you?

A

Which speakers?

Discriminator

B Hello

AHello

AEncoder

Different Speakers

The Same Speakers

Low Quality

2nd Stage Training

Discriminatorreal or generated?

which speaker?

Cheat discriminator

Speaker

Classifier

Help speaker classifier

DecoderB Hello

A

AEncoder

Extra Criterion for Training

Hello

Different SpeakersNo learning target???

• Subjective evaluations(20 speakers in VCTK)

Experimental Results[Chou et al., INTERSPEECH, 2018]

“Two stages” is better

“One stage” is better

Indistinguishable

“Projection” is better

“Cycle GAN” is better

Indistinguishable

Demo

Target:

Source:

Source to Target:

https://jjery2243542.github.io/voice_conversion_demo/

Thanks Ju-chieh Chou for providing the results.

Decoder


B Hello

AHello

AEncoder

Source SpeakerTarget

Speaker

Source Speaker

Target Speaker

Me

https://jjery2243542.github.io/voice_conversion_demo/

(Never seen during training!)

Me

Me

Source to Target

Thanks Ju-chieh Chou for providing the results.

Me


AudioNot Paired

This is unsupervised speech recognition.

Text

Supervised Speech Recognition

https://devopedia.org/images/article/102/9180.1532710057.png

(I believe you have seen similar figures before.)

• Supervised learning needs lots of annotated speech.• However, most of the languages are low resourced.

Speech Recognition in the Future

http://www.parenting.com/article/teach-baby-to-talk

Learning human language with very little supervision

Unsupervised Speech Recognition

• Machine learns to recognize speech from unparallel speech and text.

audio collection(without text annotation)

text documents(not parallel to audio)

This idea was too crazy to be realized in the past.

However, it becomes possible with GAN recently.

[Liu, et al., INTERSPEECH, 2018]

[Chen, et al., arXiv, 2018]

Acoustic Token Discovery

Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09]

[Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]

Acoustic tokens can be discovered from audio collection without text annotation.


Token 1

Token 1

Token 1Token 2

Token 3

Token 3

Token 3

Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09]

[Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]

Acoustic tokens can be discovered from audio collection without text annotation.

Token 2

Token 4


Phonetic-level acoustic tokens are obtained by segmental sequence-to-sequence autoencoder.

[Wang, et al., ICASSP, 2018]

Unsupervised Speech Recognition

AY L AH V Y UW

G UH D B AY

HH AW AA R Y UW

T AY W AA N

AY M F AY NCycleGAN

“AY”=

Phone-level Acoustic Pattern Discovery

p1

p1 p3 p2

p1 p4 p3 p5 p5

p1 p5 p4 p3

p1 p2 p3 p4

Phoneme sequences from Text

[Liu, et al., INTERSPEECH,

2018]

[Chen, et al., arXiv, 2018]

The progress of supervised learning

Acc

ura

cy

Unsupervised learning today (2019) is as good as supervised learning 30 years ago.

The image is modified from: Phone recognition on the TIMIT database Lopes, C. and Perdigão, F., 2011. Speech Technologies, Vol 1, pp. 285--302.

Concluding Remarks




To Learn More …

https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf

You can learn more from the YouTube Channel

(in Mandarin)

Generative Adversarial Network - 國立臺灣大學miulab/s107-adl/doc/190430_GAN.pdf · Generative...

Documents

Transcript of Generative Adversarial Network - 國立臺灣大學miulab/s107-adl/doc/190430_GAN.pdf · Generative...