Generative Adversarial Network - 國立臺灣大學miulab/s107-adl/doc/190430_GAN.pdf · Generative...

Generative Adversarial Network and its Applications to Speech Processing and Natural Language Processing Hung-yi Lee


Page 1

Generative Adversarial Network and its Applications to Speech Processing

and Natural Language Processing

Hung-yi Lee

Page 2

Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, “Variational Approaches for Auto-Encoding Generative Adversarial Networks”, arXiv, 2017

All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo

GAN

ACGAN

BGAN

DCGAN

EBGAN

fGAN

GoGAN

CGAN

……

Page 3

Outline

Part I: General Introduction of Generative Adversarial Network (GAN)

Part II: Applications to Speech Processing

Part III: Applications to Natural Language Processing

Page 4

Outline of Part 1

Generation by GAN

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Page 5

Outline of Part 1

Generation by GAN

• Image Generation as Example

• Theory behind GAN

• Issues and Possible Solutions

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Page 6

Anime Face Generation

A generator learns to draw anime faces; example outputs are shown on the slide.

Page 7

Basic Idea of GAN — Generator

The generator is a neural network (NN), or a function: it takes a high-dimensional vector (e.g. [0.1, -3, …, 2.4, 0.9]) and outputs an image. Each dimension of the input vector represents some characteristic — changing one dimension can, for example, lengthen the hair, make the hair blue, or open the mouth.

Powered by: http://mattya.github.io/chainer-DCGAN/

Page 8

Basic Idea of GAN — Discriminator

The discriminator is also a neural network (NN), or a function: it takes an image and outputs a scalar. A larger value means the input is real, a smaller value means it is fake (e.g. realistic faces score 1.0, crude sketches score 0.1).

Page 9

Algorithm

• Initialize generator G and discriminator D
• In each training iteration:

Step 1: Fix generator G, and update discriminator D.

Sample real objects from the database, and generate objects by feeding randomly sampled vectors through the (fixed) generator. The discriminator learns to assign high scores (1) to real objects and low scores (0) to generated objects.

Page 10

Algorithm

• Initialize generator and discriminator
• In each training iteration:

Step 2: Fix discriminator D, and update generator G.

Feed a vector to the NN generator and pass its output image to the discriminator, which outputs a score (e.g. 0.13). Generator and discriminator together form one large network; keeping the discriminator's parameters fixed, update the generator by gradient ascent on the score. The generator learns to "fool" the discriminator.

Page 11

Algorithm

• Initialize generator and discriminator
• In each training iteration:

Learning D: Sample some real objects from the database (target score 1) and generate some fake objects with G (target score 0); update D with G fixed.

Learning G: Sample vectors, pass them through G and then D; with D fixed, update G so that its images are scored as real (1).
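The alternating algorithm above can be sketched as a minimal runnable toy. The setup is illustrative, not the lecture's networks: a 1-D "image" distribution N(3, 0.5), a linear generator G(z) = a*z + c, and a logistic discriminator D(x) = sigmoid(w*x + b), with hand-derived gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

a, c = 1.0, 0.0      # generator parameters (toy linear generator)
w, b = 0.0, 0.0      # discriminator parameters (toy logistic discriminator)
lr = 0.05

for it in range(2000):
    # Step 1: fix G, update D by gradient ascent on
    #   V = E_data[log D(x)] + E_G[log(1 - D(x))]
    x_real = rng.normal(3.0, 0.5, size=64)
    x_fake = a * rng.normal(size=64) + c
    d_real, d_fake = sigmoid(w * x_real + b), sigmoid(w * x_fake + b)
    w += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    b += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Step 2: fix D, update G by gradient ascent on E_z[log D(G(z))]
    z = rng.normal(size=64)
    x_g = a * z + c
    dx = (1 - sigmoid(w * x_g + b)) * w    # d(log D)/dx at the generated points
    a += lr * np.mean(dx * z)              # chain rule through G(z) = a*z + c
    c += lr * np.mean(dx)

# Generated samples drift toward the real data (mean near 3).
gen_mean = float(np.mean(a * rng.normal(size=10000) + c))
```

The generator never sees real data directly; it only follows the gradient the discriminator provides, which is exactly the structure of the two steps on the slides.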

Page 12

Faces generated by the machine. The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.

Page 13

(Figure: outputs of G as the input vector is interpolated in steps of 0.1 from 0.0 to 0.9 — the generated faces morph smoothly between the two endpoints.)

Page 14

Amazing Results! [Tero Karras, et al., ICLR, 2018]

Page 15

Amazing Results! [Andrew Brock, et al., arXiv, 2018]

Page 16

(Variational) Auto-encoder

An NN encoder maps an image to a code, and an NN decoder maps the code back, trained so that input and reconstruction are as close as possible. The decoder alone can act as a generator: randomly generate a vector as the code — image?

Page 17

Auto-encoder v.s. GAN

An auto-encoder's decoder (= generator) is trained to make its output as close as possible to a target image, so it tends to produce fuzzy results — it could even succeed by just copying an image. In a GAN, the generator's output goes through a discriminator; if the discriminator does not simply memorize the images, the generator must learn the patterns of faces.

Page 18

FID [Martin Heusel, et al., NIPS, 2017]: smaller is better. [Mario Lucic, et al., arXiv, 2017]

Page 19

Outline of Part 1

Generation

• Image Generation as Example

• Theory behind GAN

• Issues and Possible Solutions

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Page 20

Generator

• A generator G is a network. The network defines a probability distribution P_G.

Draw z from a normal distribution and set x = G(z), where x is an image (a high-dimensional vector). We want P_G(x) to be as close as possible to P_data(x):

G* = arg min_G Div(P_G, P_data)

where Div(P_G, P_data) is the divergence between distributions P_G and P_data. How to compute the divergence?

Page 21

Discriminator

G* = arg min_G Div(P_G, P_data)

Although we do not know the distributions P_G and P_data, we can sample from them: sampling from P_data means drawing images from the database; sampling from P_G means drawing vectors from the normal distribution and feeding them through G.

Page 22

Discriminator

G* = arg min_G Div(P_G, P_data)

Train a discriminator to separate data sampled from P_data and data sampled from P_G (with G fixed). Example objective function for D:

V(G, D) = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 - D(x))]

Training: D* = arg max_D V(D, G)

Using this example objective function is exactly the same as training a binary classifier. The maximum objective value is related to JS divergence. [Goodfellow, et al., NIPS, 2014]
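The relation between the optimal discriminator's objective value and JS divergence can be checked numerically. A sketch on a pair of small discrete distributions (the numbers are illustrative), using the closed-form optimal discriminator D*(x) = P_data(x) / (P_data(x) + P_G(x)):

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])   # P_data on a 3-point support (illustrative)
q = np.array([0.3, 0.3, 0.4])   # P_G on the same support

# Optimal discriminator for fixed G
d_star = p / (p + q)

# V(G, D*) = E_Pdata[log D*] + E_PG[log(1 - D*)]
v_max = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1 - d_star))

# Jensen-Shannon divergence from its definition
m = 0.5 * (p + q)
kl = lambda a, b: np.sum(a * np.log(a / b))
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Goodfellow et al. (2014): max_D V(G, D) = -2 log 2 + 2 * JSD(P_data, P_G)
check = -2 * np.log(2) + 2 * jsd
```

Since v_max equals -2 log 2 + 2·JSD, minimizing max_D V over G is minimizing the JS divergence.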

Page 23

Discriminator

G* = arg min_G Div(P_G, P_data);  training: D* = arg max_D V(D, G)

When the two sets of samples are hard to discriminate, the divergence is small and max_D V(D, G) is small; when they are easy to discriminate, the divergence is large.

Page 24

G* = arg min_G max_D V(G, D)

The maximum objective value is related to JS divergence. [Goodfellow, et al., NIPS, 2014]

• Initialize generator and discriminator
• In each training iteration:

Step 1: Fix generator G, and update discriminator D: D* = arg max_D V(D, G)
Step 2: Fix discriminator D, and update generator G

Page 25

Can we use other divergences? Yes — use the divergence you like ☺ [Sebastian Nowozin, et al., NIPS, 2016]

Page 26

Outline of Part 1

Generation

• Image Generation as Example

• Theory behind GAN

• Issues and Possible Solutions

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

More tips and tricks: https://github.com/soumith/ganhacks

Page 27

GAN is hard to train ……

• There is a saying ……

(I found this joke from 陳柏文’s facebook.)

Page 28

JS divergence is not suitable

• In most cases, P_G and P_data do not overlap.

1. The nature of data: both P_data and P_G are low-dimensional manifolds in a high-dimensional space, so their overlap can be ignored.

2. Sampling: even when P_data and P_G do overlap, without enough samples the empirical distributions do not.

Page 29

What is the problem of JS divergence?

Consider generators G_0, G_1, …, G_100 whose distributions move progressively closer to P_data:

JS(P_G0, P_data) = log 2
JS(P_G1, P_data) = log 2
……
JS(P_G100, P_data) = 0

JS divergence is log 2 whenever two distributions do not overlap. Intuition: if two distributions do not overlap, a binary classifier achieves 100% accuracy, so the same objective value — the same divergence — is obtained. G_0 and G_1 look equally bad, even though G_1 is closer to the data.

Page 30

Wasserstein distance

• Consider one distribution P as a pile of earth and another distribution Q as the target.
• The Wasserstein distance is the average distance the earth mover has to move the earth.

If P is a point mass at distance d from the point mass Q, then W(P, Q) = d.

Page 31

Wasserstein distance

There are many possible "moving plans" for reshaping P into Q, and different plans give smaller or larger average distances. The Wasserstein distance is defined using the moving plan with the smallest average distance.

Source of image: https://vincentherrmann.github.io/blog/wasserstein/

Page 32

What is the problem of JS divergence? (JS vs. Wasserstein)

JS(P_G0, P_data) = log 2     W(P_G0, P_data) = d_0
JS(P_G1, P_data) = log 2     W(P_G1, P_data) = d_1
……
JS(P_G100, P_data) = 0       W(P_G100, P_data) = 0

As P_G moves toward P_data, the Wasserstein distances d_0 > d_1 > … > 0 shrink, while JS stays at log 2 until the distributions overlap. Better!
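This contrast can be seen concretely in 1-D, where the W1 (earth mover) distance between two empirical distributions with equally many samples is just the mean absolute difference of the sorted samples. A sketch with illustrative numbers:

```python
import numpy as np

def w1(xs, ys):
    # 1-D earth mover distance between equal-size empirical samples:
    # optimal "moving plan" matches sorted points to sorted points.
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

p_data = np.array([0.0, 0.1, 0.2, 0.7])               # samples from P_data (toy)
shifts = (100.0, 1.0, 0.5, 0.0)                       # P_G = P_data shifted by d
distances = [w1(p_data, p_data + d) for d in shifts]
# W1 shrinks smoothly as P_G approaches P_data, whereas JS would sit at
# log 2 for every non-overlapping shift and drop to 0 only at d = 0.
```

The smoothly decreasing distances are exactly the d_0 > d_1 > … > 0 sequence on the slide, which is what gives the generator a useful gradient.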

Page 33

WGAN [Martin Arjovsky, et al., arXiv, 2017]

V(G, D) = max_{D in 1-Lipschitz} { E_{x~P_data}[D(x)] - E_{x~P_G}[D(x)] }

This evaluates the Wasserstein distance between P_data and P_G. D has to be smooth enough: without the 1-Lipschitz constraint, the training of D does not converge — D(x) is pushed toward +∞ on real data and -∞ on generated data. Keeping D smooth prevents D(x) from blowing up to ±∞. How to fulfill this constraint?

Page 34

V(G, D) = max_{D in 1-Lipschitz} { E_{x~P_data}[D(x)] - E_{x~P_G}[D(x)] }

• Original WGAN → weight clipping [Martin Arjovsky, et al., arXiv, 2017]: force each parameter w to stay between c and -c. After a parameter update, if w > c, set w = c; if w < -c, set w = -c.

• Improved WGAN → gradient penalty [Ishaan Gulrajani, NIPS, 2017]: keep the gradient norm close to 1 for samples between the real and generated data [Kodali, et al., arXiv, 2017][Wei, et al., ICLR, 2018].

• Spectral normalization → keep the gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018].
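Weight clipping is simple enough to sketch end to end. The toy below (illustrative, not the lecture's networks) uses a linear critic f(x) = w*x — WGAN's D outputs an unbounded score, no sigmoid — and clips w into [-c, c] after every update, which makes f exactly c-Lipschitz:

```python
import numpy as np

rng = np.random.default_rng(1)
x_real = rng.normal(3.0, 0.5, size=256)   # samples from P_data (toy)
x_fake = rng.normal(0.0, 0.5, size=256)   # samples from P_G (toy)

w, clip_c, lr = 0.0, 0.01, 0.1
for _ in range(100):
    # Gradient ascent on V = E_data[f(x)] - E_G[f(x)] with f(x) = w*x
    w += lr * (np.mean(x_real) - np.mean(x_fake))
    # Original WGAN: clip the parameter into [-c, c] after each update,
    # so |f'(x)| = |w| <= c and f is c-Lipschitz
    w = float(np.clip(w, -clip_c, clip_c))

# The clipped critic's value gap is a (c-scaled) estimate of the W distance.
v = float(np.mean(w * x_real) - np.mean(w * x_fake))
```

Note how crude the constraint is: the gradient wants w ≈ 0.3 but clipping pins it at 0.01, which is why gradient penalty and spectral normalization were proposed as gentler alternatives.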

Page 35

Energy-based GAN (EBGAN) [Junbo Zhao, et al., arXiv, 2016]

• Use an autoencoder as the discriminator D; the generator is the same as before.

➢ The negative reconstruction error of the autoencoder determines the goodness of an image: 0 for the best images, negative values (e.g. -0.1, -1) for images the autoencoder reconstructs poorly.

➢ Benefit: the autoencoder can be pre-trained on real images, without a generator.

Page 36

Tip: Improve Quality during Testing

The generator G is trained with z drawn from a normal distribution, but at test time z is drawn from a distribution with smaller variance. Some samples from the full prior are poor; with smaller variance the output is more stable, but sacrifices diversity. (This tip was provided by 柯達方.) It is also used in [Andrew Brock, et al., arXiv, 2018].

Page 37

Mode Collapse

(Figure: real data covers many modes, but the generated data concentrates on only one of them.) Training with too many iterations ……

Page 38

Mode Dropping

BEGAN on CelebA: comparing the generator at iterations t, t+1, and t+2 shows the generator switching modes during training rather than covering them all.

Page 39

Tip: Ensemble

Train a set of generators G_1, G_2, ⋯, G_N. To generate an image, randomly pick a generator G_i and use G_i to generate it.

Page 40

Objective Evaluation [Tim Salimans, et al., NIPS, 2016]

Feed each generated image x into an off-the-shelf image classifier (e.g. Inception net, VGG) to obtain a class distribution P(y|x), where y is the class. A concentrated P(y|x) means higher visual quality. Averaging over N images,

P(y) = (1/N) Σ_n P(y_n|x_n)

and a uniform P(y) means higher variety.

Page 41

Inception Score [Tim Salimans, et al., NIPS, 2016]

Inception Score = Σ_x Σ_y P(y|x) log P(y|x) - Σ_y P(y) log P(y)

The first term is the negative entropy of P(y|x), summed over the generated images (large when each image is classified sharply); the second term is the entropy of P(y) = (1/N) Σ_n P(y_n|x_n) (large when the generated classes are diverse).
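The two terms can be checked on toy class posteriors (illustrative 3-class numbers). Here the score is computed in its common exponentiated-KL form, exp(E_x[KL(P(y|x) || P(y))]), which combines exactly the negative-entropy and entropy terms above:

```python
import numpy as np

# Toy P(y|x_n) matrices: rows are images, columns are classes (assumed numbers).
sharp_diverse = np.array([[0.9, 0.05, 0.05],    # each image sharply classified,
                          [0.05, 0.9, 0.05],    # and the classes are spread out
                          [0.05, 0.05, 0.9]])
blurry = np.full((3, 3), 1/3)                   # every image looks like everything

def inception_score(p_yx):
    p_y = p_yx.mean(axis=0)                     # P(y) = (1/N) sum_n P(y|x_n)
    # E_x[ KL( P(y|x) || P(y) ) ] = (neg. entropy of P(y|x)) + (entropy of P(y))
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return float(np.exp(kl.mean()))
```

Sharp, diverse posteriors score well above 1; uniform (blurry, mode-collapsed) posteriors score exactly 1, the minimum.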

Page 42

Outline of Part 1

Generation

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Page 43

• Original generator: z ~ normal distribution, x = G(z); train so that P_G(x) → P_data(x).

• Conditional generator [Mehdi Mirza, et al., arXiv, 2014]: given a condition c, z ~ normal distribution, x = G(c, z); train so that P_G(x|c) → P_data(x|c).

e.g. Text-to-Image: an NN generator maps "Girl with red hair and red eyes" or "Girl with yellow ribbon" to a corresponding image.

Page 44

Text-to-Image

• Traditional supervised approach: train an NN to map the text (e.g. "train") to an image, making the output as close as possible to the target. But one piece of text corresponds to many images (c1: "a dog is running" fits many dogs, as "a bird is flying" fits many birds), so the network averages over the targets and produces a blurry image!

Page 45

Conditional GAN [Scott Reed, et al, ICML, 2016]

Generator: x = G(c, z), with condition c ("train") and z drawn from a normal distribution.

D (original): takes only the image x and outputs a scalar — is x a real image or not? Real images are labeled 1, generated images 0. The problem: the generator will learn to generate realistic images but completely ignore the input conditions.

Page 46

Conditional GAN [Scott Reed, et al, ICML, 2016]

D (better): takes both the condition c and the image x and outputs a scalar — x is realistic AND c and x are matched. True text–image pairs such as (train, image of a train) are labeled 1; (train, generated image) is labeled 0; and a real image with the wrong text, e.g. (cat, image of a train), is also labeled 0.
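The three kinds of training pairs for the better discriminator can be sketched as batch construction. The function and its rotation trick for producing mismatches are illustrative, not from the lecture:

```python
# A conditional discriminator needs two kinds of negative examples:
# generated images with their condition, and real images with the WRONG condition.
def make_discriminator_batch(reals, captions, fakes):
    """Return (condition, image, label) triples for one D update."""
    pos = [(c, x, 1.0) for c, x in zip(captions, reals)]          # matched real -> 1
    neg_fake = [(c, x, 0.0) for c, x in zip(captions, fakes)]     # generated -> 0
    rotated = captions[1:] + captions[:1]                         # deliberate mismatch
    neg_mismatch = [(c, x, 0.0) for c, x in zip(rotated, reals)]  # real, wrong text -> 0
    return pos + neg_fake + neg_mismatch

batch = make_discriminator_batch(
    reals=["image_of_train", "image_of_cat"],
    captions=["train", "cat"],
    fakes=["generated_1", "generated_2"],
)
```

Without the mismatched negatives, D could ignore the condition entirely and the generator would too.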

Page 47

Conditional GAN - Discriminator

Architectures for scoring "x is realistic or not + c and x are matched or not":

• (almost every paper) Embed the condition c and the object x with separate networks, then feed both embeddings to a network that outputs a single score. [Takeru Miyato, et al., ICLR, 2018][Han Zhang, et al., arXiv, 2017][Augustus Odena et al., ICML, 2017]

• Alternatively, one network scores "x is realistic or not" from x alone, and another network scores "c and x are matched or not".

Page 48

Conditional GAN needs paired data: collect anime faces together with descriptions of their characteristics (e.g. blue eyes, red hair, short hair). Given conditions such as "red hair, green eyes" or "blue hair, red eyes", the generator produces matching faces. The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.

Page 49

Conditional GAN - Image-to-image [Phillip Isola, et al., CVPR, 2017]

x = G(c, z), where the condition c is an input image. This is image translation, or pix2pix.

Page 50

Conditional GAN - Image-to-image [Phillip Isola, et al., CVPR, 2017]

• Traditional supervised approach: train an NN to map the input image to the target image, making them as close as possible under a reconstruction loss (e.g. L1). Testing shows the result: it is blurry.

Page 51

Conditional GAN - Image-to-image [Phillip Isola, et al., CVPR, 2017]

With a GAN, the generator G takes the input image and z, and a discriminator D scores the output image. Testing compares the input against outputs trained with L1 only, GAN only, and GAN + L1.

Page 52

Conditional GAN - Video Generation [Michael Mathieu, et al., arXiv, 2015]

The generator predicts the next frame of a video; the discriminator checks whether the last frame is real or generated, and the generator is updated until the discriminator thinks it is real.

Page 53

https://github.com/dyelax/Adversarial_Video_Generation

Page 54

Conditional GAN - Sound-to-image

G maps a sound condition c (e.g. "a dog barking sound") to an image. Training data collection: videos supply paired audio tracks and frames.

Page 55

Conditional GAN - Sound-to-image

• Audio-to-image demo: https://wjohn1483.github.io/audio_to_scene/index.html (including results for louder inputs). The images are generated by Chia-Hung Wan and Shun-Po Chuang.

Page 56

Conditional GAN - Image-to-label

A multi-label image classifier can be viewed as a conditional generator: the input image is the condition, and the generated output is the set of labels.

Page 57

Conditional GAN - Image-to-label [Tsai, et al., submitted to ICASSP 2019]

F1           MS-COCO   NUS-WIDE
VGG-16       56.0      33.9
 + GAN       60.4      41.2
Inception    62.4      53.5
 + GAN       63.8      55.8
Resnet-101   62.8      53.1
 + GAN       64.0      55.4
Resnet-152   63.3      52.1
 + GAN       63.9      54.1
Att-RNN      62.1      54.7
RLSD         62.0      46.9

The classifiers can have different architectures; here they are trained as conditional GANs.

Page 58

Conditional GAN - Image-to-label (results, continued)

(Same table as on the previous page.) The classifiers can have different architectures and are trained as conditional GANs; conditional GAN outperforms other models designed for multi-label tasks (Att-RNN, RLSD).

Page 59

Domain Adversarial Training

• Training and testing data are in different domains — take digit classification as an example. A generator (feature extractor) should map the training data and the testing data to features with the same distribution.

Page 60

Domain Adversarial Training

A feature extractor (generator) produces features from images, and a discriminator (domain classifier) tries to tell which domain each feature came from (the blue points vs. the red points). Caveat: the feature extractor can make the domain classifier fail trivially by always outputting zero vectors.

Page 61

Domain Adversarial Training

The feature extractor must not only cheat the domain classifier ("which domain?") but also satisfy a label predictor ("which digit?") at the same time. Successfully applied to image classification [Ganin et al, ICML, 2015][Ajakan et al. JMLR, 2016]. More speech-related applications in Part II.

Page 62

Outline of Part 1

Generation

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Page 63

Unsupervised Conditional GAN

Transform an object from one domain to another without paired data — e.g. domain X is photos, domain Y is Vincent van Gogh's paintings; the condition is a photo and the generated object is its painting-style version. The training data are not paired. Image style transfer is used as the example here; more applications appear in Parts II and III.

Page 64

Unsupervised Conditional Generation

• Approach 1: Direct transformation — a generator G_{X→Y} maps domain X directly to domain Y; suitable for texture or color change.

• Approach 2: Projection to common space — an encoder EN_X of domain X maps an object (e.g. a face) to attributes, and a decoder DE_Y of domain Y maps the attributes to an image; allows larger change, keeping only the semantics.

Page 65

Direct Transformation

G_{X→Y} transforms an image from domain X, and a discriminator D_Y — trained on domain Y — outputs a scalar indicating whether its input image belongs to domain Y or not. Fooling D_Y makes the generator's output become similar to domain Y.

Page 66

Direct Transformation

But becoming similar to domain Y is not enough: G_{X→Y} could satisfy D_Y while completely ignoring its input — not what we want!

Page 67

Direct Transformation

The ignore-input issue can be avoided by network design: a simpler generator makes the input and output more closely related. [Tomer Galanti, et al. ICLR, 2018]

Page 68

Direct Transformation — baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]

Alternatively, feed both the input and the output of G_{X→Y} into a pre-trained encoder network, and require the two embeddings to be as close as possible.

Page 69

Direct Transformation – Cycle GAN [Jun-Yan Zhu, et al., ICCV, 2017]

Cycle consistency: besides D_Y judging whether G_{X→Y}'s output belongs to domain Y, a second generator G_{Y→X} maps that output back to domain X, and the reconstruction must be as close as possible to the original input. A generator that ignores its input lacks the information needed for reconstruction, so cycle consistency rules it out.

Page 70

Direct Transformation – Cycle GAN

Train both directions jointly: G_{X→Y} followed by G_{Y→X} must reconstruct the domain-X input as closely as possible, and G_{Y→X} followed by G_{X→Y} must reconstruct the domain-Y input, while D_X and D_Y output scalars for "belongs to domain X or not" and "belongs to domain Y or not".
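The two cycle-consistency terms can be sketched with toy generators (simple shifts standing in for the CycleGAN networks — an assumption for illustration only):

```python
import numpy as np

# Toy "domains": X lives around -1, Y around +1; the generators are shifts.
def g_xy(x):       # stands in for G_{X->Y}
    return x + 2.0

def g_yx(y):       # stands in for G_{Y->X}
    return y - 2.0

x = np.array([-1.2, -0.9, -1.1])   # samples from domain X (illustrative)
y = np.array([1.0, 0.8, 1.3])      # samples from domain Y (illustrative)

# Cycle-consistency losses: going to the other domain and back should
# recover the input, so the generators cannot ignore what they are given.
cycle_x = float(np.mean(np.abs(g_yx(g_xy(x)) - x)))   # ||G_{Y->X}(G_{X->Y}(x)) - x||
cycle_y = float(np.mean(np.abs(g_xy(g_yx(y)) - y)))   # ||G_{X->Y}(G_{Y->X}(y)) - y||
```

These two reconstruction terms are added to the adversarial losses from D_X and D_Y; for the consistent shift generators above, both cycle terms are (numerically) zero.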

Page 71

Cycle GAN [Jun-Yan Zhu, et al., ICCV, 2017], Dual GAN [Zili Yi, et al., ICCV, 2017], and Disco GAN [Taeksoo Kim, et al., ICML, 2017] share the same idea. For multiple domains, consider StarGAN [Yunjey Choi, arXiv, 2017].

Page 72

Issue of Cycle Consistency

• CycleGAN: a Master of Steganography [Casey Chu, et al., NIPS workshop, 2017] — G_{X→Y} can hide information about its input in its output, so that G_{Y→X} reconstructs well even when the visible output does not preserve the input.

Page 73

Unsupervised Conditional Generation (recap)

• Approach 1: Direct transformation G_{X→Y} — for texture or color change.
• Approach 2: Projection to common space via an encoder EN_X of domain X and a decoder DE_Y of domain Y — larger change, only keep the semantics (e.g. face attributes).

Page 74

Projection to Common Space — Target

Encoders EN_X and EN_Y map images from domains X and Y into a shared attribute space (e.g. face attributes); decoders DE_X and DE_Y map attributes back to images in their respective domains.

Page 75

Projection to Common Space — Training

Train each domain's encoder–decoder pair (EN_X with DE_X, EN_Y with DE_Y) as an auto-encoder, minimizing reconstruction error.

Page 76

Projection to Common Space — Training

Besides minimizing reconstruction error, add discriminators D_X and D_Y for each domain so that the decoded images look realistic. However, because the two auto-encoders are trained separately, images with the same attributes may not project to the same position in the latent space.

Page 77

Projection to Common Space — Training

One fix: share the parameters of the encoders (EN_X, EN_Y) and of the decoders (DE_X, DE_Y) — Couple GAN [Ming-Yu Liu, et al., NIPS, 2016], UNIT [Ming-Yu Liu, et al., NIPS, 2017].

Page 78

Projection to Common Space — Training

Another fix: a domain discriminator guesses whether a latent code came from EN_X or EN_Y, and EN_X and EN_Y learn to fool it. The domain discriminator thus forces the outputs of EN_X and EN_Y to have the same distribution. [Guillaume Lample, et al., NIPS, 2017] The reconstruction losses and the per-domain discriminators D_X and D_Y are kept.

Page 79

𝐸𝑁𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

𝐷𝐸𝑋image

image

image

image

𝐷𝑋

𝐷𝑌

Discriminator of X domain

Discriminator of Y domain

Projection to Common Space: Training

Cycle Consistency:

Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]

Minimizing reconstruction error
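A minimal sketch of the cycle-consistency term: translating X → Y and back should reproduce the input. The toy linear "translators" below are illustrative assumptions, not ComboGAN's networks.

```python
import numpy as np

def cycle_consistency_loss(G_xy, G_yx, x):
    """|| G_YX(G_XY(x)) - x ||^2: translate to Y and back, compare with x."""
    return float(np.sum((G_yx(G_xy(x)) - x) ** 2))

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])              # toy invertible "translator" X -> Y
G_xy = lambda x: A @ x
G_yx = lambda y: np.linalg.inv(A) @ y   # its exact inverse, Y -> X
x = np.array([3.0, -1.0])
loss = cycle_consistency_loss(G_xy, G_yx, x)
```

When the two mappings invert each other, as here, the loss is zero; training pushes real generator pairs toward this behavior.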

Page 80

𝐸𝑁𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

𝐷𝐸𝑋image

image

image

image

𝐷𝑋

𝐷𝑌

Discriminator of X domain

Discriminator of Y domain

Projection to Common Space: Training

Semantic Consistency:

Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]

To the same latent space

Page 81

Outline of Part 1

Generation

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Page 82

Basic Components

EnvActorReward

Function

Video Game

Go

Get 20 points when killing a monster

The rules of Go

You cannot control

Page 83

• Input of the neural network: the machine's observation, represented as a vector or a matrix

• Output of the neural network: each action corresponds to a neuron in the output layer

……

NN as actor

pixels

fire

right

left

Score of an action

0.7

0.2

0.1

Take the action based on the probability.

Neural network as Actor
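The stochastic actor described above can be sketched as follows; the weight matrix `W` and the observation vector are illustrative placeholders for the pixel input and network parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def act(weights, observation, rng):
    """Score every action, turn the scores into probabilities, sample one."""
    scores = weights @ observation          # one score per action
    probs = softmax(scores)                 # e.g. fire 0.7, right 0.2, left 0.1
    action = rng.choice(len(probs), p=probs)
    return action, probs

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0],                   # "fire"
              [0.0, 1.0],                   # "right"
              [0.5, 0.5]])                  # "left"
obs = np.array([2.0, 0.1])                  # stand-in for the pixel observation
action, probs = act(W, obs, rng)
```

Sampling (rather than always taking the arg-max) is what "take the action based on the probability" means: the same observation can yield different actions.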

Page 84

Actor, Environment, Reward

𝜏 = 𝑠1, 𝑎1, 𝑠2, 𝑎2, ⋯ , 𝑠𝑇 , 𝑎𝑇

Trajectory

Actor

𝑠1

𝑎1

Env

𝑠2

Env

𝑠1

𝑎1

Actor

𝑠2

𝑎2

Env

𝑠3

𝑎2

……

“right” “fire”

Page 85

Reward Function → Discriminator

Reinforcement Learning v.s. GAN

Actor

𝑠1

𝑎1

Env

𝑠2

Env

𝑠1

𝑎1

Actor

𝑠2

𝑎2

Env

𝑠3

𝑎2

……

R(τ) = Σ_{t=1}^{T} r_t

Reward

𝑟1

Reward

𝑟2

“Black box”: you cannot use backpropagation.

Actor → Generator

Env fixed, Actor updated

Page 86

Imitation Learning

We have demonstration of the expert.

Actor

𝑠1

𝑎1

Env

𝑠2

Env

𝑠1

𝑎1

Actor

𝑠2

𝑎2

Env

𝑠3

𝑎2

……

reward function is not available

τ̂1, τ̂2, ⋯, τ̂N: each τ̂ is a trajectory of the expert.

Self driving: record human drivers

Robot: grab the arm of robot

Page 87

Inverse Reinforcement Learning

Reward Function

Environment

Optimal Actor

Inverse Reinforcement Learning

➢Using the reward function to find the optimal actor.

➢Modeling the reward can be easier: a simple reward function can lead to a complex policy.

Reinforcement Learning

Expert

demonstration of the expert

τ̂1, τ̂2, ⋯, τ̂N

Page 88

Framework of IRL

Expert π̂

Actor 𝜋

ObtainReward Function R

τ̂1, τ̂2, ⋯, τ̂N

𝜏1, 𝜏2, ⋯ , 𝜏𝑁

Find an actor based on reward function R

By Reinforcement learning

Σ_{n=1}^{N} R(τ̂n) > Σ_{n=1}^{N} R(τn)

Reward function → Discriminator

Actor → Generator

Reward Function R

The expert is always the best.
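For a linear reward R(τ) = w·φ(τ), the IRL loop above reduces to repeatedly raising the reward of expert trajectories and lowering the actor's, so that Σ R(τ̂n) > Σ R(τn). A toy sketch; the feature vectors and learning rate are invented for illustration.

```python
import numpy as np

def irl_update(w, expert_feats, actor_feats, lr=0.5):
    """One gradient step that raises sum_n R(tau_hat_n) relative to
    sum_n R(tau_n), for a linear reward R(tau) = w . phi(tau)."""
    grad = expert_feats.mean(axis=0) - actor_feats.mean(axis=0)
    return w + lr * grad

w = np.zeros(3)
expert = np.array([[1.0, 0.0, 2.0],     # phi(tau_hat_1)
                   [1.0, 1.0, 2.0]])    # phi(tau_hat_2)
actor = np.array([[0.0, 0.0, 0.0],      # phi(tau_1)
                  [0.0, 1.0, 1.0]])     # phi(tau_2)
for _ in range(20):
    w = irl_update(w, expert, actor)
```

After a few updates the learned reward ranks the expert above the actor; the actor is then re-trained by RL against this reward, and the loop repeats.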

Page 89

𝜏1, 𝜏2, ⋯ , 𝜏𝑁

GAN

IRL

G

D

High score for real, low score for generated

Find a G whose output obtains large score from D

τ̂1, τ̂2, ⋯, τ̂N (Expert)

Actor

Reward Function

Larger reward for τ̂n, lower reward for τ

Find an actor that obtains a large reward

Page 90

Concluding Remarks

Generation

Conditional Generation

Unsupervised Conditional Generation

Relation to Reinforcement Learning

Page 91

Reference

• Generation
• Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks, NIPS, 2014

• Sebastian Nowozin, Botond Cseke, Ryota Tomioka, “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”, NIPS, 2016

• Martin Arjovsky, Soumith Chintala, Léon Bottou, Wasserstein GAN, arXiv, 2017

• Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville, Improved Training of Wasserstein GANs, NIPS, 2017

• Junbo Zhao, Michael Mathieu, Yann LeCun, Energy-based Generative Adversarial Network, arXiv, 2016

• Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, Olivier Bousquet, “Are GANs Created Equal? A Large-Scale Study”, arXiv, 2017

• Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, Improved Techniques for Training GANs, NIPS, 2016

• Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS, 2017

Page 92

Reference

• Generation
• Naveen Kodali, Jacob Abernethy, James Hays, Zsolt Kira, “On Convergence and Stability of GANs”, arXiv, 2017

• Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, Liqiang Wang, Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect, ICLR, 2018

• Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, Spectral Normalization for Generative Adversarial Networks, ICLR, 2018

• Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, Progressive Growing of GANs for Improved Quality, Stability, and Variation, ICLR, 2018

• Andrew Brock, Jeff Donahue, Karen Simonyan, Large Scale GAN Training for High Fidelity Natural Image Synthesis, arXiv, 2018

Page 93

Reference

• Conditional Generation
• Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee, Generative Adversarial Text to Image Synthesis, ICML, 2016

• Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, CVPR, 2017

• Michael Mathieu, Camille Couprie, Yann LeCun, Deep multi-scale video prediction beyond mean square error, arXiv, 2015

• Mehdi Mirza, Simon Osindero, Conditional Generative Adversarial Nets, arXiv, 2014

• Takeru Miyato, Masanori Koyama, cGANs with Projection Discriminator, ICLR, 2018

• Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv, 2017

• Augustus Odena, Christopher Olah, Jonathon Shlens, Conditional Image Synthesis With Auxiliary Classifier GANs, ICML, 2017

Page 94

Reference

• Conditional Generation
• Yaroslav Ganin, Victor Lempitsky, Unsupervised Domain Adaptation by Backpropagation, ICML, 2015

• Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Domain-Adversarial Training of Neural Networks, JMLR, 2016

• Che-Ping Tsai, Hung-Yi Lee, Adversarial Learning of Label Dependency: A Novel Framework for Multi-class Classification, submitted to ICASSP 2019

Page 95

Reference

• Unsupervised Conditional Generation
• Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, ICCV, 2017

• Zili Yi, Hao Zhang, Ping Tan, Minglun Gong, DualGAN: Unsupervised Dual Learning for Image-to-Image Translation, ICCV, 2017

• Tomer Galanti, Lior Wolf, Sagie Benaim, The Role of Minimal Complexity Functions in Unsupervised Learning of Semantic Mappings, ICLR, 2018

• Yaniv Taigman, Adam Polyak, Lior Wolf, Unsupervised Cross-Domain Image Generation, ICLR, 2017

• Asha Anoosheh, Eirikur Agustsson, Radu Timofte, Luc Van Gool, ComboGAN: Unrestrained Scalability for Image Domain Translation, arXiv, 2017

• Amélie Royer, Konstantinos Bousmalis, Stephan Gouws, Fred Bertsch, Inbar Mosseri, Forrester Cole, Kevin Murphy, XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings, arXiv, 2017

Page 96

Reference

• Unsupervised Conditional Generation
• Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, Marc'Aurelio Ranzato, Fader Networks: Manipulating Images by Sliding Attributes, NIPS, 2017

• Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, Jiwon Kim, Learning to Discover Cross-Domain Relations with Generative Adversarial Networks, ICML, 2017

• Ming-Yu Liu, Oncel Tuzel, “Coupled Generative Adversarial Networks”, NIPS, 2016

• Ming-Yu Liu, Thomas Breuel, Jan Kautz, Unsupervised Image-to-Image Translation Networks, NIPS, 2017

• Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, Jaegul Choo, StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation, arXiv, 2017

Page 97

Outline

Part I: General Introduction of Generative Adversarial Network (GAN)

Part II: Applications to Natural Language Processing

Part III: Applications to Speech Processing

Page 98

Unsupervised Conditional Generation

Image Style Transfer

Text Style Transfer

It is good.

It’s a good day. I love you.

It is bad.

It’s a bad day.

I don’t love you.

positive negative

photos Vincent van Gogh’s paintings

Not Paired

Not Paired

Page 99

Cycle GAN

𝐺𝑋→𝑌 𝐺Y→X

as close as possible

𝐺Y→X 𝐺𝑋→𝑌

as close as possible

𝐷𝑌: scalar, belongs to domain Y or not
𝐷𝑋: scalar, belongs to domain X or not
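The two adversarial terms and two "as close as possible" reconstruction terms in the diagram can be written as one generator objective. A toy numerical sketch, where the identity generators and the always-fooled discriminator are illustrative stand-ins rather than trained networks:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cyclegan_generator_loss(x, y, G_xy, G_yx, D_x, D_y, lam=10.0):
    """Fool both discriminators (adversarial terms) while keeping both
    cycles X -> Y -> X and Y -> X -> Y as close as possible (cycle terms)."""
    adv = -np.log(sigmoid(D_y(G_xy(x)))) - np.log(sigmoid(D_x(G_yx(y))))
    cyc = np.sum((G_yx(G_xy(x)) - x) ** 2) + np.sum((G_xy(G_yx(y)) - y) ** 2)
    return float(adv + lam * cyc)

identity = lambda v: v              # stand-in generators with perfect cycles
confused_D = lambda v: 0.0          # logit 0 = "50% real": a fooled critic
x = np.array([1.0, 2.0])
y = np.array([0.0, -1.0])
loss = cyclegan_generator_loss(x, y, identity, identity, confused_D, confused_D)
```

With perfect cycles the loss reduces to the two adversarial terms alone; the weight `lam` trades off realism against staying faithful to the input.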

Page 100

Cycle GAN

𝐺𝑋→𝑌 𝐺Y→X

as close as possible

𝐺Y→X 𝐺𝑋→𝑌

as close as possible

𝐷𝑋: negative sentence? 𝐷𝑌: positive sentence?

It is bad. (negative) → It is good. (positive) → It is bad. (negative)

I love you. (positive) → I hate you. (negative) → I love you. (positive)

Page 101

Discrete Issue

𝐺𝑋→𝑌

𝐷𝑌 positive sentence?

It is bad. (negative) → It is good. (positive)

large network

hidden layer

update

fix

Backpropagation

with discrete output

Seq2seq model

Page 102

Three Categories of Solutions

Gumbel-softmax

• [Matt J. Kusner, et al, arXiv, 2016]

Continuous Input for Discriminator

• [Sai Rajeswar, et al., arXiv, 2017][Ofir Press, et al., ICML workshop, 2017][Zhen Xu, et al., EMNLP, 2017][Alex Lamb, et al., NIPS, 2016][Yizhe Zhang, et al., ICML, 2017]

“Reinforcement Learning”

• [Yu, et al., AAAI, 2017][Li, et al., EMNLP, 2017][Tong Che, et al, arXiv, 2017][Jiaxian Guo, et al., AAAI, 2018][Kevin Lin, et al, NIPS, 2017][William Fedus, et al., ICLR, 2018]
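The first category, Gumbel-softmax, replaces the non-differentiable sampling of a discrete word with a temperature-controlled softmax over Gumbel-perturbed logits. A minimal sketch (the three-word vocabulary and temperature are illustrative):

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Differentiable stand-in for sampling one word: perturb the logits
    with Gumbel noise, then apply a temperature-controlled softmax."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
y = gumbel_softmax(np.array([2.0, 0.5, 0.1]), tau=0.5, rng=rng)
```

As tau → 0 the output approaches a one-hot sample (a discrete word); larger tau gives a softer distribution through which gradients flow to the generator.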

Page 103

Cycle GAN

𝐺𝑋→𝑌 𝐺Y→X

as close as possible

𝐺Y→X 𝐺𝑋→𝑌

as close as possible

𝐷𝑋: negative sentence? 𝐷𝑌: positive sentence?

It is bad. (negative) → It is good. (positive) → It is bad. (negative)

I love you. (positive) → I hate you. (negative) → I love you. (positive)

Word embedding [Lee, et al., ICASSP, 2018]

Discrete?

Page 104

Cycle GAN

• Negative sentence to positive sentence:
it's a crappy day → it's a great day
i wish you could be here → you could be here
it's not a good idea → it's good idea
i miss you → i love you
i don't love you → i love you
i can't do that → i can do that
i feel so sad → i happy
it's a bad day → it's a good day
it's a dummy day → it's a great day
sorry for doing such a horrible thing → thanks for doing a great thing
my doggy is sick → my doggy is my doggy
my little doggy is sick → my little doggy is my little doggy

Thanks to Yau-Shian Wang for providing the results.

Page 105

✘ Negative sentence to positive sentence:

Cycle GAN

Thanks to 張瓊之 for providing the results.

胃疼 , 沒睡醒 , 各種不舒服 -> 生日快樂 , 睡醒 , 超級舒服 (Stomach ache, not awake, uncomfortable all over -> Happy birthday, awake, super comfortable)

我都想去上班了, 真夠賤的! -> 我都想去睡了, 真帥的 ! (I even feel like going to work, how mean! -> I even feel like going to sleep, how handsome!)

暈死了, 吃燒烤、竟然遇到個變態狂 -> 哈哈好 ~ , 吃燒烤 ~ 竟然遇到帥狂 (So dizzy; went for barbecue and ran into a pervert -> Haha nice ~, went for barbecue ~ and ran into a handsome guy)

我肚子痛的厲害 -> 我生日快樂厲害 (My stomach hurts badly -> My happy birthday badly)

感冒了, 難受的說不出話來了 ! -> 感冒了, 開心的說不出話來 ! (Caught a cold, so miserable I cannot speak! -> Caught a cold, so happy I cannot speak!)

Page 106

𝐸𝑁𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

𝐷𝐸𝑋 𝐷𝑋

𝐷𝑌

Discriminator of X domain

Discriminator of Y domain

Projection to Common Space

Positive Sentence

Positive Sentence

Negative Sentence

Negative Sentence

Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017]

From 𝐸𝑁𝑋 or 𝐸𝑁𝑌

Domain Discriminator

𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the domain discriminator

[Zhao, et al., ICML 2018]

[Fu, et al., AAAI, 2018]

Page 107

Unsupervised Conditional Generation

Image Style Transfer

Text Style Transfer

photos Vincent van Gogh’s paintings

Not Paired

Not Paired

summary document

This is unsupervised abstractive summarization.

Page 108

Abstractive Summarization

• The machine can now do abstractive summarization by seq2seq (writing summaries in its own words)

summary 1

summary 2

summary 3

Training Data

summary

seq2seq(in its own words)

Supervised: We need lots of labelled training data.

Page 109

Unsupervised Abstractive Summarization

• The machine can now do abstractive summarization by seq2seq (writing summaries in its own words)

summary 1

summary 2

summary 3

seq2seq document

Domain X Domain Y

Page 110

G

Seq2seq

document

word sequence

D

Human written summaries Real or not

Discriminator

Unsupervised Abstractive Summarization

Summary?

Page 111

G

Seq2seq

document

word sequence

D

Human written summaries Real or not

Discriminator

R

Seq2seq

document

Unsupervised Abstractive Summarization

minimize the reconstruction error

Page 112

Unsupervised Abstractive Summarization

G RSummary?

Seq2seq Seq2seq

document document

word sequence

Only need a lot of documents to train the model

This is a seq2seq2seq auto-encoder.

Using a sequence of words as latent representation.

not readable …
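A toy caricature of the seq2seq2seq idea: G compresses the document into a short word sequence, R expands it back, and reconstruction succeeds only if the kept words carry the document's content. The word weights and the expansion table below are invented purely for illustration; the real model uses trained seq2seq networks.

```python
import numpy as np

# Toy "seq2seq2seq" auto-encoder: G keeps the k highest-weight words
# as the latent "summary"; R expands each kept word back.
def G(document, weights, k=2):
    idx = np.argsort([-weights[w] for w in document])[:k]
    return [document[i] for i in sorted(idx)]   # keep original word order

def R(summary, expansion):
    out = []
    for w in summary:
        out.extend(expansion.get(w, [w]))
    return out

doc = ["the", "flood", "killed", "sixty", "people"]
weights = {"the": 0.1, "flood": 0.9, "killed": 0.8, "sixty": 0.6, "people": 0.5}
summary = G(doc, weights)
reconstruction = R(summary, {"flood": ["the", "flood"],
                             "killed": ["killed", "sixty", "people"]})
```

Minimizing the reconstruction error forces the latent word sequence to be informative; the discriminator (next slide) is what additionally forces it to be readable.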

Page 113

Unsupervised Abstractive Summarization

G R

Seq2seq Seq2seq

word sequence

D

Human written summaries Real or not

Discriminator

Let the discriminator consider my output as real

document document

Summary?

Readable

REINFORCE algorithm to deal with the discrete issue
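A minimal sketch of the REINFORCE update used here: sample a discrete token, then push up its log-probability in proportion to the reward, with the discriminator's "realness" score standing in as that reward. The three-token vocabulary and fixed reward vector are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, reward, rng, lr=0.1):
    """Sample a discrete token, then scale the grad of its log-probability
    by its reward; no backprop through the sampling step is needed."""
    p = softmax(theta)
    tok = rng.choice(len(p), p=p)
    grad = -p.copy()
    grad[tok] += 1.0                    # d log p(tok) / d theta
    return theta + lr * reward[tok] * grad

rng = np.random.default_rng(0)
theta = np.zeros(3)                     # logits over a 3-token toy vocabulary
reward = np.array([1.0, 0.0, 0.0])      # the discriminator "likes" token 0
for _ in range(200):
    theta = reinforce_step(theta, reward, rng)
p = softmax(theta)                      # token 0 now dominates
```

Because only the sampled token's reward is used, this estimator sidesteps the discrete-output problem that blocks ordinary backpropagation.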

Page 114

Experimental results

                                 ROUGE-1   ROUGE-2   ROUGE-L
Supervised                        33.2      14.2      30.5
Trivial                           21.9       7.7      20.5
Unsupervised (matched data)       28.1      10.0      25.4
Unsupervised (no matched data)    27.2       9.1      24.1

English Gigaword (Document title as summary)

• Matched data: using the titles of English Gigaword to train the Discriminator

• No matched data: using the titles of CNN/Daily Mail to train the Discriminator

Page 115

Semi-supervised Learning

[Figure: ROUGE-1 (y-axis, roughly 25 to 34) vs. number of document-summary pairs used (x-axis: 0, 10k, 500k), comparing WGAN and Reinforce (two approaches to deal with the discrete issue) against Supervised training, which uses 3.8M pairs. Matched data is used. With 0 pairs the methods are unsupervised; as labelled pairs are added, semi-supervised.]

Page 116

Unsupervised Abstractive Summarization

• Document: 澳大利亞今天與13個國家簽署了反興奮劑雙邊協議,旨在加強體育競賽之外的藥品檢查並共享研究成果 …… (Australia today signed bilateral anti-doping agreements with 13 countries, aiming to strengthen out-of-competition drug testing and share research results ……)

• Summary:

• Human: 澳大利亞與13國簽署反興奮劑協議 (Australia signs anti-doping agreements with 13 countries)

• Unsupervised: 澳大利亞加強體育競賽之外的藥品檢查 (Australia strengthens out-of-competition drug testing)

• Document: 中華民國奧林匹克委員會今天接到一九九二年冬季奧運會邀請函,由於主席張豐緒目前正在中南美洲進行友好訪問,因此尚未決定是否派隊赴賽 …… (The ROC Olympic Committee today received an invitation to the 1992 Winter Olympics; as chairman 張豐緒 is currently on a goodwill visit to Central and South America, it has not yet decided whether to send a team ……)

• Summary:

• Human: 一九九二年冬季奧運會函邀我參加 (We are invited by letter to the 1992 Winter Olympics)

• Unsupervised: 奧委會接獲冬季奧運會邀請函 (Olympic Committee receives Winter Olympics invitation)

Thanks to Yau-Shian Wang for providing the results.

Page 117

Unsupervised Abstractive Summarization

• Document: 據此間媒體27日報道,印度尼西亞蘇門答臘島的兩個省近日來連降暴雨,洪水泛濫導致塌方,到26日為止至少已有60人喪生,100多人失蹤 …… (Local media reported on the 27th that two provinces on Indonesia's Sumatra island have had days of torrential rain; flooding caused landslides, and as of the 26th at least 60 people had died and more than 100 were missing ……)

• Summary:

• Human: 印尼水災造成60人死亡 (Indonesian floods kill 60)

• Unsupervised: 印尼門洪水泛濫導致塌雨 (roughly: "Indonesia-gate flooding causes collapse rain"; the output itself is ungrammatical)

• Document: 安徽省合肥市最近為領導幹部下基層做了新規定:一律輕車簡從,不準搞迎來送往、不準搞層層陪同 …… (Hefei, Anhui recently set new rules for officials visiting the grassroots: travel light with a small entourage, no welcome-and-send-off ceremonies, no layers of accompanying staff ……)

• Summary:

• Human: 合肥規定領導幹部下基層活動從簡 (Hefei rules that officials' grassroots visits be kept simple)

• Unsupervised: 合肥領導幹部下基層做搞迎來送往規定:一律簡 (roughly: "Hefei officials grassroots do welcome-and-send-off rules: all simple"; ungrammatical)

Thanks to Yau-Shian Wang for providing the results.

Page 118

Outline

Part I: General Introduction of Generative Adversarial Network (GAN)

Part II: Applications to Natural Language Processing

Part III: Applications to Speech Processing

Page 119

Unsupervised Conditional Generation

Image Style Transfer

Speech Style Transfer

photos Vincent van Gogh’s paintings

Not Paired

Not Paired

This is unsupervised voice conversion.

Speaker A Speaker B

Page 120

Voice Conversion

Page 121

Voice Conversion

• The same sentence has a different impact when it is said by different people.

Do you want to study a PhD?

Go away! Student

Student

Do you want to study a PhD? 新垣結衣

(Aragaki Yui)

Page 122

In the past (parallel data):

Speaker A: How are you? / Good morning
Speaker B: How are you? / Good morning

With GAN (unpaired data):

Speaker A: 天氣真好 (The weather is nice) / 再見囉 (Goodbye)
Speaker B: How are you? / Good morning

Speakers A and B are talking about completely different things.

Page 123

Cycle GAN

𝐺𝑋→𝑌 𝐺Y→X

as close as possible

𝐺Y→X 𝐺𝑋→𝑌

as close as possible

𝐷𝑌: scalar, belongs to domain Y or not
𝐷𝑋: scalar, belongs to domain X or not

Page 124

Cycle GANfor Voice Conversion

𝐺𝑋→𝑌 𝐺Y→X

as close as possible

𝐺Y→X 𝐺𝑋→𝑌

as close as possible

𝐷𝑌: scalar, belongs to domain Y or not
𝐷𝑋: scalar, belongs to domain X or not

X: Speaker A, Y: Speaker B [Takuhiro Kaneko, et al., arXiv, 2017] [Fuming Fang, et al., ICASSP, 2018] [Yang Gao, et al., ICASSP, 2018]

spectrogram spectrogram

Page 125

𝐸𝑁𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

𝐷𝐸𝑋

Projection to Common Space

Page 126

𝐸𝑁𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

𝐷𝐸𝑋

Projection to Common Space

• All the speakers share the same encoder.

• The model can deal with the speakers never seen during training.

Encoder

Encoder

Page 127

𝐸𝑁𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

𝐷𝐸𝑋

Projection to Common Space

All the speakers also share the same decoder.

Encoder

Encoder Decoder

Decoder

Which speaker? (Discriminator)

We hope the encoder extracts the phonetic information while removing the speaker information.

The encoder fools the discriminator.

Use a vector (one-hot) to represent speaker identity.
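One way to express "the encoder fools the speaker discriminator" as a loss: the encoder is happiest when the classifier's posterior over speakers, given the latent code, is uniform. A toy sketch; the linear classifier, its dimensions, and the latent code are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def encoder_adversarial_loss(h, W_spk, n_speakers):
    """The encoder 'wins' when the speaker classifier's posterior over
    speakers, given latent code h, is uniform: no speaker info is left."""
    posterior = softmax(W_spk @ h)
    uniform = np.full(n_speakers, 1.0 / n_speakers)
    return float(np.sum((posterior - uniform) ** 2))

h = np.array([0.2, -0.1])       # latent code from the shared encoder
W = np.zeros((4, 2))            # a classifier that cannot tell 4 speakers apart
loss = encoder_adversarial_loss(h, W, 4)
```

In training this objective alternates with updating the speaker classifier itself, exactly like the generator/discriminator alternation in a GAN.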

Page 128

Encoder

Decoder

reconstructed

Training

Testing

Decoder

A is reading the sentence of B

Projection to Common Space

A How are you?

A

Which speakers?

Discriminator

B Hello

A Hello

A Encoder

phonetic information

Page 129

EncoderHow are you?

Which speakers?

Discriminator

Does it contain phonetic information?

Different colors: different words

Different colors: different speakers

Page 130

“Audio” Word to Vector

Page 131

Encoder

Decoder

reconstructed

Training

Testing

Decoder

A is reading the sentence of B

Issues

A How are you?

A

Which speakers?

Discriminator

B Hello

A Hello

A Encoder

Different Speakers

The Same Speakers

Low Quality

Page 132

2nd Stage Training

Discriminator: real or generated?

which speaker?

Cheat discriminator

Speaker

Classifier

Help speaker classifier

Decoder B Hello

A

A Encoder

Extra Criterion for Training

Hello

Different Speakers: no learning target???

Page 133

• Subjective evaluations (20 speakers in VCTK)

Experimental Results [Chou et al., INTERSPEECH, 2018]

“Two stages” is better

“One stage” is better

Indistinguishable

“Projection” is better

“Cycle GAN” is better

Indistinguishable

Page 134

Demo

Target:

Source:

Source to Target:

https://jjery2243542.github.io/voice_conversion_demo/

Thanks to Ju-chieh Chou for providing the results.

Decoder

A is reading the sentence of B

B Hello

A Hello

A Encoder

Source Speaker

Target Speaker

Page 135

Source Speaker

Target Speaker

Me

https://jjery2243542.github.io/voice_conversion_demo/

(Never seen during training!)

Me

Me

Source to Target

Thanks to Ju-chieh Chou for providing the results.

Me

Page 136

Unsupervised Conditional Generation

Audio

Not Paired

This is unsupervised speech recognition.

Text

Page 137

Supervised Speech Recognition

https://devopedia.org/images/article/102/9180.1532710057.png

(I believe you have seen similar figures before.)

• Supervised learning needs lots of annotated speech.
• However, most languages are low-resource.

Page 138

Speech Recognition in the Future

http://www.parenting.com/article/teach-baby-to-talk

Learning human language with very little supervision

Page 139

Unsupervised Speech Recognition

• The machine learns to recognize speech from non-parallel speech and text.

audio collection(without text annotation)

text documents(not parallel to audio)

This idea was too crazy to be realized in the past.

However, it has recently become possible with GAN.

[Liu, et al., INTERSPEECH, 2018]

[Chen, et al., arXiv, 2018]

Page 140

Acoustic Token Discovery

Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09]

[Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]

Acoustic tokens can be discovered from audio collection without text annotation.

Page 141

Acoustic Token Discovery

Token 1

Token 1

Token 1, Token 2

Token 3

Token 3

Token 3

Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09]

[Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]

Acoustic tokens can be discovered from audio collection without text annotation.

Token 2

Token 4

Page 142

Acoustic Token Discovery

Phonetic-level acoustic tokens are obtained by a segmental sequence-to-sequence autoencoder.

[Wang, et al., ICASSP, 2018]

Page 143

Unsupervised Speech Recognition

AY L AH V Y UW

G UH D B AY

HH AW AA R Y UW

T AY W AA N

AY M F AY N

CycleGAN

“AY”=

Phone-level Acoustic Pattern Discovery

p1

p1 p3 p2

p1 p4 p3 p5 p5

p1 p5 p4 p3

p1 p2 p3 p4

Phoneme sequences from Text

[Liu, et al., INTERSPEECH, 2018]

[Chen, et al., arXiv, 2018]

Page 144
Page 145

The progress of supervised learning (phone recognition accuracy on TIMIT)

Unsupervised learning today (2019) is as good as supervised learning 30 years ago.

The image is modified from: Phone recognition on the TIMIT database, Lopes, C. and Perdigão, F., 2011. Speech Technologies, Vol 1, pp. 285-302.

Page 146

Concluding Remarks

Part I: General Introduction of Generative Adversarial Network (GAN)

Part II: Applications to Natural Language Processing

Part III: Applications to Speech Processing

Page 147

To Learn More …

https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf

You can learn more from the YouTube Channel

(in Mandarin)