
Adversarial Approaches to Bayesian Learning and Bayesian Approaches to Adversarial Robustness

Ian Goodfellow, OpenAI Research Scientist

NIPS 2016 Workshop on Bayesian Deep Learning, Barcelona, 2016-12-10

(Goodfellow 2016)

Speculation on Three Topics

• Can we build a generative adversarial model of the posterior over parameters?

• Adversarial variants of variational Bayes

• Can Bayesian modeling solve adversarial examples?


Generative Modeling

• Density estimation

• Sample generation

[Figure: two panels, "Training examples" and "Model samples"]


Adversarial Nets Framework

[Diagram: two paths into the discriminator]

• Data path: x sampled from data → differentiable function D → D(x) tries to be near 1

• Model path: input noise z → differentiable function G → x sampled from model → D

• D tries to make D(G(z)) near 0, G tries to make D(G(z)) near 1


Minimax Game

- Equilibrium is a saddle point of the discriminator loss

- Resembles Jensen-Shannon divergence

- Generator minimizes the log-probability of the discriminator being correct

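Maximum likelihood

$$\theta^* = \arg\max_\theta \, \mathbb{E}_{x \sim p_{\text{data}}} \log p_{\text{model}}(x \mid \theta)$$

Fully-visible belief net

$$p_{\text{model}}(x) = p_{\text{model}}(x_1) \prod_{i=2}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})$$

Change of variables

$$y = g(x) \Rightarrow p_x(x) = p_y(g(x)) \left| \det\left( \frac{\partial g(x)}{\partial x} \right) \right|$$

Variational bound

$$\log p(x) \geq \log p(x) - D_{\mathrm{KL}}\left(q(z) \,\|\, p(z \mid x)\right) \tag{1}$$

$$= \mathbb{E}_{z \sim q} \log p(x, z) + H(q) \tag{2}$$

Boltzmann Machines

$$p(x) = \frac{1}{Z} \exp\left(-E(x, z)\right) \tag{3}$$

$$Z = \sum_x \sum_z \exp\left(-E(x, z)\right) \tag{4}$$

Generator equation

$$x = G(z; \theta^{(G)})$$

Minimax

$$J^{(D)} = -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \log D(x) - \frac{1}{2} \mathbb{E}_z \log\left(1 - D(G(z))\right) \tag{5}$$

$$J^{(G)} = -J^{(D)} \tag{6}$$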
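As a minimal sketch of the minimax costs in equations (5) and (6), the expectations can be replaced by averages over discriminator outputs. The function names and the example output values below are hypothetical, chosen only for illustration.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Monte Carlo estimate of J^(D) from discriminator outputs in (0, 1).

    d_real: values of D(x) on samples drawn from the data.
    d_fake: values of D(G(z)) on samples drawn from the generator.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return -0.5 * real_term - 0.5 * fake_term

def generator_loss(d_real, d_fake):
    """Zero-sum generator cost: J^(G) = -J^(D)."""
    return -discriminator_loss(d_real, d_fake)

# Hypothetical discriminator outputs, not taken from any experiment.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.1, 0.3, 0.2]
j_d = discriminator_loss(d_real, d_fake)
j_g = generator_loss(d_real, d_fake)

# A discriminator that outputs 1/2 everywhere has cost log 2.
j_d_chance = discriminator_loss([0.5] * 4, [0.5] * 4)
print(j_d, j_g, j_d_chance)
```

Because the game is zero-sum, the two costs always sum to zero, and a discriminator that is better than chance has cost below log 2.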


Discriminator Strategy

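$$\theta^* = \arg\max_\theta \frac{1}{m} \sum_{i=1}^{m} \log p\left(x^{(i)}; \theta\right)$$

$$p(h, x) = \frac{1}{Z} \tilde{p}(h, x)$$

$$\tilde{p}(h, x) = \exp\left(-E(h, x)\right)$$

$$Z = \sum_{h, x} \tilde{p}(h, x)$$

$$\frac{d}{d\theta_i} \log p(x) = \frac{d}{d\theta_i} \left[ \log \sum_h \tilde{p}(h, x) - \log Z(\theta) \right]$$

$$\frac{d}{d\theta_i} \log Z(\theta) = \frac{\frac{d}{d\theta_i} Z(\theta)}{Z(\theta)}$$

$$p(x, h) = p(x \mid h^{(1)}) \, p(h^{(1)} \mid h^{(2)}) \cdots p(h^{(L-1)} \mid h^{(L)}) \, p(h^{(L)})$$

$$\frac{d}{d\theta_i} \log p(x) = \frac{\frac{d}{d\theta_i} p(x)}{p(x)}$$

$$p(x) = \sum_h p(x \mid h) \, p(h)$$

$$D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}$$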

In other words, D and G play the following two-player minimax game with value function V(G, D):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{1}$$

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
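The alternating schedule (k discriminator steps, then one generator step) can be sketched on a toy one-dimensional problem. Everything specific here is an assumption made for illustration and is not taken from the paper's Algorithm 1: the data are N(2, 1), the generator is the shift G(z) = z + b, the discriminator is the logistic model D(x) = sigmoid(w x + c), the gradients are written out by hand, and the generator step uses the log D(G(z)) objective.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_toy_gan(steps=1500, k=3, batch=32, lr=0.05, seed=0):
    """Alternate k gradient steps on D with one gradient step on G."""
    rng = random.Random(seed)
    w, c = 0.0, 0.0   # discriminator parameters: D(x) = sigmoid(w * x + c)
    b = 0.0           # generator parameter: G(z) = z + b with z ~ N(0, 1)
    for _ in range(steps):
        # k steps of gradient ascent on mean log D(x) + mean log(1 - D(G(z))).
        for _ in range(k):
            grad_w, grad_c = 0.0, 0.0
            for _ in range(batch):
                x = rng.gauss(2.0, 1.0)        # sample from the toy data
                g = rng.gauss(0.0, 1.0) + b    # sample from the generator
                d_x = sigmoid(w * x + c)
                d_g = sigmoid(w * g + c)
                grad_w += (1.0 - d_x) * x - d_g * g
                grad_c += (1.0 - d_x) - d_g
            w += lr * grad_w / batch
            c += lr * grad_c / batch
        # One step of gradient ascent on mean log D(G(z)) with respect to b.
        grad_b = 0.0
        for _ in range(batch):
            g = rng.gauss(0.0, 1.0) + b
            grad_b += (1.0 - sigmoid(w * g + c)) * w
        b += lr * grad_b / batch
    return b, w, c

final_b, final_w, final_c = train_toy_gan()
print("generator shift b:", final_b)
print("discriminator (w, c):", final_w, final_c)
```

In this toy setting the generator shift should drift toward the data mean while the discriminator weights shrink toward zero, which corresponds to D(x) approaching 1/2. The exact values depend on the seed and step sizes.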

In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 − D(G(z))) saturates. Rather than training G to minimize log(1 − D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
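The saturation argument can be checked numerically. The sketch below assumes the discriminator output is D(G(z)) = sigmoid(a) for a logit a, and compares the derivative with respect to a of the two generator objectives. The function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def minimax_term(a):
    """log(1 - D(G(z))) with D = sigmoid(a); G minimizes this in equation 1."""
    return math.log(1.0 - sigmoid(a))

def non_saturating_loss(a):
    """-log D(G(z)) with D = sigmoid(a); minimizing it maximizes log D(G(z))."""
    return -math.log(sigmoid(a))

def grad_minimax_term(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating_loss(a):
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# Very negative logits mean D confidently rejects the generated sample.
for a in (-8.0, -4.0, 0.0, 4.0):
    print(a, sigmoid(a), grad_minimax_term(a), grad_non_saturating_loss(a))
```

When D confidently rejects a sample (a strongly negative), the gradient of log(1 − D(G(z))) is close to zero, while the gradient of −log D(G(z)) stays close to −1.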

[Figure 1 panels: (a), (b), (c), (d)]

Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) $p_x$ from those of the generative distribution $p_g$ (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution $p_g$ on transformed samples. G contracts in regions of high density and expands in regions of low density of $p_g$. (a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{\text{data}}$ and D is a partially accurate classifier. (b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. (c) After an update to G, gradient of D has guided G(z) to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve because $p_g = p_{\text{data}}$. The discriminator is unable to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$.

4 Theoretical Results

The generator G implicitly defines a probability distribution $p_g$ as the distribution of the samples G(z) obtained when $z \sim p_z$. Therefore, we would like Algorithm 1 to converge to a good estimator of $p_{\text{data}}$, if given enough capacity and training time. The results of this section are done in a non-parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.

We will show in section 4.1 that this minimax game has a global optimum for $p_g = p_{\text{data}}$. We will then show in section 4.2 that Algorithm 1 optimizes Eq 1, thus obtaining the desired result.

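[Figure labels: Data, Model distribution, z, x, Discriminator]

Optimal D(x) for any $p_{\text{data}}(x)$ and $p_{\text{model}}(x)$ is always the ratio $p_{\text{data}}(x) / (p_{\text{data}}(x) + p_{\text{model}}(x))$.

Estimating this ratio using supervised learning is the key approximation mechanism used by GANs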
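A small sketch of why the optimal discriminator acts as a ratio estimator: if D is optimal, then D(x) / (1 − D(x)) equals the density ratio of data to model. The two Gaussian densities below are hypothetical stand-ins for the data and model distributions, and the closed-form discriminator replaces one learned by supervised training.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def p_data(x):
    # Hypothetical data density: N(1, 1).
    return normal_pdf(x, 1.0, 1.0)

def p_model(x):
    # Hypothetical model density: N(-1, 1).
    return normal_pdf(x, -1.0, 1.0)

def optimal_discriminator(x):
    """D(x) = p_data(x) / (p_data(x) + p_model(x))."""
    pd, pm = p_data(x), p_model(x)
    return pd / (pd + pm)

x = 0.5
d = optimal_discriminator(x)
ratio_from_d = d / (1.0 - d)          # recovered from the discriminator alone
true_ratio = p_data(x) / p_model(x)   # computed from the densities directly
print(d, ratio_from_d, true_ratio)
```

A trained discriminator only approximates this optimum, so in practice the recovered ratio is an estimate whose quality depends on how well D was fit.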


High quality samples from complicated distributions


Speculative idea: generator nets for sampling from the posterior

• Practical obstacle:

• Parameters lie in a much higher dimensional space than observed inputs

• Possible solution:

• Maybe the posterior does not need to be extremely complicated

• HyperNetworks (Ha et al 2016) seem to be able to model a distribution on parameters


Theoretical problems

• A naive application of GANs to generating parameters would require samples of the parameters from the true posterior

• We only have samples of the data that were generated using the true posterior


HMC approach?

$$\frac{p(X \mid \theta)}{p(X \mid \theta^*)} = \prod_i \frac{p(x^{(i)} \mid \theta)}{p(x^{(i)} \mid \theta^*)}$$

• Allows estimation of unnormalized likelihoods via discriminator

• Drawbacks:

• Discriminator needs to be re-optimized after visiting each new parameter value

• For the likelihood estimate to be a function of the parameters, we must include the discriminator learning process in the graph for the estimate, as in unrolled GANs (Metz et al 2016)
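A hedged sketch of how a discriminator could supply the likelihood ratio above: if D(x) is the probability that x came from p(x | θ*) rather than p(x | θ), then each factor in the product equals (1 − D(x)) / D(x). The code assumes a unit-variance Gaussian model with mean θ and uses the closed-form ideal discriminator as a stand-in for a trained one. The function names and data values are hypothetical.

```python
import math

def normal_logpdf(x, mean, std=1.0):
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

def ideal_discriminator(x, theta, theta_star):
    """Stand-in for a trained D: P(x came from p(. | theta_star)) against p(. | theta)."""
    log_ratio = normal_logpdf(x, theta_star) - normal_logpdf(x, theta)
    return 1.0 / (1.0 + math.exp(-log_ratio))

def log_likelihood_ratio_from_d(data, theta, theta_star):
    """log [p(X | theta) / p(X | theta_star)] as a sum of log[(1 - D) / D]."""
    total = 0.0
    for x in data:
        d = ideal_discriminator(x, theta, theta_star)
        total += math.log(1.0 - d) - math.log(d)
    return total

# Hypothetical observations and parameter values.
data = [0.3, -0.7, 1.9, 1.1, 0.4]
theta_star, theta = 1.0, 0.2
estimate = log_likelihood_ratio_from_d(data, theta, theta_star)
exact = sum(normal_logpdf(x, theta) - normal_logpdf(x, theta_star) for x in data)
print(estimate, exact)
```

With a learned discriminator the same sum would only approximate the ratio, and D would need to be refit for every new θ, which is the first drawback listed above.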


Variational Bayes

[Figure: graphical model with nodes z and x]

Variational bound

$$\log p(x) \geq \log p(x) - D_{\mathrm{KL}}\left(q(z) \,\|\, p(z \mid x)\right) \tag{1}$$

$$= \mathbb{E}_{z \sim q} \log p(x, z) + H(q) \tag{2}$$

• Same graphical model structure as GANs

• Often limited by expressivity of q
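The variational bound and the cost of a restricted q can be verified on a tiny discrete example. The model below (a binary latent z with made-up probabilities and a single fixed observation x) is an assumption for illustration only.

```python
import math

# Hypothetical model: binary latent z, one fixed observation x.
prior = [0.3, 0.7]         # p(z)
likelihood = [0.9, 0.2]    # p(x | z) for the observed x
joint = [p * l for p, l in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                 # p(x)
posterior = [j / evidence for j in joint]             # p(z | x)

def elbo(q):
    """E_{z~q} log p(x, z) + H(q) for a distribution q over z."""
    expected_log_joint = sum(qz * math.log(j) for qz, j in zip(q, joint) if qz > 0)
    entropy = -sum(qz * math.log(qz) for qz in q if qz > 0)
    return expected_log_joint + entropy

def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

log_evidence = math.log(evidence)
q_restricted = [0.5, 0.5]   # a q that cannot match the true posterior
gap = log_evidence - elbo(q_restricted)
print(log_evidence, elbo(posterior), elbo(q_restricted), gap)
```

The bound is tight when q equals the true posterior, and otherwise the gap is exactly the KL divergence from q to the posterior, which is what a more expressive q is meant to reduce.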


Arbitrary capacity posterior via backwards GAN

[Figure: two graphical models. Generation process: nodes z and x. Posterior sampling process: nodes z, x, and u.]


Related variants

• Adversarial autoencoder (Makhzani et al 2015)

• Variational lower bound for training decoder

• Adversarial training of encoder

• Restricted encoder

• Makes aggregate approximate posterior indistinguishable from prior, rather than approximate posterior indistinguishable from true posterior

• Uses variational lower bound for training decoder


ALI / BiGAN

• Adversarially Learned Inference (Dumoulin et al 2016)

• Gaussian encoder

• BiGAN (Donahue et al 2016)

• Deterministic encoder


Adversarial Examples

panda 58% confidence

gibbon 99% confidence


Overly linear, increasingly confident extrapolation

[Figure: plot with axis label "Argument to softmax"]


Designing priors on latent factors

- Both of these two-class mixture models implement roughly the same marginal over x, but with very different posteriors over the classes. The likelihood criterion cannot strongly prefer one to the other, and in many cases will prefer the bad one.


RBFs are better than linear models

[Figure: two panels, "Attacking a linear model" and "Attacking an RBF model"]
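A rough sketch of the contrast between the two model families, under simple assumptions: a logistic linear classifier grows more confident the further an input moves along its weight vector, while an RBF unit responds only near its prototype. The weights, prototype, and function names are hypothetical.

```python
import math

def linear_confidence(x, w, b=0.0):
    """Logistic-regression confidence sigmoid(w . x + b)."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def rbf_response(x, center, beta=1.0):
    """RBF unit exp(-beta * ||x - center||^2)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-beta * sq_dist)

w = [1.0, -1.0]        # hypothetical linear weights
center = [1.0, -1.0]   # hypothetical RBF prototype

# Move further and further along the direction of w.
for scale in (1.0, 3.0, 10.0):
    x = [scale * wi for wi in w]
    print(scale, linear_confidence(x, w), rbf_response(x, center))
```

Far from the prototype the linear model's confidence approaches 1 while the RBF response decays toward 0, which matches the "increasingly confident extrapolation" behaviour described for linear models.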


Possible Bayesian solutions

• Bayesian neural network

• Better confidence estimates might solve the problem

• So far, has not worked, but may just need more effort

• Variational approach

• MC dropout

• Regularize neural network to emulate Bayesian model with RBF kernel (amortized inference of Bayesian model)


Universal engineering machine (model-based optimization)

[Figure: plot with regions labeled "Training data" and "Extrapolation"]

Make new inventions by finding input that maximizes model’s predicted performance


Conclusion

• Generative adversarial nets may be able to

• Sample from the Bayesian posterior over parameters

• Implement an arbitrary capacity q for variational Bayes

• Bayesian learning may be able to solve the adversarial example problem and unlock the potential of model-based optimization
