The Expressive Power of Planar Flows

Zhifeng Kong, Kamalika Chaudhuri

University of California San Diego
{z4kong, kamalika}@eng.ucsd.edu

Abstract

Normalizing flows have received a great deal of recent attention as they allow flexible generative modeling as well as easy likelihood computation. While a wide variety of flow models have been proposed, there is little formal understanding of the representation power of these models. In this work, we study a class of simple normalizing flows called planar flows, and rigorously establish bounds on their expressive power. Our results indicate that while these flows are highly expressive in one dimension, in higher dimensions their representation power may be limited, especially for planar flows of moderate depth.

Background: Normalizing Flows (NF)

Figure: A normalizing flow model that transforms the source distribution $p_0(z_0)$ to the target distribution $p_K(z_K)$. Figure by Lilian Weng, https://lilianweng.github.io/lil-log/assets/images/normalizing-flow.png

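- $z_i \in \mathbb{R}^d$, $0 \le i \le K$.
- $p_K = f\#p_0$, where $f = f_K \circ \cdots \circ f_1$.
- Each $f_i : \mathbb{R}^d \to \mathbb{R}^d$ is simple, invertible, and parameterized.
- Density computation, with each Jacobian evaluated at the input $z_{i-1}$ of its layer:

  $$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log\left|\det J_{f_i}(z_{i-1})\right|, \quad z_i = f_i(z_{i-1}),\ 1 \le i \le K$$

- Solving the maximum likelihood estimation (MLE) problem ⇒ a generative model with computable likelihood.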
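The density computation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the elementwise affine layers built by `make_affine_layer` are hypothetical stand-ins for the $f_i$ (they are not planar flows), chosen because the result can be checked against a closed-form Gaussian density.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of N(0, I) at a point z of shape (d,)."""
    d = z.shape[0]
    return -0.5 * (z @ z) - 0.5 * d * np.log(2.0 * np.pi)

def make_affine_layer(scale, shift):
    """Hypothetical stand-in layer f_i(z) = scale * z + shift (elementwise)."""
    scale = np.asarray(scale, dtype=float)
    shift = np.asarray(shift, dtype=float)

    def forward(z):
        # The Jacobian is diag(scale), so log|det J| = sum(log|scale|).
        return scale * z + shift, np.sum(np.log(np.abs(scale)))

    return forward

def flow_log_density(z0, layers, base_logpdf=standard_normal_logpdf):
    """Return (z_K, log p_K(z_K)) by accumulating the log-det terms."""
    z, logp = z0, base_logpdf(z0)
    for layer in layers:
        z, logdet = layer(z)  # logdet is evaluated at the layer's input z_{i-1}
        logp = logp - logdet
    return z, logp

# Two layers: z -> 2z, then z -> z + 1, so p_K is the N(1, 4) density.
layers = [make_affine_layer([2.0], [0.0]), make_affine_layer([1.0], [1.0])]
zK, logpK = flow_log_density(np.array([0.3]), layers)
```

Here `zK` is the transformed sample and `logpK` its log-density under the pushed-forward distribution; swapping in other invertible layers only requires that each one return its own log-determinant.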

Definition (planar flow [Rezende and Mohamed, 2015])
A planar flow $f_{\mathrm{pf}}$ is defined by an invertible function $f_{\mathrm{pf}}(z) = z + u\,h(w^\top z + b)$, where $u, w, z \in \mathbb{R}^d$, $b \in \mathbb{R}$, with non-linearity $h : \mathbb{R} \to \mathbb{R}$.
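A minimal sketch of a single planar flow layer follows, assuming $h = \tanh$. The function name `planar_flow` and the random parameters are illustrative. The log-determinant uses the matrix determinant lemma, $\det(I + h'(a)\,u w^\top) = 1 + h'(a)\,w^\top u$ with $a = w^\top z + b$, and the sketch does not itself enforce invertibility.

```python
import numpy as np

def planar_flow(z, u, w, b, h=np.tanh, h_prime=lambda a: 1.0 - np.tanh(a) ** 2):
    """Apply f(z) = z + u * h(w^T z + b) and return (f(z), log|det J_f(z)|)."""
    a = w @ z + b
    fz = z + u * h(a)
    # J_f(z) = I + h'(a) u w^T, so det J_f(z) = 1 + h'(a) * (w^T u)
    # by the matrix determinant lemma.
    det = 1.0 + h_prime(a) * (w @ u)
    return fz, np.log(np.abs(det))

rng = np.random.default_rng(0)
d = 4
u, w, z = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
b = 0.5
fz, logdet = planar_flow(z, u, w, b)

# Cross-check against the determinant of the explicit d x d Jacobian.
a = w @ z + b
J = np.eye(d) + (1.0 - np.tanh(a) ** 2) * np.outer(u, w)
logdet_dense = np.log(np.abs(np.linalg.det(J)))
```

The closed form costs $O(d)$ per layer, whereas the dense determinant used for the cross-check costs $O(d^3)$.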

Figure: The output distributions transformed from different source distributions with different numbers of planar flow layers ($h = \tanh$). Figure from [Rezende and Mohamed, 2015].

Problem Statement

Suppose $f$ is composed of $T$ planar flows: $f = f_T \circ \cdots \circ f_1$. Let $q$ be the source (input) distribution and $p$ be the target distribution.

- Q1 (exact transformation): when does $f$ satisfy $p = f\#q$ (a.e.)?
- Q2 (approximation): given $\epsilon > 0$, is there a bound on $T$ such that $\|f\#q - p\|_1 \le \epsilon$?

Challenge

Suppose $\mathcal{F}$ is a function class and $\mathcal{I} = \{\text{all invertible functions}\}$.

- $\mathcal{F}$ is a universal approximator $\nRightarrow$ $\mathcal{F} \cap \mathcal{I}$ can transform between arbitrary distributions.
- $\mathcal{F}$ has limited expressivity $\nRightarrow$ $\mathcal{F} \cap \mathcal{I}$ is not a universal approximator in transforming distributions.

Therefore, the expressive power of $\mathcal{F}$ in the function space does not indicate the expressive power of $\mathcal{F} \cap \mathcal{I}$ in transforming distributions.

Our technique: directly look at input-output distribution pairs.

Results for the d = 1 case: universal approximation

Theorem (universal approximation for the ReLU non-linearity)
Let $p$ be a density on $\mathbb{R}$ supported on a finite union of intervals. Then, for any $\epsilon > 0$, there exists a flow $f$ composed of finitely many ReLU planar flows and a Gaussian distribution $q_N$ such that $\|f\#q_N - p\|_1 \le \epsilon$.

Figure: A tail-consistent piecewise Gaussian distribution of 3 pieces.

Sketch of proof.
First, we show by construction that if $f$ is a normalizing flow composed of $n - 1$ ReLU planar flows, then $f\#q_N$ can express any tail-consistent piecewise Gaussian distribution. Then, we use tail-consistent piecewise Gaussian distributions to approximate piecewise constant distributions.
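To make the first step of the sketch concrete, the following example pushes a standard Gaussian through one ReLU planar flow in $d = 1$. The resulting density is Gaussian on each side of the kink. It is an illustration only, not the construction used in the proof. It assumes $w > 0$ and $1 + uw > 0$ so that $f$ is increasing, and the function name and parameter values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def relu_flow_pushforward_pdf(x, u, w, b):
    """Density of f#q at x for f(z) = z + u * max(w*z + b, 0), q = N(0, 1).

    Assumes w > 0 and 1 + u*w > 0, so f is increasing with a kink at z0 = -b/w.
    """
    assert w > 0 and 1.0 + u * w > 0
    z0 = -b / w                    # f(z0) = z0, so the kink stays in place
    slope = 1.0 + u * w            # derivative of f to the right of the kink
    x = np.asarray(x, dtype=float)
    z_right = (x - u * b) / slope  # inverse of the affine piece for x > z0
    return np.where(x <= z0, gaussian_pdf(x), gaussian_pdf(z_right) / slope)

xs = np.linspace(-12.0, 12.0, 240001)
dx = xs[1] - xs[0]
pdf = relu_flow_pushforward_pdf(xs, u=1.0, w=1.0, b=0.0)
total_mass = np.sum(pdf) * dx                               # Riemann sum, close to 1
l1_to_source = np.sum(np.abs(pdf - gaussian_pdf(xs))) * dx  # ||f#q - q||_1 on the grid
```

The same Riemann-sum pattern gives a numerical estimate of $\|f\#q_N - p\|_1$ for any target density $p$ that can be evaluated on the grid.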

Figure: Target distribution $p$, its piecewise constant distribution approximation $q_{\mathrm{pwc}}$ of 50 (left)/300 (right) pieces, and its tail-consistent piecewise Gaussian distribution approximation $q_{\mathrm{pwg}}$ generated by 50 (left)/300 (right) ReLU planar flows over a Gaussian.

Results for high-d exact transformation: topology matching

Suppose distribution $q$ is defined on $\mathbb{R}^d$.

Theorem (planar flows with h = ReLU)
Suppose $f$ is composed of finitely many ReLU planar flows. Let $p = f\#q$. Then, there exists a zero-measure closed set $\Omega \subset \mathbb{R}^d$ such that $\forall z \in \mathbb{R}^d \setminus \Omega$, we have $J_f(z)^\top \nabla_z \log p(f(z)) = \nabla_z \log q(z)$.
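One way to see why an identity of this form is plausible is the short derivation below. It is a sketch, not the authors' proof. It assumes that $\Omega$ can be taken as the set of points where some ReLU pre-activation is zero, and that $q$ and $p$ are differentiable at the points involved.

```latex
% Sketch: for z outside \Omega, f is affine near z, so \det J_f is locally constant.
\begin{align*}
\log p(f(z)) &= \log q(z) - \log\lvert\det J_f(z)\rvert
  && \text{(change of variables)} \\
\nabla_z\bigl[\log p(f(z))\bigr] &= J_f(z)^\top (\nabla \log p)(f(z))
  && \text{(chain rule)} \\
\nabla_z \log\lvert\det J_f(z)\rvert &= 0
  && \text{(locally constant Jacobian)} \\
\Longrightarrow\quad J_f(z)^\top (\nabla \log p)(f(z)) &= \nabla_z \log q(z).
\end{align*}
```

In particular, since $J_f(z)$ is invertible, a critical point of $q$ outside $\Omega$ is mapped to a critical point of $p$, which matches the "topology matching" reading of the theorem.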

Figure: The surface plot of $q$ (left), a mixture of Gaussians with 4 peaks located at $(\pm 1, \pm 1)$, and $p = f\#q$ (right), the transformed distribution of $q$. The red points correspond to peaks of $q$ and are mapped to peaks of $p$.

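Corollary (MoG $\nrightarrow$ MoG, Prod $\nrightarrow$ Prod)
Suppose $p, q$ are (i) mixtures of Gaussian distributions:

$$p(z) = \sum_{i=1}^{r_p} w_p^i\, \mathcal{N}(z; \mu_p^i, \Sigma_p), \qquad q(z) = \sum_{j=1}^{r_q} w_q^j\, \mathcal{N}(z; \mu_q^j, \Sigma_q)$$

or (ii) product distributions:

$$p(z) \propto \prod_{i=1}^{d} g(z_i)^{r_p}; \qquad q(z) \propto \prod_{i=1}^{d} g(z_i)^{r_q}, \qquad r_p > 0,\ r_q > 0,\ r_p \ne r_q$$

where $g$ is a smooth function. Then, there generally does not exist a flow $f$ composed of finitely many ReLU planar flows such that $p = f\#q$.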

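Theorem (planar flows with general smooth h)
Suppose $f = f_{\mathrm{pf}}^{(n)} \circ \cdots \circ f_{\mathrm{pf}}^{(1)}$ where $f_{\mathrm{pf}}^{(i)}(z) = z + u_i\, h(w_i^\top z + b_i)$. Let $p = f\#q$. Then $\forall z \in \mathbb{R}^d$, we have $\nabla_z \log p(f(z)) - \nabla_z \log q(z) \in \mathrm{span}\{w_1, \cdots, w_n\}$.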
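The span condition can be checked numerically. The sketch below reads $\nabla_z \log p(f(z))$ as the gradient of the composite map $z \mapsto \log p(f(z))$, which is an interpretation of the notation rather than something the poster states explicitly. It also assumes $h = \tanh$, a standard Gaussian source $q$, and hypothetical random parameters with $w_i^\top u_i \ge 0$ so that every layer has a positive Jacobian determinant.

```python
import numpy as np

def log_q(z):
    """Source log-density: standard Gaussian on R^d (an illustrative choice)."""
    return -0.5 * (z @ z) - 0.5 * z.shape[0] * np.log(2.0 * np.pi)

def log_p_of_f(z, params):
    """log p(f(z)) = log q(z) - sum_i log|det J_{f_i}(z_{i-1})| for tanh planar flows."""
    total = log_q(z)
    for u, w, b in params:
        a = w @ z + b
        total -= np.log(np.abs(1.0 + (1.0 - np.tanh(a) ** 2) * (w @ u)))
        z = z + u * np.tanh(a)
    return total

def num_grad(fn, z, eps=1e-5):
    """Central finite-difference gradient of a scalar function fn at z."""
    g = np.zeros_like(z)
    for k in range(z.shape[0]):
        e = np.zeros_like(z)
        e[k] = eps
        g[k] = (fn(z + e) - fn(z - e)) / (2.0 * eps)
    return g

def random_layer(rng, d):
    w, u = rng.normal(size=d), 0.5 * rng.normal(size=d)
    if w @ u < 0:  # flip u so that w^T u >= 0, hence det J >= 1 > 0
        u = -u
    return u, w, rng.normal()

rng = np.random.default_rng(1)
d, n = 5, 2
params = [random_layer(rng, d) for _ in range(n)]
z = rng.normal(size=d)

diff = num_grad(lambda x: log_p_of_f(x, params), z) - num_grad(log_q, z)
W = np.stack([w for _, w, _ in params], axis=1)  # d x n matrix whose columns are the w_i
coeffs, *_ = np.linalg.lstsq(W, diff, rcond=None)
residual = np.linalg.norm(diff - W @ coeffs)     # near 0 if diff lies in span{w_i}
```

With $n = 2$ layers in $d = 5$ dimensions the span is at most two-dimensional, so a near-zero `residual` is a non-trivial constraint on how the flow can reshape the score of $q$.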

Results for high-d approximation

Let $q, p$ be the input distribution and the target distribution on $\mathbb{R}^d$.

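Definition ($\ell_1$ norm approximation lower bound)
Let $\mathcal{F}$ be a set of normalizing flows. Then for any $\epsilon > 0$, the minimum number of flows in $\mathcal{F}$ required to transform $q$ to an approximation of $p$ to within $\epsilon$ is

$$T_\epsilon(p, q, \mathcal{F}) = \inf\left\{n : \exists \{f_i\}_{i=1}^{n} \in \mathcal{F} \text{ such that } \|(f_1 \circ \cdots \circ f_n)\#q - p\|_1 \le \epsilon\right\}$$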

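Theorem ($\ell_1$ norm approximation lower bound for local planar flows)
A planar flow $f_{\mathrm{pf}} = z + u\,h(w^\top z + b)$ is called $c_h$-local if $\|u\|_2 \le 1$, $\|w\|_2 \le 1$, $|h(x)| \le c_h$, and $|h'(x)| \le c_h/(1 + |x|)$. Suppose $\mathcal{F}$ is the set of all $c_h$-local planar flows, $q$ is a random initialization, and $p$ satisfies for $\tau \in (0, 1)$:

- $p = O(p_1)$, where density $p_1(z) \propto \exp(-\|z\|_2^{\tau})$;
- $\|\nabla p\|_2 = O(\|\nabla p_2\|_2)$, where density

  $$p_2(z) \propto \begin{cases} \exp(-d) & \|z\|_2 \le d^{\frac{1}{\tau}} \\ \exp(-\|z\|_2^{\tau}) & \|z\|_2 > d^{\frac{1}{\tau}} \end{cases}$$

Then $\exists\, \epsilon = \Theta(1)$ such that

$$T_\epsilon(p, q, \mathcal{F}) = \Omega\left(\min\left((\log d)^{-\frac{1}{\tau}}\, d^{\left(\frac{1}{\tau} - \frac{1}{2}\right)},\ d^{\left(\frac{1}{\tau} - 1\right)}\right)\right).$$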

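Sketch of proof. Let $L(p, f) = \sup_{q' \text{ is a density on } \mathbb{R}^d} \|p - q'\|_1 - \|p - f\#q'\|_1$. Then,

$$L(p, f) \le \hat{L}(p, f) = \int_{\mathbb{R}^d} \bigl|\,|\det J_f(z)|\,p(f(z)) - p(z)\bigr|\,dz.$$

If we can bound $\hat{L}$, then

$$T_\epsilon(p, q, \mathcal{F}) \ge \frac{\|p - q\|_1 - \epsilon}{\sup_{f \in \mathcal{F}} L(p, f)} \ge \frac{\|p - q\|_1 - \epsilon}{\sup_{f \in \mathcal{F}} \hat{L}(p, f)} = \Omega\left(\frac{1}{\sup_{f \in \mathcal{F}} \hat{L}(p, f)}\right)$$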
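The two inequalities in the sketch can be unpacked as follows. This is a reconstruction from the definitions of $L$ and of the surrogate $\hat{L}$ (the integral expression), not the authors' written proof. It assumes each $f$ is invertible and differentiable almost everywhere, and the names $q_0, \dots, q_n$ for the intermediate distributions are introduced here for illustration.

```latex
% (1) L <= \hat{L}: pushforward by an invertible f preserves the L1 distance.
\begin{align*}
\|p - q'\|_1 = \|f\#p - f\#q'\|_1
  &\le \|f\#p - p\|_1 + \|p - f\#q'\|_1, \\
\|f\#p - p\|_1 = \|p - (f^{-1})\#p\|_1
  &= \int_{\mathbb{R}^d} \bigl|\,|\det J_f(z)|\,p(f(z)) - p(z)\bigr|\,dz = \hat{L}(p, f).
\end{align*}
% (2) Counting: apply the n flows one at a time, with q_0 = q and q_i the
% distribution after i flows. Each flow lowers the distance to p by at most
% \sup_f L(p, f), so if \|p - q_n\|_1 \le \epsilon then
\begin{align*}
\|p - q\|_1 - \epsilon
  \le \sum_{i=1}^{n} \bigl(\|p - q_{i-1}\|_1 - \|p - q_i\|_1\bigr)
  \le n \sup_{f \in \mathcal{F}} L(p, f).
\end{align*}
```

Rearranging the last line gives the first inequality of the displayed bound, and step (1) gives the second.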

Figure: Examples of $c_h$-local non-linearities: tanh ($c_h = 2$), sigmoid ($c_h = 1$), and arctan ($c_h = \frac{\pi}{2}$). The two panels plot $\tanh(x)$, $\mathrm{Sigmoid}(x)$, and $\tan^{-1}(x)$, and their derivatives together with $\frac{1}{1+x}$.
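The constants in the caption can be sanity-checked against the two conditions on $h$ in the definition of $c_h$-local, $|h(x)| \le c_h$ and $|h'(x)| \le c_h/(1 + |x|)$. The sketch below tests them on a finite grid, which is a numerical check rather than a proof; the helper name `is_ch_local` and the grid range are illustrative choices.

```python
import numpy as np

def is_ch_local(h, h_prime, c_h, xs):
    """Check |h(x)| <= c_h and |h'(x)| <= c_h / (1 + |x|) on the grid xs."""
    bounded = np.all(np.abs(h(xs)) <= c_h)
    decaying = np.all(np.abs(h_prime(xs)) <= c_h / (1.0 + np.abs(xs)))
    return bool(bounded and decaying)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-50.0, 50.0, 100001)
checks = {
    "tanh": is_ch_local(np.tanh, lambda x: 1.0 - np.tanh(x) ** 2, 2.0, xs),
    "sigmoid": is_ch_local(sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x)), 1.0, xs),
    "arctan": is_ch_local(np.arctan, lambda x: 1.0 / (1.0 + x ** 2), np.pi / 2, xs),
}
```

An unbounded non-linearity such as ReLU fails the first condition for any fixed $c_h$ once the grid is wide enough, which is consistent with the lower bound being stated only for local planar flows.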

Reference

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.