CS 480/680 Lecture 23: July 24, 2019
Normalizing Flows
CS 480/680, Guest Lecture by Priyank Jaini, University of Waterloo, Spring 2019
[GBC] Sec. 20.10.7
Complementary reading:
• Sum-of-Squares Polynomial Flows, ICML 2019
• Tutorial on Normalizing Flows, Eric Jang
density estimation
data = {x1, x2, …, xn}; estimate q(x)

many applications…
• importance sampling (Jaini et al., Online Bayesian Transfer Learning for Sequential Data Modeling, ICLR 2017)
• image & audio synthesis (van den Oord et al., Pixel Recurrent Neural Networks, ICML 2016; van den Oord et al., WaveNet: A Generative Model for Raw Audio, SSW 2016; Kingma et al., Glow: Generative Flow with Invertible 1x1 Convolutions, NeurIPS 2018)
• Bayesian inference (Moselhy, T. and Marzouk, Y., Bayesian Inference with Optimal Maps, JCP 2012)
• conditional generative models
GANs and VAEs
[Figure: GAN — a generator Gφ maps z ∼ p(z) to synthetic samples qsynth, and a discriminator Dθ labels samples from qreal as Real (1) or Fake (0). VAE — an encoder fθ maps x to z ∼ p(z), and a decoder gφ maps z back to x.]
Both give only an implicit representation of density functions.
agenda
explicit representation of density functions:
find a deterministic map T from a source distribution p(z) to a target distribution q(x)
conservation of probability mass
[Figure: p(z) uniform on [0, 1] with height 1; q(x) uniform with height 1/3. A slab of mass p(z)dz under the source maps to a slab q(x)dx under the target.]
p(z) dz = q(x) dx  ⟹  q(x) = p(z) |dz/dx|
(Eric Jang, Tutorial on Normalizing Flows)
[Figure: x := T(z) = 3z + 1 transforms a uniform random variable on [0, 1] (density 1) into another uniform random variable on [1, 4] (density 1/3).]
change of variables
univariate
T : 𝖹 → 𝖷
𝖹 : source random variable with density p(z)
𝖷 : target random variable with density q(x)
q(x) = p(z) |∂T(z)/∂z|⁻¹

multivariate
T : ℝᵈ → ℝᵈ
q(x) = p(z) |det(∇zT(z))|⁻¹
(Eric Jang, Tutorial on Normalizing Flows)
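To make the formula concrete, here is a minimal NumPy check (my own sketch, not from the lecture) that the affine map x = T(z) = 3z + 1 from the earlier slide turns the uniform density on [0, 1] into the uniform density of height 1/3 on [1, 4]:

```python
import numpy as np

def p(z):
    # source density: uniform on [0, 1]
    return np.where((z >= 0) & (z <= 1), 1.0, 0.0)

def T(z):
    return 3.0 * z + 1.0

def T_inv(x):
    return (x - 1.0) / 3.0

def q(x):
    # q(x) = p(T^{-1}(x)) * |dT/dz|^{-1}; here dT/dz = 3 everywhere
    return p(T_inv(x)) / 3.0

xs = np.linspace(1.0, 4.0, 5)
print(q(xs))  # ~1/3 on [1, 4], matching the uniform target density
```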
recipe for learning
Given: dataset 𝒟 := {x1, x2, x3, …, xn} ∼ q(x); learn the density q(x).
• choose a simple source density p(z)
• use maximum likelihood:

∏ᵢ₌₁ⁿ q(xᵢ) = ∏ᵢ₌₁ⁿ p(zᵢ) |det(∇zT(zᵢ))|⁻¹

T := arg maxT ∏ᵢ₌₁ⁿ p(zᵢ) |det(∇zT(zᵢ))|⁻¹
T := arg maxT ∑ᵢ₌₁ⁿ [log p(zᵢ) − log |det(∇zT(zᵢ))|]

But… what are the zᵢ? Since zᵢ = T⁻¹(xᵢ), we need the inverse T⁻¹, and computing the determinant is computationally expensive.
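As an illustration of this recipe, the sketch below evaluates the negative log-likelihood for an assumed one-dimensional affine flow x = a·z + b with a standard-normal source; the parametrization and the grid search are illustrative stand-ins for gradient-based training:

```python
import numpy as np

def neg_log_likelihood(params, x):
    log_a, b = params                    # parametrize a = exp(log_a) > 0
    a = np.exp(log_a)
    z = (x - b) / a                      # z_i = T^{-1}(x_i)
    log_p_z = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    log_det = np.log(a)                  # |det grad T| = a for this map
    return -(log_p_z - log_det).sum()    # -sum_i [log p(z_i) - log|det|]

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)
# crude grid search stands in for gradient-based optimization
best = min((neg_log_likelihood((la, b), x), la, b)
           for la in np.linspace(-2, 1, 31)
           for b in np.linspace(0, 4, 41))
print(best)  # recovers log a ~ log 0.5 and b ~ 2
```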
triangular maps
computation of the inverse and the Jacobian must be cheap

increasing triangular maps
x1 = T1(z1)
x2 = T2(z1, z2)
x3 = T3(z1, z2, z3)
⋮
xd = Td(z1, z2, z3, …, zd)

∇zT = [ ∂T1/∂z1     0        …     0
        ∂T2/∂z1   ∂T2/∂z2    …     0
           ⋮         ⋮        ⋱     ⋮
        ∂Td/∂z1   ∂Td/∂z2    …  ∂Td/∂zd ]

triangular maps: the inverse and the Jacobian are easy to compute
T : ℝᵈ → ℝᵈ
triangular: Tj is a function of z1, z2, …, zj only
increasing: Tj is increasing w.r.t. zj, i.e. ∂Tj/∂zj > 0
Bogachev, V. et al., Triangular Transformations of Measures, Sbornik: Mathematics, 2005
Theorem (paraphrased): there always exists a unique* increasing triangular map that transforms a given source density into a given target density.
* for a fixed ordering of the variables
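A toy two-dimensional example (my own, not from the slides) of why triangular maps keep both operations cheap: the Jacobian determinant is just the product of the diagonal entries, and the inverse is computed one coordinate at a time by forward substitution:

```python
import numpy as np

def T(z):
    # increasing triangular map: x1 depends on z1; x2 on (z1, z2)
    x1 = 2.0 * z[0] + 1.0
    x2 = np.exp(z[0]) * z[1] + z[0] ** 2
    return np.array([x1, x2])

def log_det_jacobian(z):
    # det of a triangular Jacobian = product of its diagonal:
    # dT1/dz1 = 2, dT2/dz2 = exp(z1)
    return np.log(2.0) + z[0]

def T_inv(x):
    # forward substitution: solve for z1 first, then z2 given z1
    z1 = (x[0] - 1.0) / 2.0
    z2 = (x[1] - z1 ** 2) / np.exp(z1)
    return np.array([z1, z2])

z = np.array([0.3, -1.2])
assert np.allclose(T_inv(T(z)), z)   # O(d) inversion, no matrix solve
```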
framework
[Figure: z ∼ p(z) is pushed through T = (T1, …, Td), with each Tj producing xj, so that x ∼ q(x).]

learn T by maximizing likelihood:
minT ∑ᵢ₌₁ⁿ [ −log p(T⁻¹(xᵢ)) + ∑ⱼ log ∂ⱼTⱼ(T⁻¹(xᵢ)) ]

Marzouk, Y. et al., Sampling via Measure Transport: An Introduction, Springer, 2016
flow models as triangular maps: study the commonalities & differences of flow-based methods
autoregressive models
q(x) = q1(x1) · q2(x2 | x1) · … · qd(xd | x<d)
[Figure: each conditional qj(xj | x<j) corresponds to a map Tj transforming zj into xj.]
choosing a conditional implicitly fixes a family of triangular maps:
xj = Tj(zj ; θj(z<j))
Larochelle, H. and Murray, I., The Neural Autoregressive Distribution Estimator, AISTATS, 2011; Uria et al., Neural Autoregressive Distribution Estimation, JMLR, 2016
AR with Gaussian conditionals
x1 = σ1 · z1 + μ1 =: T1(z1)
x2 = σ2(z1) · z2 + μ2(z1) =: T2(z2 ; z1)
⋮
xd = σd(z<d) · zd + μd(z<d) =: Td(zd ; z<d)

q(x) = q1(x1) · q2(x2 | x1) · … · qd(xd | x<d), with qj = 𝒩(μj, σj²)

Kingma et al., Improved Variational Inference with Inverse Autoregressive Flow, NeurIPS, 2016
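A minimal sketch of one forward pass through such a map; the conditioners mu and sigma below are toy stand-ins for the learned networks:

```python
import numpy as np

def mu(z_prev):
    return 0.5 * z_prev.sum()            # toy conditioner mu_j(z_<j)

def sigma(z_prev):
    return np.exp(0.1 * z_prev.sum())    # positive scale sigma_j(z_<j)

def ar_forward(z):
    # x_j = sigma_j(z_<j) * z_j + mu_j(z_<j), computed sequentially
    x = np.empty_like(z)
    log_det = 0.0
    for j in range(len(z)):
        s, m = sigma(z[:j]), mu(z[:j])
        x[j] = s * z[j] + m
        log_det += np.log(s)             # diagonal of the triangular Jacobian
    return x, log_det

x, ld = ar_forward(np.array([0.1, -0.4, 0.9]))
print(x, ld)
```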
masked autoregressive flows (MAFs)
deep autoregressive flows with Gaussian conditionals* (* can also consider a mixture of Gaussians)
xj = zj · exp(αj(z<j)) + μj(z<j) =: Tj(zj ; z<j)
[Figure: z ∼ p(z) passes through a stack of AR blocks T(1), …, T(4) with permutations P in between, so that q(x) = p(z) · |∇T(1)|⁻¹ · |∇T(2)|⁻¹ · |∇T(3)|⁻¹ · |∇T(4)|⁻¹.]
triangular maps are fundamental building blocks for complex models
Papamakarios et al., Masked Autoregressive Flow for Density Estimation, NeurIPS, 2017
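A sketch of the stacking idea from the diagram, assuming toy affine AR layers and random permutations between them; the log-determinants of the layers simply accumulate, and the permutations contribute nothing since their determinant is ±1:

```python
import numpy as np

def affine_ar_layer(z, scale_fn, shift_fn):
    x, log_det = np.empty_like(z), 0.0
    for j in range(len(z)):
        a = scale_fn(z[:j])              # alpha_j(z_<j)
        x[j] = z[j] * np.exp(a) + shift_fn(z[:j])
        log_det += a
    return x, log_det

def stack(z, n_layers=4):
    rng = np.random.default_rng(0)
    total_log_det = 0.0
    for _ in range(n_layers):
        z, ld = affine_ar_layer(z,
                                lambda zp: 0.1 * zp.sum(),
                                lambda zp: 0.2 * zp.sum())
        total_log_det += ld
        z = z[rng.permutation(len(z))]   # permutation P between layers
    return z, total_log_det              # log q(x) = log p(z) - total_log_det

x, ld = stack(np.array([0.5, -1.0, 0.3]))
print(x, ld)
```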
real-NVP
Tj(zj ; z<l) = exp(αj(z<l) · 1{j ∉ [l−1]}) · zj + μj(z<l) · 1{j ∉ [l−1]}
[Figure: z is split into blocks {z1, …, z_{l−1}} and {z_l, …, z_d}; the first block is copied to {x1, …, x_{l−1}} unchanged, while {x_l, …, x_d} is an affine transform of the second block with parameters computed from the first.]
Dinh et al., Density estimation using Real NVP, ICLR, 2017
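A minimal sketch of such a coupling layer, with linear maps A and B standing in for the coupling networks α and μ; note that inversion reuses the same networks in the forward direction, which is what makes real-NVP fast both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
d, l = 4, 2
A, B = rng.normal(size=(d - l, l)), rng.normal(size=(d - l, l))

def coupling_forward(z):
    z1, z2 = z[:l], z[l:]                          # split at index l
    alpha, mu = A @ z1, B @ z1                     # toy coupling "networks"
    x2 = np.exp(alpha) * z2 + mu                   # affine map of second block
    return np.concatenate([z1, x2]), alpha.sum()   # log|det| = sum(alpha)

def coupling_inverse(x):
    x1, x2 = x[:l], x[l:]
    alpha, mu = A @ x1, B @ x1                     # same nets, since x1 = z1
    return np.concatenate([x1, (x2 - mu) * np.exp(-alpha)])

z = rng.normal(size=d)
x, log_det = coupling_forward(z)
assert np.allclose(coupling_inverse(x), z)         # exact, cheap inversion
```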
neural autoregressive flows (NAFs)
xj = DNN(zj ; wj(z<j)) =: Tj(zj ; z<j)
[Figure: for each coordinate, a conditioner C produces the weights wj of a small network from z1, …, z_{j−1}; that network maps zj to xj.]
Strictly positive weights & strictly monotonic activation functions ensure that the map is increasing.
Universal.
Huang et al., Neural Autoregressive Flows, ICML, 2018
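A tiny illustrative network (not the paper's architecture) showing why positive weights plus strictly monotonic activations yield a map that is increasing in zj:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def monotone_net(z_j, w):
    # exp(.) makes every multiplicative weight strictly positive, and
    # sigmoid is strictly increasing, so the composition is increasing in z_j
    h = sigmoid(np.exp(w[0]) * z_j + w[1])
    return np.exp(w[2]) * h + w[3]

w = np.array([0.3, -0.5, 1.0, 0.2])
zs = np.linspace(-3, 3, 100)
ys = monotone_net(zs, w)
assert np.all(np.diff(ys) > 0)           # strictly increasing in z_j
```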
sum-of-squares polynomial flows
goal: learn a universal increasing triangular function
increasing functions ≈ primitives of non-negative polynomials = primitives of sums of squares of polynomials

𝔓r(z; a) = ∑ₗ₌₀ʳ aₗ zˡ

[Figure: a conditioner C maps z1, …, z_{d−1} to coefficients {a′1, a′2, a′0}; the derivative ∂Td/∂zd = 𝔓r²(zd; a′1) + 𝔓r²(zd; a′2) is integrated and shifted by a′0 to give xd (likewise ∂T1/∂z1 = 𝔓r²(z1; a1) + 𝔓r²(z1; a2) with constant a0 for x1).]

Jaini et al., Sum-of-Squares Polynomial Flow, ICML, 2019
SOS flows
• Interpretable: coefficients control higher-order moments
• Easier to train: no constraints on the coefficients
• Universal: can approximate any density given enough capacity
• Applicable to stochastic simulation
• Generalizes other flows (e.g. IAF, MAF, Real-NVP, etc.)
Jaini et al., Sum-of-Squares Polynomial Flow, ICML, 2019
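A sketch of a single SOS-flow coordinate under these definitions, with toy coefficients: the derivative is a sum of squared polynomials, hence non-negative, and T is its primitive plus a constant (numpy.polynomial does the algebra):

```python
import numpy as np
from numpy.polynomial import polynomial as P

a1, a2, a0 = [0.5, -0.3, 0.8], [0.2, 0.7], 1.0   # toy coefficients of degree r

def T(z):
    # dT/dz = P_r(z; a1)^2 + P_r(z; a2)^2 >= 0 by construction
    deriv = P.polyadd(P.polymul(a1, a1), P.polymul(a2, a2))
    antideriv = P.polyint(deriv)                 # primitive of the derivative
    return a0 + P.polyval(z, antideriv)          # integrate and shift by a0

zs = np.linspace(-2, 2, 100)
assert np.all(np.diff(T(zs)) > 0)                # increasing by construction
```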
Glow: invertible 1x1 convolutions
[Figure: z → x; a learned, data-dependent 1x1 convolution replaces a fixed random rotation matrix for mixing dimensions between layers.]
Kingma, D.P. and Dhariwal, P., Glow: Generative Flow with Invertible 1x1 Convolutions, NeurIPS, 2018
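A sketch of the operation, assuming input of shape (channels, height, width): every spatial position's channel vector is multiplied by the same matrix W, so the layer's log-determinant is H·W copies of log |det W|:

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 3, 8, 8
W = np.linalg.qr(rng.normal(size=(c, c)))[0]     # initialize as a rotation

def conv1x1(x):
    # x has shape (c, h, w); contract the channel axis with W
    y = np.einsum('ij,jhw->ihw', W, x)
    log_det = h * w * np.log(abs(np.linalg.det(W)))
    return y, log_det

x = rng.normal(size=(c, h, w))
y, ld = conv1x1(x)
x_rec = np.einsum('ij,jhw->ihw', np.linalg.inv(W), y)
assert np.allclose(x_rec, x)                     # invertible by construction
```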