
Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders

Emile Mathieu†, Charline Le Lan†, Chris J. Maddison†∗, Ryota Tomioka‡ and Yee Whye Teh†∗
† Department of Statistics, University of Oxford, United Kingdom; ∗ DeepMind, London, United Kingdom; ‡ Microsoft Research, Cambridge, United Kingdom


Overview & Contributions

Many real datasets are hierarchically structured. However, traditional variational auto-encoders (VAEs) [1, 2] map data into a Euclidean latent space, which cannot efficiently embed tree-like structures. Hyperbolic spaces with negative curvature can [3], and their smoothness is well-suited for gradient-based approaches [4].

1. We empirically demonstrate that endowing a VAE with a Poincaré ball latent space can be beneficial in terms of model generalisation and can yield more interpretable representations.

2. We propose efficient and reparametrisable sampling schemes, and calculate the probability density functions, for two canonical Gaussian generalisations defined on the Poincaré ball, namely the maximum-entropy and wrapped normal distributions.

3. We introduce a decoder architecture taking into account the hyperbolic geometry, which we empirically show to be crucial.

The Poincaré ball model of hyperbolic geometry

The $d$-dimensional Poincaré ball with curvature $-c$ is the Riemannian manifold $\mathbb{B}^d_c = (\mathcal{B}^d_c, g^c_p)$ [5], where $\mathcal{B}^d_c = \{z \in \mathbb{R}^d \mid \|z\|_2 < 1/\sqrt{c}\}$ and $g^c_p$ is its metric tensor,

$$ g^c_p(z) = (\lambda^c_z)^2\, g^e(z) = \left(\frac{2}{1 - c\|z\|^2}\right)^2 g^e(z) \qquad (1) $$
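For concreteness, here is a minimal NumPy sketch (not the authors' implementation) of the Poincaré-ball primitives referenced throughout this poster: the conformal factor $\lambda^c_z$ from Eq. (1), Möbius addition, the exponential map $\exp^c_\mu$ and the geodesic distance $d^c_p$, using the closed forms of Ganea et al. [8].

```python
# A minimal NumPy sketch (not the authors' implementation) of Poincare-ball primitives,
# using the closed forms of Ganea et al. [8].
import numpy as np

def lambda_c(z, c):
    """Conformal factor lambda_z^c = 2 / (1 - c ||z||^2) appearing in Eq. (1)."""
    return 2.0 / (1.0 - c * np.dot(z, z))

def mobius_add(x, y, c):
    """Moebius addition x (+)_c y on the Poincare ball."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def exp_map(mu, v, c):
    """Exponential map exp_mu^c(v): maps a tangent vector v at mu onto the ball."""
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        return mu
    t = np.tanh(np.sqrt(c) * lambda_c(mu, c) * norm_v / 2.0)
    return mobius_add(mu, t * v / (np.sqrt(c) * norm_v), c)

def dist(x, y, c):
    """Geodesic distance d_p^c(x, y) = (2 / sqrt(c)) artanh(sqrt(c) ||(-x) (+)_c y||)."""
    return 2.0 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))
```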

Figure 1: (a) Isometric embedding of a tree. (b) Geodesics and exponential maps $\exp_0(v_1)$ and $\exp_{z_2}(v_2)$ on the Poincaré disc.

We are extremely grateful to Adam Foster, Phillipe Gagnon and Emmanuel Chevallier for their help. EM and YWT's research leading to these results received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071, and they acknowledge Microsoft Research and EPSRC for funding EM's studentship, and EPSRC grant agreement no. EP/N509711/1 for funding CL's studentship.

[1] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[3] Rik Sarkar. Low distortion Delaunay embedding of trees in hyperbolic plane. In Marc van Kreveld and Bettina Speckmann, editors, Graph Drawing, pages 355–366. Springer Berlin Heidelberg, 2012.
[4] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6341–6350, 2017.
[5] E. Beltrami. Teoria fondamentale degli spazii di curvatura costante: memoria. F. Zanetti, 1868.
[6] Xavier Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127, Jul 2006.
[7] Salem Said, Lionel Bombrun, and Yannick Berthoumieu. New Riemannian priors on the univariate normal model. Entropy, 16(7):4015–4031, 2014.
[8] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In International Conference on Neural Information Processing Systems, pages 5350–5360, 2018.
[9] Michael Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In International Conference on Neural Information Processing Systems, pages 439–450, 2018.
[10] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. Workshop on Bayesian Deep Learning, NIPS, 2016.

Hyperbolic normal distributions reparametrisation

Invariant measure: $d\mathcal{M}(z) = \sqrt{|G(z)|}\, dz = (\lambda^c_z)^d\, dz$, with $dz$ the Lebesgue measure.

Riemannian normal (maximum-entropy [6, 7]):
$$ \mathcal{N}^R_{\mathbb{B}^d_c}(z \mid \mu, \sigma^2) \propto \exp\!\left(-d^c_p(\mu, z)^2 / 2\sigma^2\right) $$

Wrapped normal (push-forward): $z = \exp^c_\mu\!\left(v / \lambda^c_\mu\right)$ with $v \sim \mathcal{N}(\cdot \mid 0, \Sigma)$,
$$ \mathcal{N}^W_{\mathbb{B}^d_c}(z \mid \mu, \Sigma) = \mathcal{N}\!\left(\lambda^c_\mu \log_\mu(z) \mid 0, \Sigma\right) \left(\frac{\sqrt{c}\, d^c_p(\mu, z)}{\sinh\!\left(\sqrt{c}\, d^c_p(\mu, z)\right)}\right)^{d-1} $$

Reparametrisation through the exponential map: for $z \sim \mathcal{N}_{\mathbb{B}^d_c}(z \mid \mu, \sigma^2)\, d\mathcal{M}(z)$,
$$ z = \exp^c_\mu\!\left(G(\mu)^{-\frac{1}{2}} v\right) = \exp^c_\mu\!\left(v / \lambda^c_\mu\right) \qquad (2) $$
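A hedged sketch of how the push-forward sampling of Eq. (2) and the wrapped-normal density stated above could be implemented; the function names are illustrative rather than taken from the paper's code, and the helpers `lambda_c`, `mobius_add`, `exp_map` and `dist` come from the previous snippet.

```python
# Hedged sketch of Eq. (2) sampling and the wrapped-normal density stated above.
import numpy as np

def log_map(mu, z, c):
    """Logarithm map log_mu^c(z), the inverse of exp_mu^c [8]."""
    w = mobius_add(-mu, z, c)
    norm_w = np.linalg.norm(w)
    return 2.0 / (np.sqrt(c) * lambda_c(mu, c)) * np.arctanh(np.sqrt(c) * norm_w) * w / norm_w

def sample_wrapped_normal(mu, Sigma, c, rng):
    """z = exp_mu^c(v / lambda_mu^c) with v ~ N(0, Sigma)."""
    v = rng.multivariate_normal(np.zeros(len(mu)), Sigma)
    return exp_map(mu, v / lambda_c(mu, c), c)

def log_prob_wrapped_normal(z, mu, Sigma, c):
    """log N(lambda_mu^c log_mu(z) | 0, Sigma) + (d-1)[log(sqrt(c) r) - log sinh(sqrt(c) r)]."""
    d = len(mu)
    v = lambda_c(mu, c) * log_map(mu, z, c)
    log_gauss = -0.5 * v @ np.linalg.solve(Sigma, v) \
                - 0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1])
    r = dist(mu, z, c)
    return log_gauss + (d - 1) * (np.log(np.sqrt(c) * r) - np.log(np.sinh(np.sqrt(c) * r)))
```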

Isotropic case: $v = r\,\alpha$, with direction $\alpha \sim \mathcal{U}(\mathbb{S}^{d-1})$ and hyperbolic radius $r = d^c_p(\mu, x)$ with density
$$ \rho^R(r) \propto \mathbb{1}_{\mathbb{R}_+}(r)\, e^{-\frac{r^2}{2\sigma^2}} \left(\frac{\sinh(\sqrt{c}\, r)}{\sqrt{c}}\right)^{d-1} \;\xrightarrow{\;c \to 0\;}\; \rho^W(r) \propto \mathbb{1}_{\mathbb{R}_+}(r)\, e^{-\frac{r^2}{2\sigma^2}}\, r^{d-1} \qquad (3) $$
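As a sanity check of the $c \to 0$ limit in Eq. (3), the following illustrative snippet (not from the poster) compares the unnormalised radius densities $\rho^R$ and $\rho^W$ numerically for decreasing curvature; the grid of radii and $\sigma = 1$ are arbitrary choices.

```python
# Illustrative check that rho_R in Eq. (3) approaches rho_W as c -> 0.
import numpy as np

def rho_R(r, sigma, c, d):
    return np.exp(-r ** 2 / (2 * sigma ** 2)) * (np.sinh(np.sqrt(c) * r) / np.sqrt(c)) ** (d - 1)

def rho_W(r, sigma, d):
    return np.exp(-r ** 2 / (2 * sigma ** 2)) * r ** (d - 1)

r = np.linspace(1e-3, 5.0, 500)
for c in (1.0, 0.1, 0.001):
    ratio = rho_R(r, 1.0, c, d=2) / rho_W(r, 1.0, d=2)
    print(f"c = {c:5.3f}   max |rho_R / rho_W - 1| = {np.abs(ratio - 1).max():.3f}")
```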

Figure 2: Wrapped normal probability measures for Fréchet means $\mu$, concentrations $\Sigma = \mathrm{diag}(\sigma)$ and $c = 1$; panels show $\sigma = (1.0, 1.0)$, $\sigma = (2.0, 0.5)$ and $\sigma = (0.5, 2.0)$.

Decoder & encoder architectures

Decoder: compute geodesic distances to hyperplanes (i.e. gyroplanes). In the Euclidean case,
$$ f_{a,p}(z) = \langle a, z - p \rangle = \mathrm{sign}\!\left(\langle a, z - p \rangle\right) \|a\|\, d_E(z, H_{a,p}), \quad \text{with } H_{a,p} = p + \{a\}^{\perp}, $$
and on the Poincaré ball,
$$ f^c_{a,p}(z) = \mathrm{sign}\!\left(\langle a, \log^c_p(z) \rangle_p\right) \|a\|_p\, d^c_p(z, H^c_{a,p}), \quad \text{with } H^c_{a,p} = \exp^c_p\!\left(\{a\}^{\perp}\right) \; [8]. $$
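A possible implementation of the gyroplane layer $f^c_{a,p}$ is sketched below, using the closed-form distance to $H^c_{a,p}$ from [8] and `mobius_add` / `lambda_c` from the first snippet; treating the learned parameter $a$ as a Euclidean vector identified with a tangent vector at $p$ is an assumption of this sketch.

```python
# Possible gyroplane-layer sketch, with the closed-form gyroplane distance from [8].
import numpy as np

def gyroplane_layer(z, a, p, c):
    """sign(<a, log_p^c(z)>_p) ||a||_p d_p^c(z, H^c_{a,p})."""
    w = mobius_add(-p, z, c)                       # (-p) (+)_c z
    inner = np.dot(w, a)
    dist_to_gyroplane = np.arcsinh(
        2 * np.sqrt(c) * np.abs(inner) / ((1 - c * np.dot(w, w)) * np.linalg.norm(a))
    ) / np.sqrt(c)
    norm_a_p = lambda_c(p, c) * np.linalg.norm(a)  # ||a||_p: Riemannian norm at p
    return np.sign(inner) * norm_a_p * dist_to_gyroplane
```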

Figure 3: Orthogonal projection of $z$ onto a hyperplane $H_{a,p}$, in $\mathbb{B}^2_c$ (a) and $\mathbb{R}^2$ (b).

Encoder: $\mu = \exp^c_{0}\!\left(\mathrm{enc}^\mu_\phi(x)\right) \in \mathbb{B}^d_c$ and $\sigma = \mathrm{softplus}\!\left(\mathrm{enc}^\sigma_\phi(x)\right) \in \mathbb{R}_+^*$.

Training

Model: $p_\theta(x \mid z) = p(x \mid \mathrm{dec}_\theta(z))$, $p(z) = \mathcal{N}_{\mathbb{B}^d_c}(z \mid 0, \sigma_0^2)$ and $q_\phi(z \mid x) = \mathcal{N}_{\mathbb{B}^d_c}\!\left(z \mid \mathrm{enc}^\mu_\phi(x), \mathrm{enc}^\sigma_\phi(x)^2\right)$.

ELBO (estimated via Monte Carlo):
$$ \log p(x) \geq \mathcal{L}_{\mathcal{M}}(x; \theta, \phi) \triangleq \int_{\mathcal{M}} \ln\!\left(\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right) q_\phi(z \mid x)\, d\mathcal{M}(z) \qquad (4) $$

Gradients: $\nabla_\mu z$ via reparametrisation; $\nabla_\sigma z$ via reparametrisation for the wrapped normal, and via implicit reparametrisation [9] of $\rho^R$ through its cdf $F^R(r; \sigma)$ for the Riemannian normal.
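The sketch below shows a one-sample Monte Carlo estimate of the ELBO in Eq. (4), assuming a wrapped-normal prior and posterior (the poster also considers the Riemannian normal). `enc_mu_phi`, `enc_sigma_phi` and `log_lik_theta` are placeholder callables standing in for the encoder heads and the decoder likelihood $\log p_\theta(x \mid z)$; the wrapped-normal helpers come from the earlier snippet.

```python
# One-sample Monte Carlo estimate of the ELBO in Eq. (4), assuming wrapped normals.
import numpy as np

def elbo_mc(x, enc_mu_phi, enc_sigma_phi, log_lik_theta, c, sigma0, rng):
    mu = enc_mu_phi(x)                                   # Frechet mean in B^d_c
    sigma = enc_sigma_phi(x)                             # positive scale (softplus output)
    d = len(mu)
    Sigma_q = sigma ** 2 * np.eye(d)
    z = sample_wrapped_normal(mu, Sigma_q, c, rng)       # z ~ q_phi(z | x), via Eq. (2)
    log_q = log_prob_wrapped_normal(z, mu, Sigma_q, c)
    log_p = log_prob_wrapped_normal(z, np.zeros(d), sigma0 ** 2 * np.eye(d), c)
    return log_lik_theta(x, z) + log_p - log_q           # single-sample estimate of Eq. (4)
```

In a real training loop the sample would be drawn through Eq. (2) inside an automatic-differentiation framework so that $\nabla_\mu$ and $\nabla_\sigma$ flow through it; this NumPy sketch only evaluates the objective.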

Branching diffusion process

Nodes $(x_1, \ldots, x_N) \in \mathbb{R}^n$ are hierarchically sampled as follows:
$$ x_i \sim \mathcal{N}\!\left(\cdot \mid x_{\pi(i)}, \gamma^2\right) \quad \forall i \in 1, \ldots, N, \qquad \pi(i): \text{parent of the } i\text{th node.} $$
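A small sketch of how such a dataset could be generated; the complete binary tree, depth, dimension $n$, $\gamma$ and the $\mathcal{N}(0, \gamma^2)$ root are illustrative assumptions, not values taken from the poster.

```python
# Sketch of a branching diffusion process on a complete b-ary tree (illustrative values).
import numpy as np

def branching_diffusion(depth=3, branching=2, n=50, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    num_nodes = (branching ** (depth + 1) - 1) // (branching - 1)
    nodes, parents = [rng.normal(0.0, gamma, n)], [None]      # root node x_1 (assumed N(0, gamma^2))
    for i in range(1, num_nodes):
        pa = (i - 1) // branching                             # pi(i): parent of node i
        nodes.append(rng.normal(nodes[pa], gamma))            # x_i ~ N(x_{pi(i)}, gamma^2)
        parents.append(pa)
    return np.stack(nodes), parents
```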

Table 1: Negative test marginal likelihood estimates $\mathcal{L}_{\mathrm{IWAE}}$ (with 5000 samples).

σ0     N-VAE       P0.1-VAE    P0.3-VAE    P0.8-VAE    P1.0-VAE    P1.2-VAE
1      57.1±0.2    57.1±0.2    57.2±0.2    56.9±0.2    56.7±0.2    56.6±0.2
1.7    57.0±0.2    56.8±0.2    56.6±0.2    55.9±0.2    55.7±0.2    55.6±0.2

Figure 4: Latent representations learned by P1-VAE (a) and N-VAE (b). Embeddings are represented by black crosses, coloured dots are posterior samples, and blue lines represent the true hierarchy.

MNIST digits

One can view the natural clustering in MNIST images as a hierarchy, with each of the 10 classes being internal nodes of the hierarchy.

Figure 5: Decoder ablation study with the wrapped normal P1-VAE. The baseline decoder is an MLP. We additionally compare the gyroplane layer introduced in the Decoder section to an MLP pre-composed with $\log_0$. The plot shows the change Δ in marginal log-likelihood (%) relative to the MLP baseline as a function of the latent space dimension (2, 5, 10, 20), for the $\log_0 \circ$ MLP and gyroplane decoders.

Table 2: Per-digit accuracy of a classifier trained on the 2-d latent embeddings. Results are averaged over 10 sets of embeddings and 5 classifier trainings.

Digits      0    1    2    3    4    5    6    7    8    9    Avg
N-VAE      89   97   81   75   59   43   89   78   68   57   73.6
P1.4-VAE   94   97   82   79   69   47   90   77   68   53   75.6

Graph embeddings

We evaluate the performance of a variational graph auto-encoder (VGAE) [10] for link prediction in networks (see the evaluation sketch after Table 3).
Table 3: Results on network link prediction. 95% confidence intervals are computed over 40 runs.

           Phylogenetic            CS PhDs                 Diseases
           AUC         AP          AUC         AP          AUC         AP
N-VAE      54.2±2.2    54.0±2.1    56.5±1.1    56.4±1.1    89.8±0.7    91.8±0.7
P-VAE      59.0±1.9    55.5±1.6    59.8±1.2    56.7±1.2    92.3±0.7    93.6±0.5
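The following sketch illustrates one way such link-prediction metrics could be computed from latent node embeddings, scoring candidate edges by negative geodesic distance; the scoring rule and function names are assumptions for illustration, not the VGAE decoder used in the experiments. It reuses `dist` from the first snippet.

```python
# Illustrative link-prediction evaluation: score edges by negative hyperbolic distance.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def link_prediction_metrics(z, pos_edges, neg_edges, c):
    """z: (num_nodes, d) Poincare-ball embeddings; *_edges: lists of (i, j) pairs."""
    y_true = [1] * len(pos_edges) + [0] * len(neg_edges)
    y_score = [-dist(z[i], z[j], c) for i, j in list(pos_edges) + list(neg_edges)]
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)
```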