Transcript of slides hosted at ece.duke.edu/~lcarin/Ikenna9.8.2017.pdf (Duke ECE)
Stochastic Generative Hashing
B. Dai1, R. Guo2, S. Kumar2, N. He3 and L. Song1
1Georgia Institute of Technology, 2Google Research, NYC, 3University of Illinois at Urbana-Champaign
Discussion by Ikenna Odinaka, September 8, 2017
Outline
1. Introduction
2. Stochastic Generative Hashing
3. Distributional SGD
4. Experiments
Binary Hashing
Represent real-valued signals $x \in \mathbb{R}^d$ using binary vectors (hash codes) $h \in \{0, 1\}^l$.

- Faster search and retrieval, e.g., L2 Nearest Neighbor Search (L2NNS)
- Reduces time and storage requirements
- Search with binary vectors based on Hamming distance can be done efficiently
- Data-dependent versus randomized hashing
- Previous approaches to data-dependent hashing have two major shortcomings:
  - Heuristically constructed objective functions
  - Binary constraints crudely handled by relaxation, leading to subpar results
Stochastic Generative Hashing tackles both problems
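The efficiency claim above can be made concrete: with packed binary codes, Hamming distance reduces to XOR plus popcount. A minimal sketch with illustrative random codes (not learned hash codes); all sizes are assumptions:

```python
import numpy as np

# Sketch of Hamming-distance search over binary codes: pack the bits and
# compute XOR + popcount, which is much cheaper than float L2 distance.
# Database size, code length, and codes are illustrative random values.

rng = np.random.default_rng(0)
n, l = 10000, 64                           # database size, bits per code
db_codes = rng.integers(0, 2, size=(n, l), dtype=np.uint8)
query = rng.integers(0, 2, size=l, dtype=np.uint8)

db_packed = np.packbits(db_codes, axis=1)  # (n, l // 8) bytes
q_packed = np.packbits(query)              # (l // 8,) bytes

# Hamming distance = number of set bits in the XOR of the packed codes.
xor = np.bitwise_xor(db_packed, q_packed)
dists = np.unpackbits(xor, axis=1).sum(axis=1)

nearest = int(np.argmin(dists))            # best match under Hamming distance
```

In practice the popcount runs on 64-bit words with hardware instructions; the `unpackbits` route here is only for clarity.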
Generative Model p(x |h)
$p(x|h)$ can be chosen based on the problem domain; a simple Gaussian distribution for $p(x|h)$ achieves state-of-the-art performance. The joint distribution is given by

$$p(x, h) = p(x|h)\,p(h), \qquad (1)$$

where $p(x|h) = \mathcal{N}(Uh, \rho^2 I)$, $p(h) \sim \mathcal{B}(\theta) = \prod_{i=1}^{l} \theta_i^{h_i} (1 - \theta_i)^{1 - h_i}$, $U = [u_i]_{i=1}^{l}$ with $u_i \in \mathbb{R}^d$ is a codebook with $l$ codewords, and $\theta = [\theta_i]_{i=1}^{l} \in [0, 1]^l$.

The joint distribution can be written as

$$p(x, h) \propto \exp\left( -\frac{1}{2\rho^2} \|x - Uh\|_2^2 + \left( \log \frac{\theta}{1 - \theta} \right)^T h \right). \qquad (2)$$
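Equations (1) and (2) define an ancestral sampling procedure: draw $h$ from the Bernoulli prior, then $x$ from a Gaussian around the codeword combination $Uh$. A minimal sketch with illustrative toy dimensions and parameter values:

```python
import numpy as np

# Sample (h, x) from the generative model p(x, h) = p(x|h) p(h).
# U, theta, rho, and the dimensions are illustrative assumptions.

rng = np.random.default_rng(0)
d, l = 16, 8                            # data dimension, code length
U = rng.normal(size=(d, l))             # codebook: l codewords in R^d
theta = np.full(l, 0.5)                 # Bernoulli prior parameters
rho = 0.1                               # Gaussian noise scale

h = (rng.random(l) < theta).astype(float)   # h ~ B(theta)
x = U @ h + rho * rng.normal(size=d)        # x | h ~ N(Uh, rho^2 I)
```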
Encoding Model q(h|x)
Computing the posterior $p(h|x)$ is intractable, and getting the MAP solution from $p(h|x)$ requires integer programming. Approximate $p(h|x)$ using $q(h|x)$ as the encoder:

$$q(h|x) = \prod_{k=1}^{l} q(h_k = 1|x)^{h_k}\, q(h_k = 0|x)^{1 - h_k} \qquad (3)$$

Use the linear parametrization $h = [h_k]_{k=1}^{l} \sim \mathcal{B}(\sigma(W^T x))$, where $W = [w_k]_{k=1}^{l} \in \mathbb{R}^{d \times l}$.

The MAP solution of $q(h|x)$ is

$$h(x) = \operatorname*{argmax}_h q(h|x) = \frac{\operatorname{sign}(W^T x) + 1}{2}$$
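The MAP encoder is just a linear map followed by a sign. A sketch with assumed random $W$ and $x$:

```python
import numpy as np

# MAP hash code h(x) = (sign(W^T x) + 1) / 2: bit k is 1 exactly when
# w_k^T x > 0. W and x here are illustrative random values.

rng = np.random.default_rng(0)
d, l = 16, 8
W = rng.normal(size=(d, l))
x = rng.normal(size=d)

def hash_code(x, W):
    return (np.sign(W.T @ x) + 1) / 2

h = hash_code(x, W)                     # binary vector of length l
```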
Training Objective
The objective function is based on the Minimal Description Length (MDL) principle, which tries to minimize the expected amount of information needed to communicate $x$:

$$L(x) = \sum_h q(h|x)\left( L(h) + L(x|h) \right),$$

where $L(h) = -\log p(h) + \log q(h|x)$ is the description length of the hash code $h$, and $L(x|h) = -\log p(x|h)$ is the description length of $x$ with $h$ already known.

For a given bit size $l$, the MDL principle finds the best parameters to maximally compress the training data. The training objective is

$$\min_{\Theta = \{W, U, \beta, \rho\}} H(\Theta) := \sum_x H(\Theta; x) = -\sum_x \sum_h q(h|x)\left( \log p(x, h) - \log q(h|x) \right), \qquad (4)$$

where $\beta := \log \frac{\theta}{1 - \theta}$. $W$ comes from $q(h|x)$, while $U$, $\beta$, $\rho$ come from $p(x, h)$.

The objective function is also called the Helmholtz (variational) free energy.
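Equation (4) can be estimated per sample by Monte Carlo: draw $h \sim q(h|x)$ and average $-\log p(x,h) + \log q(h|x)$. A sketch under the Gaussian model, dropping additive constants of $\log p$; all parameter values are illustrative:

```python
import numpy as np

# Monte Carlo estimate of the per-sample free energy in equation (4):
# draw h ~ q(h|x) and average -log p(x, h) + log q(h|x). The Gaussian
# model's constant terms are dropped, and all values are illustrative.

rng = np.random.default_rng(0)
d, l, rho = 16, 8, 1.0
U = rng.normal(size=(d, l))
W = rng.normal(size=(d, l))
beta = np.zeros(l)                      # beta = log(theta / (1 - theta))
x = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(x, n_samples=1000):
    q = sigmoid(W.T @ x)                # q(h_k = 1 | x)
    total = 0.0
    for _ in range(n_samples):
        h = (rng.random(l) < q).astype(float)            # h ~ q(h|x)
        log_p = -np.sum((x - U @ h) ** 2) / (2 * rho**2) + beta @ h
        log_q = np.sum(h * np.log(q) + (1 - h) * np.log(1 - q))
        total += -log_p + log_q
    return total / n_samples

H_x = free_energy(x)                    # estimate of H(Theta; x)
```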
Reparametrization via Stochastic Neuron
Computing gradients w.r.t. $W$ requires back-propagating through stochastic nodes of the binary variable $h$. REINFORCE can be used, but it suffers from high variance, and REINFORCE with variance-reduction techniques is either biased or expensive to compute. Instead, reparametrize the Bernoulli distribution using a stochastic neuron: $h_k(z)$ is reparametrized with $z \in (0, 1)$, $k = 1, \ldots, l$.

The stochastic neuron is defined as

$$h(z, \xi) := \begin{cases} 1 & \text{if } z \ge \xi \\ 0 & \text{if } z < \xi \end{cases} \qquad (5)$$

where $\xi \sim \mathcal{U}(0, 1)$. Then $h(z, \xi) \sim \mathcal{B}(z)$, because $P(h(z, \xi) = 1) = z$.

Replacing $h_k \sim \mathcal{B}(\sigma(w_k^T x))$ with $h_k(\sigma(w_k^T x), \xi_k)$ in equation (4) gives

$$H(\Theta) = \sum_x H(\Theta; x) := \sum_x \mathbb{E}_\xi\left[ \ell(h, x) \right], \qquad (6)$$

where $\ell(h, x) = -\log p(x, h(\sigma(W^T x), \xi)) + \log q(h(\sigma(W^T x), \xi)\,|\,x)$.
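The stochastic neuron in equation (5) is easy to check empirically: the fraction of ones should approach $z$. A minimal sketch:

```python
import numpy as np

# Empirical check of the stochastic neuron in equation (5):
# h(z, xi) = 1 iff z >= xi with xi ~ U(0, 1), so P(h = 1) = z.

rng = np.random.default_rng(0)

def stochastic_neuron(z, xi):
    return (z >= xi).astype(float)

z = 0.3
xi = rng.random(100_000)                # xi ~ U(0, 1)
h = stochastic_neuron(z, xi)            # h.mean() is close to z
```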
Theory
Proposition 1 (Neighborhood Preservation)
If $\|U\|_F$ is bounded, then the Gaussian reconstruction error $\|x - Uh_x\|^2$ is a surrogate for Euclidean neighborhood preservation.
Definition 2 (Distributional derivative)
Let $\Omega \subset \mathbb{R}^d$ be an open set, $C_0^\infty(\Omega)$ denote the space of functions that are infinitely differentiable with compact support in $\Omega$, and $\mathcal{D}'(\Omega)$ be the space of continuous linear functionals on $C_0^\infty(\Omega)$. Let $u \in \mathcal{D}'(\Omega)$; then a distribution $v$ is called the distributional derivative of $u$, denoted $v = Du$, if it satisfies

$$\int_\Omega v \varphi \, dx = -\int_\Omega u \, \partial\varphi \, dx, \quad \forall \varphi \in C_0^\infty(\Omega).$$
Lemma 3
For a given sample $x$, the distributional derivative of the function $H(\Theta; x)$ w.r.t. $W$ is given by

$$D_W H(\Theta; x) = \mathbb{E}_\xi\left[ \Delta_h \ell(h(\sigma(W^T x), \xi))\, \sigma(W^T x) \bullet (1 - \sigma(W^T x))\, x^T \right], \qquad (7)$$

where $\bullet$ denotes the point-wise product and $\Delta_h \ell(h)$ denotes the finite difference defined as $[\Delta_h \ell(h)]_k = \ell(h_k^1) - \ell(h_k^0)$, with $[h_k^i]_j = h_j$ if $j \ne k$, otherwise $[h_k^i]_k = i$, $i \in \{0, 1\}$.
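The finite difference $\Delta_h \ell(h)$ in Lemma 3 flips each bit of $h$ to 1 and to 0 and differences the losses, which is why the exact estimator needs two forward passes per bit. A sketch with a stand-in quadratic loss (not the model's actual $\ell$):

```python
import numpy as np

# Finite difference from Lemma 3: [Delta_h l(h)]_k = l(h with bit k set
# to 1) - l(h with bit k set to 0). The quadratic loss is an illustrative
# stand-in for the model's actual l(h, x).

rng = np.random.default_rng(0)
l = 8
A = rng.normal(size=(l, l))

def loss(h):
    return float(h @ A @ h)             # placeholder loss

def delta_h(loss, h):
    d = np.empty(len(h))
    for k in range(len(h)):
        h1, h0 = h.copy(), h.copy()
        h1[k], h0[k] = 1.0, 0.0
        d[k] = loss(h1) - loss(h0)      # difference of the two bit settings
    return d

h = (rng.random(l) < 0.5).astype(float)
d = delta_h(loss, h)                    # one entry per bit
```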
Distributional derivative of Stochastic Neuron
$H(\Theta; x)$ is not differentiable w.r.t. $W$ because $h(z, \xi)$ is discontinuous. An unbiased stochastic estimator of the gradient of $H$ at $\Theta^i$, using the sample $(x_i, \xi_i)$, is given as

$$\nabla_\Theta H(\Theta^i; x_i) = \left[ D_W H(\Theta^i; x_i),\ \nabla_{U,\beta,\rho} H(\Theta^i; x_i) \right] \qquad (8)$$

The estimator of $D_W H(\Theta; x)$ needs two forward passes of the model for each dimension of $\Theta$. An approximate distributional derivative $\hat{D}_W H(\Theta; x)$ can compute each dimension with only one forward pass:

$$\hat{D}_W H(\Theta; x) := \mathbb{E}_\xi\left[ \nabla_h \ell(h(\sigma(W^T x), \xi))\, \sigma(W^T x) \bullet (1 - \sigma(W^T x))\, x^T \right] \qquad (9)$$

The approximate stochastic estimator of the gradient of $H$ is given as

$$\hat{\nabla}_\Theta H(\Theta^i; x_i) = \left[ \hat{D}_W H(\Theta^i; x_i),\ \nabla_{U,\beta,\rho} H(\Theta^i; x_i) \right] \qquad (10)$$
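A one-sample sketch of the approximate derivative (9) for the Gaussian model, replacing the finite difference with $\nabla_h \ell$; dimensions and parameter values are illustrative assumptions, and additive constants are dropped:

```python
import numpy as np

# One-sample sketch of the approximate derivative (9) for the Gaussian
# model: the finite difference is replaced by nabla_h l(h), so a single
# forward pass suffices. Constants are dropped; all values illustrative.

rng = np.random.default_rng(0)
d, l, rho = 16, 8, 1.0
U = rng.normal(size=(d, l))
W = rng.normal(size=(d, l))
beta = np.zeros(l)
x = rng.normal(size=d)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

z = sigmoid(W.T @ x)                    # sigma(W^T x)
h = (z >= rng.random(l)).astype(float)  # stochastic neuron sample

# nabla_h of l(h, x) = -log p(x, h) + log q(h | x):
grad_h = U.T @ (U @ h - x) / rho**2 - beta + W.T @ x

# hat{D}_W H: chain through sigma'(W^T x) = z (1 - z), outer with x.
D_W = np.outer(x, grad_h * z * (1 - z))   # shape (d, l), matching W
```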
Distributional SGD Algorithm
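The algorithm box from this slide did not survive the transcript. A compact toy sketch of what distributional SGD might look like for the Gaussian model, pairing the one-pass approximate derivative (9) with ordinary SGD updates; $\beta$ and $\rho$ are held fixed at 0 and 1, and all dimensions, learning rates, and data are illustrative assumptions:

```python
import numpy as np

# Toy distributional SGD loop for the Gaussian model: sample x, pass it
# through the stochastic neuron, update W with the one-pass approximate
# derivative, and update U with the ordinary gradient of -log p(x | h).

rng = np.random.default_rng(0)
d, l, lr = 16, 8, 1e-3
X = rng.normal(size=(1000, d))          # toy training set
U = 0.1 * rng.normal(size=(d, l))
W = 0.1 * rng.normal(size=(d, l))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for step in range(2000):
    x = X[rng.integers(len(X))]
    z = sigmoid(W.T @ x)
    h = (z >= rng.random(l)).astype(float)          # stochastic neuron

    grad_h = U.T @ (U @ h - x) + W.T @ x            # nabla_h l(h, x)
    W -= lr * np.outer(x, grad_h * z * (1 - z))     # approximate D_W H
    U -= lr * np.outer(U @ h - x, h)                # gradient in U

# Mean squared reconstruction error of the learned MAP codes.
codes = (np.sign(X @ W) + 1) / 2
recon = np.mean(np.sum((X - codes @ U.T) ** 2, axis=1))
```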
Convergence of Distributional SGD
Proposition 4
The distributional derivative $D_W H(\Theta; x)$ is equivalent to the traditional gradient $\nabla_W H(\Theta; x)$.
Theorem 5 (Convergence of Exact Distributional SGD)
Under the assumption that $H$ is $L$-Lipschitz smooth and the variance of the stochastic distributional gradient (8) is bounded by $\sigma^2$ in the distributional SGD, for the solution $\Theta^R$ sampled from the trajectory $\{\Theta^i\}_{i=1}^t$ with probability $P(R = i) = \frac{2\gamma_i - L\gamma_i^2}{\sum_{i=1}^t (2\gamma_i - L\gamma_i^2)}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have

$$\mathbb{E}\left[ \|\nabla_\Theta H(\Theta^R)\|^2 \right] \sim O\!\left( \frac{1}{\sqrt{t}} \right).$$
Theorem 6 (Convergence of Approximate Distributional SGD)
Under the assumption that the variance of the approximate stochastic distributional gradient (10) is bounded by $\sigma^2$, for the solution $\Theta^R$ sampled from the trajectory $\{\Theta^i\}_{i=1}^t$ with probability $P(R = i) = \frac{\gamma_i}{\sum_{i=1}^t \gamma_i}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have

$$\mathbb{E}\left[ (\Theta^R - \Theta^*)^T \nabla_\Theta H(\Theta^R) \right] \sim O\!\left( \frac{1}{\sqrt{t}} \right),$$

where $\Theta^*$ denotes the optimal solution.
Reconstruction Error and Training Time
Figure: Comparison of Stochastic Generative Hashing (SGH) against iterativequantization (ITQ) and binary autoencoder (BA). Convergence of reconstruction errorwith number of SGD training samples. Training time comparison between BA and SGHon SIFT-1M dataset over different bit lengths
Large Scale Nearest Neighbor Retrieval
Figure: Comparison of SGH, ITQ, K-means hashing (KMH), spectral hashing (SH), spherical hashing (SpH), binary autoencoder (BA), and scalable graph hashing (GH). L2NNS comparison on MNIST, SIFT-1M, GIST-1M, and SIFT-1B with varying binary code lengths. Performance is measured by Recall 10@M (the fraction of the top-10 ground-truth neighbors found among the M retrieved items) as M increases to 1000.
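The Recall 10@M metric described in the caption can be computed directly. A sketch on illustrative random data with random stand-in codes (a real evaluation would use the learned hash codes):

```python
import numpy as np

# Recall 10@M: fraction of a query's 10 true L2 neighbors that appear
# among the M candidates retrieved by Hamming distance over binary codes.
# Data, codes, and sizes are illustrative random values.

rng = np.random.default_rng(0)
n, d, l, M = 2000, 16, 32, 100
X = rng.normal(size=(n, d))
codes = rng.integers(0, 2, size=(n, l))
q_x = rng.normal(size=d)
q_code = rng.integers(0, 2, size=l)

true_top10 = np.argsort(np.sum((X - q_x) ** 2, axis=1))[:10]
retrieved = np.argsort(np.sum(codes != q_code, axis=1))[:M]
recall = len(set(true_top10) & set(retrieved)) / 10
```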
Reconstruction Visualization
Figure: Original and reconstructed samples using 64 bits for SGH and ITQ, and 64 real components (64 × 32 bits) for PCA. The original MNIST image uses 28 × 28 × 8 bits; the original CIFAR-10 image uses 30 × 30 × 24 bits.