Transcript of slides hosted at ece.duke.edu/~lcarin/Ikenna9.8.2017.pdf (Duke ECE)
Stochastic Generative Hashing
B. Dai1, R. Guo2, S. Kumar2, N. He3 and L. Song1
1Georgia Institute of Technology, 2Google Research, NYC, 3University of Illinois at Urbana-Champaign
Discussion by Ikenna Odinaka, September 8, 2017
Outline
1. Introduction
2. Stochastic Generative Hashing
3. Distributional SGD
4. Experiments
Binary Hashing
Represent real-valued signals $x \in \mathbb{R}^d$ using binary vectors (hash codes) $h \in \{0, 1\}^l$.

- Faster search and retrieval, e.g., L2 Nearest Neighbor Search (L2NNS)
- Reduces time and storage requirements
- Search with binary vectors based on Hamming distance can be done efficiently
- Data-dependent versus randomized hashing
- Previous approaches to data-dependent hashing have two major shortcomings:
  - Heuristically constructed objective functions
  - Binary constraints crudely handled by relaxation, leading to subpar results
Stochastic Generative Hashing tackles both problems
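The efficiency claim above can be made concrete: with packed binary codes, Hamming distance reduces to XOR plus popcount. A minimal sketch with illustrative random codes (not learned hash codes); all sizes are assumptions:

```python
import numpy as np

# Sketch of Hamming-distance search over binary codes: pack the bits and
# compute XOR + popcount, which is much cheaper than float L2 distance.
# Database size, code length, and codes are illustrative random values.

rng = np.random.default_rng(0)
n, l = 10000, 64                           # database size, bits per code
db_codes = rng.integers(0, 2, size=(n, l), dtype=np.uint8)
query = rng.integers(0, 2, size=l, dtype=np.uint8)

db_packed = np.packbits(db_codes, axis=1)  # (n, l // 8) bytes
q_packed = np.packbits(query)              # (l // 8,) bytes

# Hamming distance = number of set bits in the XOR of the packed codes.
xor = np.bitwise_xor(db_packed, q_packed)
dists = np.unpackbits(xor, axis=1).sum(axis=1)

nearest = int(np.argmin(dists))            # best match under Hamming distance
```

In practice the popcount runs on 64-bit words with hardware instructions; the `unpackbits` route here is only for clarity.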
Generative Model p(x |h)
$p(x|h)$ can be chosen based on the problem domain; a simple Gaussian distribution for $p(x|h)$ achieves state-of-the-art performance. The joint distribution is given by

$$p(x, h) = p(x|h)\,p(h), \qquad (1)$$

where $p(x|h) = \mathcal{N}(Uh, \rho^2 I)$, $p(h) \sim \mathcal{B}(\theta) = \prod_{i=1}^{l} \theta_i^{h_i} (1 - \theta_i)^{1 - h_i}$, $U = [u_i]_{i=1}^{l}$ with $u_i \in \mathbb{R}^d$ is a codebook with $l$ codewords, and $\theta = [\theta_i]_{i=1}^{l} \in [0, 1]^l$.

The joint distribution can be written as

$$p(x, h) \propto \exp\left( -\frac{1}{2\rho^2} \|x - Uh\|_2^2 + \left( \log \frac{\theta}{1 - \theta} \right)^T h \right). \qquad (2)$$
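Equations (1) and (2) define an ancestral sampling procedure: draw $h$ from the Bernoulli prior, then $x$ from a Gaussian around the codeword combination $Uh$. A minimal sketch with illustrative toy dimensions and parameter values:

```python
import numpy as np

# Sample (h, x) from the generative model p(x, h) = p(x|h) p(h).
# U, theta, rho, and the dimensions are illustrative assumptions.

rng = np.random.default_rng(0)
d, l = 16, 8                            # data dimension, code length
U = rng.normal(size=(d, l))             # codebook: l codewords in R^d
theta = np.full(l, 0.5)                 # Bernoulli prior parameters
rho = 0.1                               # Gaussian noise scale

h = (rng.random(l) < theta).astype(float)   # h ~ B(theta)
x = U @ h + rho * rng.normal(size=d)        # x | h ~ N(Uh, rho^2 I)
```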
Encoding Model q(h|x)
Computing the posterior $p(h|x)$ is intractable, and getting the MAP solution from $p(h|x)$ requires integer programming. Approximate $p(h|x)$ using $q(h|x)$ as the encoder:

$$q(h|x) = \prod_{k=1}^{l} q(h_k = 1|x)^{h_k}\, q(h_k = 0|x)^{1 - h_k} \qquad (3)$$

Use the linear parametrization $h = [h_k]_{k=1}^{l} \sim \mathcal{B}(\sigma(W^T x))$, where $W = [w_k]_{k=1}^{l} \in \mathbb{R}^{d \times l}$.

The MAP solution of $q(h|x)$ is

$$h(x) = \operatorname*{argmax}_h q(h|x) = \frac{\operatorname{sign}(W^T x) + 1}{2}$$
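The MAP encoder is just a linear map followed by a sign. A sketch with assumed random $W$ and $x$:

```python
import numpy as np

# MAP hash code h(x) = (sign(W^T x) + 1) / 2: bit k is 1 exactly when
# w_k^T x > 0. W and x here are illustrative random values.

rng = np.random.default_rng(0)
d, l = 16, 8
W = rng.normal(size=(d, l))
x = rng.normal(size=d)

def hash_code(x, W):
    return (np.sign(W.T @ x) + 1) / 2

h = hash_code(x, W)                     # binary vector of length l
```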
Training Objective
The objective function is based on the Minimal Description Length (MDL) principle, which tries to minimize the expected amount of information needed to communicate $x$:

$$L(x) = \sum_h q(h|x)\left( L(h) + L(x|h) \right),$$

where $L(h) = -\log p(h) + \log q(h|x)$ is the description length of the hash code $h$, and $L(x|h) = -\log p(x|h)$ is the description length of $x$ with $h$ already known.

For a given bit size $l$, the MDL principle finds the best parameters to maximally compress the training data. The training objective is

$$\min_{\Theta = \{W, U, \beta, \rho\}} H(\Theta) := \sum_x H(\Theta; x) = -\sum_x \sum_h q(h|x)\left( \log p(x, h) - \log q(h|x) \right), \qquad (4)$$

where $\beta := \log \frac{\theta}{1 - \theta}$. $W$ comes from $q(h|x)$, while $U$, $\beta$, $\rho$ come from $p(x, h)$.

The objective function is also called the Helmholtz (variational) free energy.
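Equation (4) can be estimated per sample by Monte Carlo: draw $h \sim q(h|x)$ and average $-\log p(x,h) + \log q(h|x)$. A sketch under the Gaussian model, dropping additive constants of $\log p$; all parameter values are illustrative:

```python
import numpy as np

# Monte Carlo estimate of the per-sample free energy in equation (4):
# draw h ~ q(h|x) and average -log p(x, h) + log q(h|x). The Gaussian
# model's constant terms are dropped, and all values are illustrative.

rng = np.random.default_rng(0)
d, l, rho = 16, 8, 1.0
U = rng.normal(size=(d, l))
W = rng.normal(size=(d, l))
beta = np.zeros(l)                      # beta = log(theta / (1 - theta))
x = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(x, n_samples=1000):
    q = sigmoid(W.T @ x)                # q(h_k = 1 | x)
    total = 0.0
    for _ in range(n_samples):
        h = (rng.random(l) < q).astype(float)            # h ~ q(h|x)
        log_p = -np.sum((x - U @ h) ** 2) / (2 * rho**2) + beta @ h
        log_q = np.sum(h * np.log(q) + (1 - h) * np.log(1 - q))
        total += -log_p + log_q
    return total / n_samples

H_x = free_energy(x)                    # estimate of H(Theta; x)
```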
Reparametrization via Stochastic Neuron
Computing gradients w.r.t. $W$ requires back-propagating through stochastic nodes of the binary variable $h$. REINFORCE can be used, but it suffers from high variance, and REINFORCE with variance-reduction techniques is either biased or expensive to compute. Instead, reparametrize the Bernoulli distribution using a stochastic neuron: $h_k(z)$ is reparametrized with $z \in (0, 1)$, $k = 1, \ldots, l$.

The stochastic neuron is defined as

$$h(z, \xi) := \begin{cases} 1 & \text{if } z \ge \xi \\ 0 & \text{if } z < \xi \end{cases} \qquad (5)$$

where $\xi \sim \mathcal{U}(0, 1)$. Then $h(z, \xi) \sim \mathcal{B}(z)$, because $P(h(z, \xi) = 1) = z$.

Replacing $h_k \sim \mathcal{B}(\sigma(w_k^T x))$ with $h_k(\sigma(w_k^T x), \xi_k)$ in equation (4) gives

$$H(\Theta) = \sum_x H(\Theta; x) := \sum_x \mathbb{E}_\xi\left[ \ell(h, x) \right], \qquad (6)$$

where $\ell(h, x) = -\log p(x, h(\sigma(W^T x), \xi)) + \log q(h(\sigma(W^T x), \xi)\,|\,x)$.
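The stochastic neuron in equation (5) is easy to check empirically: the fraction of ones should approach $z$. A minimal sketch:

```python
import numpy as np

# Empirical check of the stochastic neuron in equation (5):
# h(z, xi) = 1 iff z >= xi with xi ~ U(0, 1), so P(h = 1) = z.

rng = np.random.default_rng(0)

def stochastic_neuron(z, xi):
    return (z >= xi).astype(float)

z = 0.3
xi = rng.random(100_000)                # xi ~ U(0, 1)
h = stochastic_neuron(z, xi)            # h.mean() is close to z
```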
Theory
Proposition 1 (Neighborhood Preservation)
If $\|U\|_F$ is bounded, then the Gaussian reconstruction error $\|x - Uh_x\|^2$ is a surrogate for Euclidean neighborhood preservation.
Definition 2 (Distributional derivative)
Let $\Omega \subset \mathbb{R}^d$ be an open set, $C_0^\infty(\Omega)$ denote the space of functions that are infinitely differentiable with compact support in $\Omega$, and $\mathcal{D}'(\Omega)$ be the space of continuous linear functionals on $C_0^\infty(\Omega)$. Let $u \in \mathcal{D}'(\Omega)$; then a distribution $v$ is called the distributional derivative of $u$, denoted $v = Du$, if it satisfies

$$\int_\Omega v \varphi \, dx = -\int_\Omega u \, \partial\varphi \, dx, \quad \forall \varphi \in C_0^\infty(\Omega).$$
Lemma 3
For a given sample $x$, the distributional derivative of the function $H(\Theta; x)$ w.r.t. $W$ is given by

$$D_W H(\Theta; x) = \mathbb{E}_\xi\left[ \Delta_h \ell(h(\sigma(W^T x), \xi))\, \sigma(W^T x) \bullet (1 - \sigma(W^T x))\, x^T \right], \qquad (7)$$

where $\bullet$ denotes the point-wise product and $\Delta_h \ell(h)$ denotes the finite difference defined as $[\Delta_h \ell(h)]_k = \ell(h_k^1) - \ell(h_k^0)$, with $[h_k^i]_j = h_j$ if $j \ne k$, otherwise $[h_k^i]_k = i$, $i \in \{0, 1\}$.
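The finite difference $\Delta_h \ell(h)$ in Lemma 3 flips each bit of $h$ to 1 and to 0 and differences the losses, which is why the exact estimator needs two forward passes per bit. A sketch with a stand-in quadratic loss (not the model's actual $\ell$):

```python
import numpy as np

# Finite difference from Lemma 3: [Delta_h l(h)]_k = l(h with bit k set
# to 1) - l(h with bit k set to 0). The quadratic loss is an illustrative
# stand-in for the model's actual l(h, x).

rng = np.random.default_rng(0)
l = 8
A = rng.normal(size=(l, l))

def loss(h):
    return float(h @ A @ h)             # placeholder loss

def delta_h(loss, h):
    d = np.empty(len(h))
    for k in range(len(h)):
        h1, h0 = h.copy(), h.copy()
        h1[k], h0[k] = 1.0, 0.0
        d[k] = loss(h1) - loss(h0)      # difference of the two bit settings
    return d

h = (rng.random(l) < 0.5).astype(float)
d = delta_h(loss, h)                    # one entry per bit
```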
Distributional derivative of Stochastic Neuron
$H(\Theta; x)$ is not differentiable w.r.t. $W$ because $h(z, \xi)$ is discontinuous. An unbiased stochastic estimator of the gradient of $H$ at $\Theta^i$, using the sample $(x_i, \xi_i)$, is given as

$$\nabla_\Theta H(\Theta^i; x_i) = \left[ D_W H(\Theta^i; x_i),\ \nabla_{U,\beta,\rho} H(\Theta^i; x_i) \right] \qquad (8)$$

The estimator of $D_W H(\Theta; x)$ needs two forward passes of the model for each dimension of $\Theta$. An approximate distributional derivative $\hat{D}_W H(\Theta; x)$ can compute each dimension with only one forward pass:

$$\hat{D}_W H(\Theta; x) := \mathbb{E}_\xi\left[ \nabla_h \ell(h(\sigma(W^T x), \xi))\, \sigma(W^T x) \bullet (1 - \sigma(W^T x))\, x^T \right] \qquad (9)$$

The approximate stochastic estimator of the gradient of $H$ is given as

$$\hat{\nabla}_\Theta H(\Theta^i; x_i) = \left[ \hat{D}_W H(\Theta^i; x_i),\ \nabla_{U,\beta,\rho} H(\Theta^i; x_i) \right] \qquad (10)$$
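A one-sample sketch of the approximate derivative (9) for the Gaussian model, replacing the finite difference with $\nabla_h \ell$; dimensions and parameter values are illustrative assumptions, and additive constants are dropped:

```python
import numpy as np

# One-sample sketch of the approximate derivative (9) for the Gaussian
# model: the finite difference is replaced by nabla_h l(h), so a single
# forward pass suffices. Constants are dropped; all values illustrative.

rng = np.random.default_rng(0)
d, l, rho = 16, 8, 1.0
U = rng.normal(size=(d, l))
W = rng.normal(size=(d, l))
beta = np.zeros(l)
x = rng.normal(size=d)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

z = sigmoid(W.T @ x)                    # sigma(W^T x)
h = (z >= rng.random(l)).astype(float)  # stochastic neuron sample

# nabla_h of l(h, x) = -log p(x, h) + log q(h | x):
grad_h = U.T @ (U @ h - x) / rho**2 - beta + W.T @ x

# hat{D}_W H: chain through sigma'(W^T x) = z (1 - z), outer with x.
D_W = np.outer(x, grad_h * z * (1 - z))   # shape (d, l), matching W
```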
Distributional SGD Algorithm
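The algorithm box from this slide did not survive the transcript. A compact toy sketch of what distributional SGD might look like for the Gaussian model, pairing the one-pass approximate derivative (9) with ordinary SGD updates; $\beta$ and $\rho$ are held fixed at 0 and 1, and all dimensions, learning rates, and data are illustrative assumptions:

```python
import numpy as np

# Toy distributional SGD loop for the Gaussian model: sample x, pass it
# through the stochastic neuron, update W with the one-pass approximate
# derivative, and update U with the ordinary gradient of -log p(x | h).

rng = np.random.default_rng(0)
d, l, lr = 16, 8, 1e-3
X = rng.normal(size=(1000, d))          # toy training set
U = 0.1 * rng.normal(size=(d, l))
W = 0.1 * rng.normal(size=(d, l))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

for step in range(2000):
    x = X[rng.integers(len(X))]
    z = sigmoid(W.T @ x)
    h = (z >= rng.random(l)).astype(float)          # stochastic neuron

    grad_h = U.T @ (U @ h - x) + W.T @ x            # nabla_h l(h, x)
    W -= lr * np.outer(x, grad_h * z * (1 - z))     # approximate D_W H
    U -= lr * np.outer(U @ h - x, h)                # gradient in U

# Mean squared reconstruction error of the learned MAP codes.
codes = (np.sign(X @ W) + 1) / 2
recon = np.mean(np.sum((X - codes @ U.T) ** 2, axis=1))
```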
Convergence of Distributional SGD
Proposition 4
The distributional derivative $D_W H(\Theta; x)$ is equivalent to the traditional gradient $\nabla_W H(\Theta; x)$.
Theorem 5 (Convergence of Exact Distributional SGD)
Under the assumption that $H$ is $L$-Lipschitz smooth and the variance of the stochastic distributional gradient (8) is bounded by $\sigma^2$ in the distributional SGD, for the solution $\Theta^R$ sampled from the trajectory $\{\Theta^i\}_{i=1}^t$ with probability $P(R = i) = \frac{2\gamma_i - L\gamma_i^2}{\sum_{i=1}^t (2\gamma_i - L\gamma_i^2)}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have

$$\mathbb{E}\left[ \|\nabla_\Theta H(\Theta^R)\|^2 \right] \sim O\!\left( \frac{1}{\sqrt{t}} \right).$$
Theorem 6 (Convergence of Approximate Distributional SGD)
Under the assumption that the variance of the approximate stochastic distributional gradient (10) is bounded by $\sigma^2$, for the solution $\Theta^R$ sampled from the trajectory $\{\Theta^i\}_{i=1}^t$ with probability $P(R = i) = \frac{\gamma_i}{\sum_{i=1}^t \gamma_i}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have

$$\mathbb{E}\left[ (\Theta^R - \Theta^*)^T \nabla_\Theta H(\Theta^R) \right] \sim O\!\left( \frac{1}{\sqrt{t}} \right),$$

where $\Theta^*$ denotes the optimal solution.
Reconstruction Error and Training Time
Figure: Comparison of Stochastic Generative Hashing (SGH) against iterativequantization (ITQ) and binary autoencoder (BA). Convergence of reconstruction errorwith number of SGD training samples. Training time comparison between BA and SGHon SIFT-1M dataset over different bit lengths
Large Scale Nearest Neighbor Retrieval
Figure: Comparison of SGH, ITQ, K-means hashing (KMH), spectral hashing (SH), spherical hashing (SpH), binary autoencoder (BA), and scalable graph hashing (GH). L2NNS comparison on MNIST, SIFT-1M, GIST-1M, and SIFT-1B with varying binary code lengths. Performance is measured by Recall 10@M (the fraction of the top-10 ground-truth neighbors found among the M retrieved items) as M increases to 1000.
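The Recall 10@M metric described in the caption can be computed directly. A sketch on illustrative random data with random stand-in codes (a real evaluation would use the learned hash codes):

```python
import numpy as np

# Recall 10@M: fraction of a query's 10 true L2 neighbors that appear
# among the M candidates retrieved by Hamming distance over binary codes.
# Data, codes, and sizes are illustrative random values.

rng = np.random.default_rng(0)
n, d, l, M = 2000, 16, 32, 100
X = rng.normal(size=(n, d))
codes = rng.integers(0, 2, size=(n, l))
q_x = rng.normal(size=d)
q_code = rng.integers(0, 2, size=l)

true_top10 = np.argsort(np.sum((X - q_x) ** 2, axis=1))[:10]
retrieved = np.argsort(np.sum(codes != q_code, axis=1))[:M]
recall = len(set(true_top10) & set(retrieved)) / 10
```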
Reconstruction Visualization
Figure: Original and reconstructed samples using 64 bits for SGH and ITQ, and 64 real components (64 × 32 bits) for PCA. The original MNIST image uses 28 × 28 × 8 bits; the original CIFAR-10 image uses 30 × 30 × 24 bits.