Efficient Feature Learning Using Perturb-and-MAP
Ke Li, Kevin Swersky and Richard Zemel

Abstract
• Perturb-and-MAP [1] has been shown to be effective for pairwise MRFs, yet its application to other kinds of graphical models has been limited.
• We demonstrate that Perturb-and-MAP is effective at learning features using graphical models with complex dependencies between variables.
• We also propose a method of designing perturbations so that the distribution induced by Perturb-and-MAP better approximates the Gibbs distribution.
Cardinality RBM
• The cardinality restricted Boltzmann machine (CaRBM) enforces a sparsity constraint over the hidden units:

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\left(\mathbf{h}^T W \mathbf{v} + \mathbf{b}^T \mathbf{h} + \mathbf{c}^T \mathbf{v}\right) \psi_k\!\left(\sum_i h_i\right)$$

where $\psi_k(x) = 1$ if $x \le k$ and $0$ otherwise.
• Training requires sampling from $P(\mathbf{h} \mid \mathbf{v})$, which is non-trivial because the hidden units are not conditionally independent of each other.
• Swersky et al. [2] proposed a method to compute $P(\mathbf{h} \mid \mathbf{v})$ using message passing in $O(kF)$ time, where $F$ is the number of hidden units.
• Using Perturb-and-MAP, if the input to each hidden unit is perturbed with Logistic(0,1) noise and MAP inference is performed using a selection algorithm, an approximate sample can be drawn in $O(F)$ time.
• We found that the features learned by Perturb-and-MAP have greater discriminative capability.
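The O(F)-time sampler described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `np.argpartition` stands in for the linear-time selection step, and the variable names `W`, `b`, `v` are assumptions mirroring the energy function.

```python
import numpy as np

def carbm_perturb_and_map(W, b, v, k, rng):
    """Draw an approximate Perturb-and-MAP sample from P(h | v) of a CaRBM.

    The input to each hidden unit is perturbed with Logistic(0,1) noise;
    MAP under the cardinality constraint sum(h) <= k then amounts to
    switching on the (at most k) units with the largest positive
    perturbed inputs, found by an O(F) selection.
    """
    a = W @ v + b                                    # input to each of F hidden units
    a_perturbed = a + rng.logistic(0.0, 1.0, size=a.shape)
    h = np.zeros_like(a)
    if k > 0:
        # Linear-time selection of the k largest perturbed inputs.
        top_k = np.argpartition(-a_perturbed, k - 1)[:k]
        # A unit contributes negative energy only if its perturbed input is positive.
        h[top_k] = (a_perturbed[top_k] > 0).astype(a.dtype)
    return h
```

For k = 1 this reduces to the familiar Gumbel-style argmax perturbation; the Logistic(0,1) noise arises because a binary unit's on/off decision is a difference of two perturbed states.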
Bipartite Matching
• Many tasks involve predicting the correct matching in a bipartite graph, such as image stitching, stereo reconstruction and video tracking.
• Our aim is to learn a descriptor for image patches that is tailored to matching key points across images.
• Our bipartite matching model is characterized by:

$$P(M; \theta) = \frac{1}{Z} \exp\left(-\frac{1}{2N} \sum_{i,j} m_{ij} \left\| \phi(x_i; \theta) - \phi(x'_j; \theta) \right\|_2^2\right) \prod_i \psi\!\left(\sum_j m_{ij}\right) \prod_j \psi\!\left(\sum_i m_{ij}\right)$$

where $\psi(x) = 1$ if $x = 1$ and $0$ otherwise, and $m_{ij} = 1$ if the $i$th and $j$th key points match and $0$ otherwise.
• Training requires estimating an expectation over $M$ using a sample from $P(M; \theta)$.
• Since computing the partition function of $P(M; \theta)$ is #P-hard, sampling from $P(M; \theta)$ is challenging.
• If the model is perturbed with noise from the right distribution, approximate samples can be drawn in $O(N^3)$ time using the Hungarian algorithm.
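The last step above can be sketched as follows, with `scipy.optimize.linear_sum_assignment` playing the role of the Hungarian algorithm. Gumbel(0,1) noise on each pairwise potential is used purely for illustration; the designed perturbations of the next section would take its place.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_matching(phi_x, phi_xp, rng):
    """Approximate Perturb-and-MAP sample from the bipartite matching model.

    cost[i, j] holds the energy of matching key point i to key point j',
    i.e. the scaled squared distance between their descriptors. Perturbing
    the negative energies and running MAP inference reduces to a
    minimum-cost perfect matching, solvable in O(N^3) time.
    """
    N = phi_x.shape[0]
    diff = phi_x[:, None, :] - phi_xp[None, :, :]
    cost = (diff ** 2).sum(axis=-1) / (2 * N)       # pairwise energies
    noise = rng.gumbel(0.0, 1.0, size=cost.shape)   # illustrative perturbation
    rows, cols = linear_sum_assignment(cost - noise)
    M = np.zeros((N, N), dtype=int)
    M[rows, cols] = 1                               # one-hot matching matrix
    return M
```

The returned matrix always satisfies the row and column constraints enforced by the $\psi$ factors, since the assignment solver produces a permutation.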
Designing Perturbations
• If the negative energy of each joint configuration is perturbed with i.i.d. Gumbel(0,1) noise, exact samples can be drawn from the Gibbs distribution using Perturb-and-MAP.
• In practice, reduced-order perturbations must be used to ensure tractability. As a result, the negative perturbed energies across joint configurations are no longer independent or Gumbel-distributed. We propose a way of designing perturbations so that the latter property is preserved.
• The negative perturbed energy of each joint configuration is distributed according to the sum of the individual perturbations.
• We find a distribution $D(1)$ using numerical deconvolution that satisfies the following property: if $X \sim \mathrm{Gumbel}(0,1)$ and $Y \sim D(1)$ are independent, then $X + Y \sim \mathrm{Gumbel}(0,2)$.
• Define $D(s)$ as a scaled version of $D(1)$. Then if $X \sim \mathrm{Gumbel}(0, 2^{-(N-1)})$ and $Y_i \sim D(2^{-i})$ for all $i \in \{1, \ldots, N-1\}$ are independent, $X + \sum_i Y_i \sim \mathrm{Gumbel}(0,1)$. Thus, by perturbing the model with noise from these distributions, the negative perturbed energy of each joint configuration is guaranteed to follow a Gumbel(0,1) distribution.
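One way to carry out such a numerical deconvolution is via characteristic functions; the poster does not specify the authors' method, so the following is a sketch under that assumption. Since $\mathrm{Gumbel}(0,\beta)$ has characteristic function $\Gamma(1 - i\beta t)$, the characteristic function of $D(1)$ must be $\Gamma(1 - 2it)/\Gamma(1 - it)$, which can be inverted numerically:

```python
import numpy as np
from scipy.special import gamma

def d1_pdf(x, t_max=40.0, n_t=8192):
    """Approximate pdf of D(1), defined so that (independently)
    Gumbel(0,1) + D(1) = Gumbel(0,2) in distribution.

    Gumbel(0, beta) has characteristic function Gamma(1 - i*beta*t), so
    phi_D(t) = Gamma(1 - 2it) / Gamma(1 - it); invert it by quadrature:
    f(x) = (1 / 2pi) * integral of phi_D(t) * exp(-i t x) dt.
    """
    t = np.linspace(-t_max, t_max, n_t)
    dt = t[1] - t[0]
    phi = gamma(1.0 - 2j * t) / gamma(1.0 - 1j * t)   # cf of D(1); decays fast
    x = np.atleast_1d(np.asarray(x, dtype=float))
    f = np.array([(phi * np.exp(-1j * t * xi)).sum().real for xi in x])
    return f * dt / (2.0 * np.pi)
```

A quick sanity check on the result: the recovered density should integrate to 1, and its mean should equal the Euler–Mascheroni constant $\gamma \approx 0.577$, since $\mathbb{E}[D(1)] = 2\gamma - \gamma$ by linearity.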
Figure 1: The pdfs of Gumbel(0,1) and D(1)
Figure 2a: Comparison of reconstruction errors
Figure 2b: Comparison of prediction errors
Figure 4: Comparison of test error rates
Ongoing Research
• We are exploring ways of combining D-perturbations to obtain perturbations with equal entropy while ensuring the negative perturbed energies are approximately Gumbel-distributed.
• We are also investigating how closely the empirical marginals over configurations produced using different perturbation methods approximate the underlying Gibbs distribution.
Figure 3: Two frames and the ground-truth matching from the dataset
Perturb-and-MAP
• Perturb-and-MAP is an approximate sampling method that leverages existing optimization algorithms for performing MAP inference.
• It works by perturbing the potentials with random noise and then performing MAP inference on the model with the perturbed potentials.
• It relies on the following fact: if $\epsilon_1, \ldots, \epsilon_n \sim \text{i.i.d. } \mathrm{Gumbel}(0,1)$, then

$$P\left(a_k + \epsilon_k = \max_i \,(a_i + \epsilon_i)\right) = \frac{\exp(a_k)}{\sum_i \exp(a_i)}$$

• If the negative energy of each joint configuration is perturbed, Perturb-and-MAP yields an exact sample.
• In a pairwise MRF, perturbing the unary and pairwise potentials has been shown empirically to produce results similar to those from perturbing each joint configuration.
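The fact above is the Gumbel-max trick, and it is easy to verify empirically. A minimal sketch with illustrative numbers:

```python
import numpy as np

def gumbel_max_sample(a, rng):
    """Sample index k with probability exp(a_k) / sum_i exp(a_i)
    by perturbing each a_i with Gumbel(0,1) noise and taking the argmax."""
    return int(np.argmax(a + rng.gumbel(0.0, 1.0, size=a.shape)))

rng = np.random.default_rng(0)
a = np.log(np.array([0.2, 0.5, 0.3]))   # chosen so that softmax(a) = [0.2, 0.5, 0.3]
counts = np.bincount(
    [gumbel_max_sample(a, rng) for _ in range(200_000)], minlength=3
)
freqs = counts / counts.sum()           # empirical frequencies, approx [0.2, 0.5, 0.3]
```

With 200,000 draws the empirical frequencies match the softmax probabilities to within about a standard error of 0.001 per category.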
References
[1] George Papandreou and Alan L. Yuille (2011). Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. ICCV.
[2] Kevin Swersky, Danny Tarlow, Ilya Sutskever, Ruslan Salakhutdinov, Rich Zemel, and Ryan Adams (2012). Cardinality restricted Boltzmann machines. NIPS 25.
{keli,kswersky,zemel}@cs.toronto.edu