Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
Stanford AI Lab
Motivation
Problem: With few instances, learned models aren't robust.
Task: Shape modeling.
[Figure: mean shape and principal components, shown at +1 and -1 standard deviations.]
Transfer Learning
The shape is stabilized, but it doesn't look like an elephant.
Can we use rhinos to help elephants?
[Figure: regularized mean shape and principal components at +1 and -1 standard deviations.]
Hierarchical Bayes
Hierarchy: root → Elephant, Rhino
Prior: P(root); class parameters P(Elephant | root), P(Rhino | root)
Likelihoods: P(Data_Elephant | Elephant), P(Data_Rhino | Rhino)
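Written out, the two-class tree on this slide corresponds to the standard hierarchical-Bayes factorization of the joint (restating the slide's terms, with Θ for each node's parameters):

```latex
P(\Theta, D) = P(\Theta_{\mathrm{root}})\,
  P(\Theta_{\mathrm{Eleph}} \mid \Theta_{\mathrm{root}})\,
  P(\Theta_{\mathrm{Rhino}} \mid \Theta_{\mathrm{root}})\,
  P(D_{\mathrm{Eleph}} \mid \Theta_{\mathrm{Eleph}})\,
  P(D_{\mathrm{Rhino}} \mid \Theta_{\mathrm{Rhino}})
```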
Goals
- Transfer between related classes
- Range of settings, tasks
- Probabilistic motivation
- Multilevel, complex hierarchies
- Simple, efficient computation
- Automatically learn what to transfer
Hierarchical Bayes
Compute the full posterior P(Θ | D)
P(Θ_c | Θ_root) must be conjugate to P(D | Θ_c)
Problem: Often we can't perform the full Bayesian computations.
Approximation: Point Estimation
Empirical Bayes point estimation over the same hierarchy (root → Elephant, Rhino)
Other approximations: posterior as normal, sampling, etc.
The best parameters are good enough; we don't need the full distribution.
More Issues: Multiple Levels
Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart). Exception: Thibeaux and Jordan ('05).
More Issues: Restrictive Priors
Example: Normal-inverse-Wishart prior on Gaussian parameters. The pseudocounts are restricted to be >= d, the dimension. If d is large and N, the number of samples, is small, the signal from the prior overwhelms the data.
We show experiments with N = 3, d = 20.
Alternative: Shrinkage (McCallum et al. '98)
1. Compute the maximum likelihood estimate at each node.
2. "Shrink" each node toward its parent: a linear combination of the node's estimate and its parent's, with the weight chosen by cross-validation.
Pros: Simple to compute; handles multiple levels.
Cons: Naive heuristic for transfer; averaging is not always appropriate.
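The shrinkage rule above is just a convex combination; a minimal sketch (the array values are made-up word frequencies, not data from the talk):

```python
import numpy as np

def shrink(theta_ml, theta_parent, lam):
    # Shrinkage estimate: a convex combination of a node's maximum
    # likelihood estimate and its parent's estimate. In McCallum et
    # al. '98 the weight lam in [0, 1] is chosen by cross-validation.
    return lam * theta_ml + (1.0 - lam) * theta_parent

# Hypothetical word-frequency estimates for a node and its parent.
theta_ml = np.array([0.7, 0.2, 0.1])   # noisy estimate from few samples
theta_pa = np.array([0.4, 0.4, 0.2])   # more stable parent estimate
print(shrink(theta_ml, theta_pa, 0.5))
```

With lam = 0.5 this lands halfway between the two estimates, which illustrates the "cons" above: plain averaging may pull the child toward the parent even for components where they genuinely differ.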
Undirected HB Reformulation
Defines an undirected Markov random field model over the hierarchy parameters Θ
(cf. Probabilistic Abstraction Hierarchies, Segal et al. '01)
Fdata: Encourage parameters to explain the data.
Divergence: Encourage parameters to be similar to their parents.
Undirected Probabilistic Model
[Diagram: root with Elephant and Rhino children; an Fdata potential on each leaf and a Divergence potential on each edge, shown high on one edge and low on the other.]
Purpose of Reformulation
Easy to specify: Fdata can be a likelihood, a classification objective, or another objective; Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc. No conjugacy or proper-prior restrictions.
Easy to optimize: The objective is convex over Θ if Fdata is concave and Divergence is convex.
Task: Categorize Documents
Bag-of-words model; µ_i represents the frequency of word i
Fdata: Multinomial log likelihood (regularized)
Divergence: L2 norm
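A minimal sketch of this kind of per-node objective: a multinomial log likelihood (Fdata, concave in the natural parameters) minus a squared-L2 divergence to the parent's parameters. The function name and parameterization are illustrative, not the paper's exact form:

```python
import numpy as np

def node_objective(theta, counts, theta_parent, div_weight=1.0):
    # F_data: multinomial log likelihood in natural parameters theta
    # (concave in theta); log-sum-exp gives the normalizer.
    log_probs = theta - np.log(np.sum(np.exp(theta)))
    f_data = np.dot(counts, log_probs)
    # Divergence: squared L2 distance to the parent's parameters
    # (convex), so f_data minus the divergence is concave in theta
    # and the point estimate can be found by convex optimization.
    divergence = div_weight * np.sum((theta - theta_parent) ** 2)
    return f_data - divergence
```

Because the sum of a concave Fdata and a negated convex Divergence is concave, maximizing it (or minimizing its negation) is the convex point estimation of the title.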
Application: Text Categorization
Dataset: Newsgroup20
Baselines:
1. Maximum likelihood at each node (no hierarchy)
2. Cross-validated regularization (no hierarchy)
3. Shrinkage (McCallum et al. '98, with hierarchy)
Can It Handle Multiple Levels?
[Figure: Newsgroup topic classification. Classification rate (0.35 to 0.7) vs. total number of training instances (75 to 375), comparing Max Likelihood (no regularization), Regularized Max Likelihood, Shrinkage, and Undirected HB.]
Application: Shape Modeling
Task: Learn shape (density estimation, evaluated by test likelihood). Instances are represented by 60 x-y coordinates of landmarks on the outline.
Parameters: mean landmark location and covariance over landmarks (regularized).
Divergence: L2 norm over mean and covariance
Dataset: Mammals (Fink, '05)
[Figure: mean shape and principal components.]
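For the shape task, the L2 divergence over mean and covariance can be sketched as below; the two-landmark "elephant"/"rhino" values are toy placeholders, not the paper's data:

```python
import numpy as np

def shape_divergence(mu_c, cov_c, mu_p, cov_p, weight=1.0):
    # L2 (squared) divergence between a child's Gaussian shape
    # parameters -- mean landmark location and covariance over
    # landmarks -- and its parent's.
    return weight * (np.sum((mu_c - mu_p) ** 2)
                     + np.sum((cov_c - cov_p) ** 2))

# Toy 2-landmark (4-dimensional) shapes; names are illustrative only.
mu_e, cov_e = np.zeros(4), np.eye(4)
mu_r, cov_r = 0.1 * np.ones(4), np.eye(4)
print(shape_divergence(mu_e, cov_e, mu_r, cov_r))
```

Note this penalizes element-wise differences in both mean and covariance, which is what lets rhino data stabilize the elephant's covariance structure without conjugacy restrictions.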
Does Hierarchy Help?
[Figure: Delta log-loss per instance, relative to Regularized Max Likelihood, vs. total number of training instances (6 to 30), for ten mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino.]
Unregularized max likelihood and shrinkage: much worse, not shown.
Degrees of Transfer
Split the divergence into subcomponents µ_i with weights, allowing different transfer strengths for different subcomponents and different child-parent pairs.
How do we estimate all these weight parameters?
Learning Degrees of Transfer
Bootstrap approach: if a child parameter and its parent have a consistent relationship, encourage them to be similar.
Hyper-prior approach (Bayesian idea): put a prior on λ and add λ as a parameter to the optimization along with Θ.
Concretely: an inverse-Gamma prior, which forces λ to be positive.
Prior on Degree of Transfer: if the likelihood is concave, the entire objective is convex!
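A rough sketch of the divergence side of such an objective: per-subcomponent weights lam on the child-parent differences, plus the negative log of an inverse-Gamma prior on each weight. This is an illustration of the idea only; the paper's exact parameterization of the hyper-prior may differ:

```python
import numpy as np

def weighted_divergence(theta, theta_parent, lam, a=1.0, b=1.0):
    # Divergence with per-subcomponent degrees-of-transfer weights lam
    # (larger lam_i = stronger pull of subcomponent i toward the parent).
    div = np.sum(lam * (theta - theta_parent) ** 2)
    # Negative log of an inverse-Gamma(a, b) prior on each lam_i,
    # which keeps the weights positive: (a+1)*log(lam) + b/lam.
    neg_log_prior = np.sum((a + 1.0) * np.log(lam) + b / lam)
    return div + neg_log_prior
```

Minimizing this jointly over theta and lam lets the model learn, per subcomponent, how much to transfer from the parent rather than fixing one global weight.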
Do Degrees of Transfer Help?
[Figure: Delta log-loss per instance, relative to Regularized Max Likelihood, vs. total number of training instances (6 to 30) for the ten mammal pairs, with the Hyperprior method highlighted for Bison-Rhino.]
Degrees of Transfer
[Figure: Distribution of DOT coefficients learned using the hyperprior, plotted as 1/λ from 0 to 50, running from stronger transfer (small 1/λ) to weaker transfer (large 1/λ), including the root.]
Summary
- Transfer between related classes
- Range of settings, tasks
- Probabilistic motivation
- Multilevel, complex hierarchies
- Simple, efficient computation
- Refined transfer of components