Transcript of Spring 2014 Course
Announcements
Spring Courses Somewhat Relevant to Machine Learning
- 5314: Algorithms for molecular bio (who's teaching?)
- 5446: Chaotic dynamics (Bradley)
- 5454: Algorithms (Frangillo)
- 5502: Data mining (Lv)
- 5753: Computer performance modeling (Grunwald)
- 7000-006: Geospatial data analysis (Caleb Phillips)
- 7000-008: Human-robot interaction (Dan Szafir)
- 7000-009: Data analytics: systems, algorithms, and applications (Lv)
- 7000-021: Bioinformatics (Robin Dowell-Dean)
Homework
Importance sampling via likelihood weighting
Learning In Bayesian Networks: Missing Data And Hidden Variables
Missing Vs. Hidden Variables
Missing:
- often known but absent for certain data points
- missing at random, or missing based on value (e.g., Netflix ratings)

Hidden:
- never observed, but essential for predicting visible variables (e.g., human memory state)
- a.k.a. latent variables
Quiz
“Semisupervised learning” concerns learning where additional input examples are available, but labels are not. According to the model below, will partial data (either X or Y) inform the model parameters?
[Diagrams: a two-node network X → Y with parameters θx, θy|x, and θy|¬x, shown for the different combinations of X known? / Y known?]
Missing Data: Exact Inference In Bayes Net
Y: observed variables; Z: unobserved variables; X = {Y, Z}

How do we do parameter updates for θi in this case?

- If Xi and its Pai are observed, the situation is straightforward (e.g., like the single-coin-toss case).
- If Xi or any of Pai are missing, we need to marginalize over Z.

E.g., Xi ~ Categorical(θij), where θij is the parameter vector for Xi with parent configuration j. For a single incomplete case y, marginalizing over the missing values gives

$$p(\theta_{ij} \mid y) = \big(1 - p(\mathrm{Pa}_i = j \mid y)\big)\, p(\theta_{ij}) \;+\; \sum_{k=1}^{r_i} p(X_i = k, \mathrm{Pa}_i = j \mid y)\; p(\theta_{ij} \mid X_i = k, \mathrm{Pa}_i = j)$$

where ri is the number of values of Xi and k indexes a specific value of Xi. Both p(θij) and p(θij | Xi = k, Pai = j) are Dirichlet. Note: the posterior is a Dirichlet mixture.
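To make the mixture concrete, here is a minimal sketch for a hypothetical two-node binary network X → Y (the network, numbers, and variable names are my illustration, not from the slides), computing the posterior on θ_{Y|X=1} for one case where Y is observed but its parent X is missing:

```python
import numpy as np

# Hypothetical network X -> Y, both binary; case: Y = 1 observed, X missing.
p_x = np.array([0.3, 0.7])          # current estimate of P(X = x)
p_y1_given_x = np.array([0.2, 0.9]) # current estimate of P(Y = 1 | X = x)
alpha = np.array([1.0, 1.0])        # Dirichlet (Beta) counts for theta_{Y|X=1}

y = 1
# Posterior over the missing parent: p(X = x | Y = y)
joint = p_x * (p_y1_given_x if y == 1 else 1 - p_y1_given_x)
p_x_given_y = joint / joint.sum()

# Dirichlet-mixture posterior on theta_{Y|X=1}:
#   with weight p(X != 1 | y): the prior Dirichlet(alpha) is untouched
#   with weight p(X  = 1 | y): Dirichlet(alpha + one count for the observed y)
components = [(p_x_given_y[0], alpha),
              (p_x_given_y[1], alpha + np.array([y == 0, y == 1]))]
for weight, counts in components:
    print(f"weight {weight:.3f}: Dirichlet{tuple(counts)}")
```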
Missing Data: Gibbs Sampling
Given a set of observed, incomplete data, D = {y1, ..., yN}:
1. Fill in arbitrary values for the unobserved variables in each case to obtain a completed data set Dc.
2. For each unobserved variable zi in case n, sample from its conditional given everything else, p(zi | Dc \ zi).
3. Evaluate the posterior density on the completed data Dc'.
4. Repeat steps 2 and 3, and compute the mean of the posterior density.
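A minimal sketch of this loop for a hypothetical binary network X → Y with some X values missing (the data, names, and priors are my illustration; θ is sampled from its Beta posteriors so the chain runs over both parameters and missing values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for X -> Y, both binary; -1 marks a missing X.
X = np.array([1, 0, -1, 1, -1, 0, 1])
Y = np.array([1, 0, 1, 1, 0, 0, 1])
missing = X == -1
X = np.where(missing, rng.integers(0, 2, X.size), X)  # step 1: arbitrary fill

samples = []
for sweep in range(2000):
    # Sample parameters from their posteriors given the completed data
    # (Beta(1,1) priors throughout).
    theta_x = rng.beta(X.sum() + 1, (1 - X).sum() + 1)
    theta_y = np.array([rng.beta((Y[X == x] == 1).sum() + 1,
                                 (Y[X == x] == 0).sum() + 1) for x in (0, 1)])
    # Step 2: resample each missing X from p(x | y, theta)
    for n in np.flatnonzero(missing):
        lik = np.array([1 - theta_x, theta_x]) * \
              (theta_y if Y[n] == 1 else 1 - theta_y)
        X[n] = rng.random() < lik[1] / lik.sum()
    samples.append(theta_x)

# Step 4: average over sweeps (discarding burn-in)
print("posterior mean of theta_x:", np.mean(samples[500:]))
```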
Missing Data: Gaussian Approximation
Approximate $p(\theta \mid D) \propto \exp\big(g(\theta)\big)$ as a multivariate Gaussian, where $g(\theta) \equiv \log\big(p(D \mid \theta)\, p(\theta)\big)$.

Appropriate if the sample size |D| is large (which is also when Monte Carlo methods are inefficient).

1. Find the MAP configuration $\tilde\theta$ by maximizing g(·).
2. Approximate g with a 2nd-degree Taylor polynomial around $\tilde\theta$ (the first-order term vanishes at the maximum):
$$g(\theta) \approx g(\tilde\theta) - \tfrac{1}{2} (\theta - \tilde\theta)^\top A\, (\theta - \tilde\theta)$$
where A is the negative Hessian of g(·) evaluated at $\tilde\theta$.
3. This leads to an approximate result that is Gaussian: $p(\theta \mid D) \approx \mathcal{N}(\tilde\theta, A^{-1})$.
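A minimal numeric sketch of these three steps for a Beta-Bernoulli posterior (the coin data, prior, and finite-difference Hessian are my illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Coin with 7 heads / 3 tails and a Beta(2,2) prior.
heads, tails, a, b = 7, 3, 2.0, 2.0

def g(t):  # log p(D|theta) + log p(theta), up to an additive constant
    return (heads + a - 1) * np.log(t) + (tails + b - 1) * np.log(1 - t)

# 1. MAP configuration by maximizing g
res = minimize_scalar(lambda t: -g(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
t_map = res.x

# 2. A = negative Hessian of g at the MAP (second-order finite difference)
h = 1e-5
A = -(g(t_map + h) - 2 * g(t_map) + g(t_map - h)) / h**2

# 3. Gaussian approximation N(t_map, 1/A); exact mode of Beta(9,5) is 8/12
print(f"p(theta|D) ~ N({t_map:.4f}, {1 / A:.6f})")
```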
Missing Data: Further Approximations
As the data sample size increases, the Gaussian peak becomes sharper, so we can:
- make predictions based on the MAP configuration alone
- ignore the priors (their importance diminishes) -> maximum likelihood

How to do ML estimation:
- Expectation Maximization
- Gradient methods
Expectation Maximization
Scheme for picking values of missing data and hidden variables that maximize the data likelihood.

E.g., the population of the Laughing Goat (a Boulder coffee shop), one observation per customer:
- baby stroller, diapers, lycra pants
- backpack, saggy pants
- baby stroller, diapers
- backpack, computer, saggy pants
- diapers, lycra
- computer, saggy pants
- backpack, saggy pants
Expectation Maximization Formally
V: visible variables
H: hidden variables
θ: model parameters

Model: P(V, H | θ)
Goal: learn the model parameters θ in the absence of H
Approach: find the θ that maximizes P(V | θ) = Σ_H P(V, H | θ)
EM Algorithm (Barber, Chapter 11)
EM Algorithm
Guaranteed to find a local optimum of θ. Sketch of proof:

Bound on the marginal likelihood:

$$\log p(v \mid \theta) \;\geq\; \sum_h q(h \mid v) \log \frac{p(v, h \mid \theta)}{q(h \mid v)}$$

with equality only when q(h|v) = p(h|v,θ).

- E-step: for fixed θ, find the q(h|v) that maximizes the RHS.
- M-step: for fixed q, find the θ that maximizes the RHS.

If each step maximizes the RHS, it is also improving the LHS (technically, it is never lowering the LHS).
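To see why equality holds exactly when q matches the posterior, note the standard decomposition (not spelled out on the slide, but implicit in the proof sketch):

$$\log p(v \mid \theta) \;=\; \underbrace{\sum_h q(h \mid v) \log \frac{p(v, h \mid \theta)}{q(h \mid v)}}_{\text{lower bound (RHS)}} \;+\; \underbrace{\mathrm{KL}\big(q(h \mid v)\,\big\|\,p(h \mid v, \theta)\big)}_{\geq\, 0}$$

Since the KL term is nonnegative and zero exactly when q(h|v) = p(h|v,θ), the E-step closes the gap by setting q to the current posterior.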
Barber Example
Contours are of the lower bound. Note the alternating steps along the θ and q axes:
- the steps are not gradient steps and can be large
- the choice of initial θ determines which local likelihood optimum is found
Clustering: K-Means Vs. EM
K-means:
1. choose some initial values of μk
2. assign each data point to the closest cluster
3. recalculate the μk to be the means of the sets of points assigned to cluster k
4. return to step 2 and iterate
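A minimal NumPy sketch of these four steps on toy 2-D data (the data and initialization are my illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two blobs in 2-D
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(3, 0.5, (50, 2))])

K = 2
mu = data[rng.choice(len(data), K, replace=False)]   # step 1: initial means

for _ in range(20):                                   # step 4: iterate
    # step 2: assign each point to the closest cluster mean
    dists = np.linalg.norm(data[:, None, :] - mu[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # step 3: recalculate each mu_k as the mean of its assigned points
    mu = np.array([data[assign == k].mean(axis=0) for k in range(K)])

print("cluster means:\n", mu)
```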
K-means Clustering

[Figures: successive iterations of K-means on a 2-D data set, from C. Bishop, Pattern Recognition and Machine Learning]
Clustering: K-Means Vs. EM
EM:
1. choose some initial values of μk
2. probabilistically assign each data point to the clusters, with weights P(Z = k | μ)
3. recalculate the μk to be the weighted means of the points, weighted by P(Z = k | μ)
4. return to step 2 and iterate
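A minimal sketch of this soft-assignment loop for a two-component, 1-D Gaussian mixture with fixed unit variances (the data and simplifications are my illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy 1-D data drawn from two Gaussians
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

K = 2
mu = rng.choice(data, K, replace=False)   # step 1: initial means
pi = np.full(K, 1.0 / K)                  # mixing proportions

for _ in range(50):                       # step 4: iterate
    # step 2: responsibilities P(Z = k | x, mu) (E-step)
    lik = pi * np.exp(-0.5 * (data[:, None] - mu) ** 2)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # step 3: weighted means, plus updated mixing proportions (M-step)
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

print("means:", mu, "mixing proportions:", pi)
```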
EM for Gaussian Mixtures

[Figures: successive EM iterations fitting a Gaussian mixture]
Variational Bayes
Generalization of EM:
- also deals with missing data and hidden variables
- produces a posterior on the parameters, not just the ML solution

Basic (0th-order) idea: do EM to obtain estimates of p(θ) rather than of θ directly.
Variational Bayes
Assume a factorized approximation of the joint hidden-variable and parameter posterior:

$$p(H, \theta \mid V) \approx q(H)\, q(\theta)$$

Find the marginals q(H) and q(θ) that make this approximation as close as possible.

Advantage? Bayesian Occam's razor: a vaguely specified parameter is a simpler model -> reduces overfitting.
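For reference, "finding the marginals" is done with the standard mean-field updates, which alternate like EM's two steps (a general result stated here for completeness, not shown on the slide):

$$q(H) \propto \exp\Big(\mathbb{E}_{q(\theta)}\big[\log p(V, H, \theta)\big]\Big), \qquad q(\theta) \propto \exp\Big(\mathbb{E}_{q(H)}\big[\log p(V, H, \theta)\big]\Big)$$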
Gradient Methods
Useful for continuous parameters θ. Make small incremental steps to maximize the likelihood.

Gradient update: $\theta \leftarrow \theta + \eta\, \nabla_\theta \log p(D \mid \theta)$

With hidden variables, the gradient is computed by swapping the derivative and the marginalization over h, giving an expected complete-data gradient:

$$\nabla_\theta \log p(v \mid \theta) = \mathbb{E}_{p(h \mid v, \theta)}\big[\nabla_\theta \log p(v, h \mid \theta)\big]$$
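A minimal sketch of gradient ascent with this swapped gradient, fitting the two means of a 1-D mixture (the data, fixed variances, and step size are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

mu = np.array([-0.5, 0.5])  # parameters: component means (sigma = 1, pi = 1/2)
eta = 0.01                  # step size

for _ in range(200):
    # responsibilities p(h = k | x, mu)
    lik = np.exp(-0.5 * (data[:, None] - mu) ** 2)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # swapped gradient: E_{p(h|x,mu)}[ d/dmu log p(x, h | mu) ] = resp * (x - mu)
    grad = (resp * (data[:, None] - mu)).sum(axis=0)
    mu = mu + eta * grad    # gradient update

print("learned means:", mu)
```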
All Learning Methods Apply To Arbitrary Local Distribution Functions

The local distribution function performs either probabilistic classification (discrete RVs) or probabilistic regression (continuous RVs).

Complete flexibility in specifying the local distribution function:
- analytical function (e.g., homework 5)
- look-up table
- logistic regression
- neural net
- etc.
Summary Of Learning Section
- Given model structure and probabilities, inferring latent variables
- Given model structure, learning model probabilities
  - complete data
  - missing data
- Learning model structure
Learning Model Structure
Learning Structure and Parameters
The principle:
- Treat the network structure, Sh, as a discrete RV.
- Calculate the structure posterior p(Sh | D).
- Integrate over the uncertainty in structure to predict.

The practice:
- Computing the marginal likelihood, p(D | Sh), can be difficult.
- Learning structure can be impractical due to the large number of hypotheses (more than exponential in the number of nodes; the sketch below makes the count concrete).
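To see how fast the hypothesis space grows, Robinson's recurrence counts the labeled DAGs on n nodes (a standard combinatorial result; the code is my illustration):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Robinson's recurrence for the number of labeled DAGs on n nodes."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in (3, 5, 10):
    print(n, num_dags(n))
# 3 -> 25; 5 -> 29,281; 10 -> ~4.2 * 10^18: super-exponential growth
```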
[Figure; source: www.bayesnets.com]
Approach to Structure Learning
- Model selection: find a good model, and treat it as the correct model.
- Selective model averaging: select a manageable number of candidate models and pretend that these models are exhaustive.

Experimentally, both of these approaches produce good results (i.e., good generalization).
SLIDES STOLEN FROM DAVID HECKERMAN
Interpretation of Marginal Likelihood
Using the chain rule for probabilities:

$$p(D \mid S^h) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1}, S^h)$$

Maximizing the marginal likelihood also maximizes sequential prediction ability!

Relation to leave-one-out cross-validation. Problems with cross-validation:
- it can overfit the data, possibly because of interchanges (each item is used for training and for testing each other item)
- it has a hard time dealing with temporal sequence data
Coin Example
For a single coin with a Beta prior, the marginal likelihood has the closed form

$$p(D \mid S^h) = \frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h + \alpha_t + N)} \cdot \frac{\Gamma(\alpha_h + \#h)\, \Gamma(\alpha_t + \#t)}{\Gamma(\alpha_h)\, \Gamma(\alpha_t)}, \qquad N = \#h + \#t$$

When the coin has parent variables, αh, αt, #h, and #t are all indexed by these parent conditions.
Multiplying the per-family terms gives the general closed form:

$$p(D \mid S^h) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where n is the # of nodes, q_i the # of parent configurations of node i, r_i the # of node states, $\alpha_{ij} = \sum_k \alpha_{ijk}$, and $N_{ij} = \sum_k N_{ijk}$.
Computation of Marginal Likelihood
Efficient closed-form solution if:
- no missing data (including no hidden variables)
- mutual independence of the parameters θ
- local distribution functions from the exponential family (binomial, Poisson, gamma, Gaussian, etc.)
- conjugate priors
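Under exactly these conditions (complete multinomial data, independent parameters, Dirichlet conjugate priors), the closed form is the product of Gamma-function ratios shown above. A minimal sketch for a single node, with toy counts of my own:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_node(counts: np.ndarray, alpha: np.ndarray) -> float:
    """Closed-form log marginal likelihood contribution of one node.

    counts[j, k]: times the node took value k under parent configuration j.
    alpha[j, k]:  the matching Dirichlet hyperparameters.
    """
    a_j, n_j = alpha.sum(axis=1), counts.sum(axis=1)
    return float((gammaln(a_j) - gammaln(a_j + n_j)).sum()
                 + (gammaln(alpha + counts) - gammaln(alpha)).sum())

# Toy example: binary node with one binary parent (2 parent configs x 2 states)
counts = np.array([[12.0, 3.0], [4.0, 9.0]])
alpha = np.ones((2, 2))   # uniform Dirichlet(1, 1) priors
print("log p(D_i | S):", log_marginal_node(counts, alpha))
```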
Computation of Marginal Likelihood
Approximation techniques must be used otherwise. E.g., for missing data, we can use the Gibbs sampling or Gaussian approximation described earlier.

Bayes' theorem, rearranged, gives $p(D \mid S^h) = \dfrac{p(D \mid \theta, S^h)\, p(\theta \mid S^h)}{p(\theta \mid D, S^h)}$ (valid at any θ):
1. Evaluate the numerator directly; estimate the denominator using Gibbs sampling.
2. For large amounts of data, the numerator can be approximated by a multivariate Gaussian.
Structure Priors
- Hypothesis equivalence: identify the equivalence class of a given network structure.
- All possible structures equally likely.
- Partial specification: required and prohibited arcs (based on causal knowledge).
- Ordering of variables + independence assumptions:
  - ordering based on, e.g., temporal precedence
  - presence or absence of the arcs are mutually independent -> n(n-1)/2 priors
- p(m) ~ similarity(m, prior belief net)
Parameter Priors
- all uniform: Beta(1,1)
- use a prior belief net: parameters depend only on the local structure
Model Search
Finding the belief net structure with highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995)
Sequential search:
- add, remove, or reverse arcs
- ensure no directed cycles
- efficient, in that changes to arcs affect only some components of p(D|M)

Heuristic methods (a minimal sketch follows):
- greedy
- greedy with restarts
- MCMC / simulated annealing
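A minimal sketch of greedy sequential search over arc additions and removals (reversals omitted for brevity), scoring with the closed-form family score from earlier; the toy data and move set are my assumptions:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
# Toy binary data on 3 variables; X2 noisily copies X0
data = rng.integers(0, 2, size=(200, 3))
data[:, 2] = data[:, 0] ^ (rng.random(200) < 0.1)
n = data.shape[1]

def family_score(child, parents):
    """Closed-form log marginal score of one family, Dirichlet(1,...) priors."""
    idx = (data[:, parents] @ (2 ** np.arange(len(parents)))
           if parents else np.zeros(len(data), dtype=int))
    counts = np.zeros((2 ** len(parents), 2))
    np.add.at(counts, (idx, data[:, child]), 1.0)
    a = np.ones_like(counts)
    return ((gammaln(a.sum(1)) - gammaln(a.sum(1) + counts.sum(1))).sum()
            + (gammaln(a + counts) - gammaln(a)).sum())

def acyclic(parents):
    seen = set()
    while len(seen) < n:  # Kahn-style check: peel off parent-satisfied nodes
        free = [i for i in range(n)
                if i not in seen and all(p in seen for p in parents[i])]
        if not free:
            return False  # no free node left -> directed cycle
        seen.update(free)
    return True

parents = {i: [] for i in range(n)}  # start from the empty graph
best = sum(family_score(i, parents[i]) for i in range(n))
improved = True
while improved:  # greedy: take any improving single-arc change until none helps
    improved = False
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            trial = {i: list(ps) for i, ps in parents.items()}
            if u in trial[v]:
                trial[v].remove(u)   # toggle the arc u -> v
            else:
                trial[v].append(u)
            if not acyclic(trial):
                continue
            s = sum(family_score(i, trial[i]) for i in range(n))
            if s > best:
                best, parents, improved = s, trial, True
print("learned parent sets:", parents, "log score:", best)
```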
[Figure: the two most likely structures]
[Figure annotation: 2×10^10]