Parameter learning in Markov nets

Graphical Models – 10708

Carlos Guestrin

Carnegie Mellon University

November 17th, 2008

Readings:K&F: 19.1, 19.2, 19.3, 19.4

10-708 – Carlos Guestrin 2006-2008

10-708 – Carlos Guestrin 2006-2008 2

Learning Parameters of a BN

Log likelihood decomposes:

Learn each CPT independently Use counts

10-708 – Carlos Guestrin 2006-2008 3

Log Likelihood for MN

Log likelihood of the data:

Difficulty

SATGrade

HappyJob

Coherence

Letter

Intelligence

10-708 – Carlos Guestrin 2006-2008 4

Log Likelihood doesn’t decompose for MNs Log likelihood:

A concave problem Can find global optimum!!

Term log Z doesn’t decompose!!

Difficulty

SATGrade

HappyJob

Coherence

Letter

Intelligence

10-708 – Carlos Guestrin 2006-2008 5

Derivative of Log Likelihood for MNs

Difficulty

SATGrade

HappyJob

Coherence

Letter

Intelligence

10-708 – Carlos Guestrin 2006-2008 6

Derivative of Log Likelihood for MNs 2

Difficulty

SATGrade

HappyJob

Coherence

Letter

Intelligence

10-708 – Carlos Guestrin 2006-2008 7

Derivative of Log Likelihood for MNs

Difficulty

SATGrade

HappyJob

Coherence

Letter

Intelligence Derivative:

Computing derivative requires inference:

Can optimize using gradient ascent Common approach Conjugate gradient, Newton’s method,…

Let’s also look at a simpler solution

10-708 – Carlos Guestrin 2006-2008 8

Iterative Proportional Fitting (IPF)

Difficulty

SATGrade

HappyJob

Coherence

Letter

Intelligence

Setting derivative to zero:

Fixed point equation:

Iterate and converge to optimal parameters Each iteration, must compute:

10-708 – Carlos Guestrin 2006-2008 9

Log-linear Markov network(most common representation)

Feature is some function [D] for some subset of variables D e.g., indicator function

Log-linear model over a Markov network H: a set of features 1[D1],…, k[Dk]

each Di is a subset of a clique in H two ’s can be over the same variables

a set of weights w1,…,wk

usually learned from data

10-708 – Carlos Guestrin 2006-2008 10

Learning params for log linear models – Gradient Ascent

Log-likelihood of data:

Compute derivative & optimize usually with conjugate gradient ascent

Derivative of log-likelihood 1 – log-linear models

10-708 – Carlos Guestrin 2006-2008 11

Derivative of log-likelihood 2 – log-linear models

10-708 – Carlos Guestrin 2006-2008 12

Learning log-linear models with gradient ascent

Gradient:

Requires one inference computation per

Theorem: w is maximum likelihood solution iff

Usually, must regularize E.g., L2 regularization on parameters

10-708 – Carlos Guestrin 2006-2008 13

10-708 – Carlos Guestrin 2006-2008 14

What you need to know about learning MN parameters?

BN parameter learning easy MN parameter learning doesn’t decompose!

Learning requires inference!

Apply gradient ascent or IPF iterations to obtain optimal parameters

Conditional Random Fields

Carlos Guestrin

November 17th, 2008

Readings:K&F: 19.3.2

Generative v. Discriminative classifiers – A review

Want to Learn: h:X Y X – features Y – target classes

Bayes optimal classifier – P(Y|X) Generative classifier, e.g., Naïve Bayes:

Assume some functional form for P(X|Y), P(Y) Estimate parameters of P(X|Y), P(Y) directly from training data Use Bayes rule to calculate P(Y|X= x) This is a ‘generative’ model

Indirect computation of P(Y|X) through Bayes rule But, can generate a sample of the data, P(X) = y P(y) P(X|y)

Discriminative classifiers, e.g., Logistic Regression: Assume some functional form for P(Y|X) Estimate parameters of P(Y|X) directly from training data This is the ‘discriminative’ model

Directly learn P(Y|X) But cannot obtain a sample of the data, because P(X) is not available

1610-708 – Carlos Guestrin 2006-2008

10-708 – Carlos Guestrin 2006-2008 17

Log-linear CRFs(most common representation)

Graph H: only over hidden vars Y1,..,Yn

No assumptions about dependency on observed vars X You must always observe all of X

Feature is some function [D] for some subset of variables D e.g., indicator function,

Log-linear model over a CRF H: a set of features 1[D1],…, k[Dk]

each Di is a subset of a clique in H two ’s can be over the same variables

a set of weights w1,…,wk

usually learned from data

10-708 – Carlos Guestrin 2006-2008 18

Learning params for log linear CRFs – Gradient Ascent

Log-likelihood of data:

Compute derivative & optimize usually with conjugate gradient ascent

Learning log-linear CRFs with gradient ascent

Gradient:

Requires one inference computation per

Usually, must regularize E.g., L2 regularization on parameters

10-708 – Carlos Guestrin 2006-2008 19

10-708 – Carlos Guestrin 2006-2008 20

What you need to know about CRFs

Discriminative learning of graphical models Fewer assumptions about distribution often

performs better than “similar” MN

Gradient computation requires inference per datapoint Can be really slow!!

EM for BNs

Carlos Guestrin

November 17th, 2008

Readings: 18.1, 18.2

10-708 – Carlos Guestrin 2006-2008 22

Thus far, fully supervised learning

We have assumed fully supervised learning:

Many real problems have missing data:

10-708 – Carlos Guestrin 2006-2008 23

The general learning problem with missing data

Marginal likelihood – x is observed, z is missing:

10-708 – Carlos Guestrin 2006-2008 24

E-step

x is observed, z is missing Compute probability of missing data given current choice of

Q(z|x(j)) for each x(j) e.g., probability computed during classification step corresponds to “classification step” in K-means

10-708 – Carlos Guestrin 2006-2008 25

Jensen’s inequality

Theorem: log z P(z) f(z) ≥ z P(z) log f(z)

10-708 – Carlos Guestrin 2006-2008 26

Applying Jensen’s inequality

Use: log z P(z) f(z) ≥ z P(z) log f(z)

10-708 – Carlos Guestrin 2006-2008 27

The M-step maximizes lower bound on weighted data

Lower bound from Jensen’s:

10-708 – Carlos Guestrin 2006-2008 28

The M-step

Maximization step:

Use expected counts instead of counts: If learning requires Count(x,z) Use EQ(t+1)[Count(x,z)]

10-708 – Carlos Guestrin 2006-2008 29

Convergence of EM

Define potential function F(,Q):

EM corresponds to coordinate ascent on F Thus, maximizes lower bound on marginal log likelihood As seen in Machine Learning class last semester

Parameter learning in Markov nets

Documents

Transcript of Parameter learning in Markov nets

Class Separation and Parameter Estimation with Neural Nets for the XEUS Project

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 58, NO. 8, …ee.usc.edu/stochastic-nets/docs/markov-nets.pdf · gramming and Markov Decision theory. Approximate dynamic programming

Discriminative Structure and Parameter Learning for Markov Logic Networks

Hidden Markov Models - Northwestern Engineeringddowney/courses/474_Fall2017/hmms.pdf · Hidden Markov Models as Bayes Nets ... 1 according to prob e 1 (x 1) 3. Go to state ...

DREAM D : an adaptive Markov Chain Monte Carlo …(D): an adaptive Markov Chain Monte Carlo simulation algorithm to solve discrete, noncontinuous, and combinatorial posterior parameter

Technische Hochschule Brandenburg Modulkatalog des ... · auxiliary processes (Finite automa, Petri nets, Markov chains) • Object-oriented methods for process modelling • Current

Undirected Probabilistic Graphical Models (Markov Nets)

Lecture notes on Regression: Markov Chain Monte Carlo (MCMC) · PDF fileFigure 5: Markov Chain Monte Carlo analysis for one ﬁtting parameter. There are two phases for each walker

422 big picture: Where are we? Query Planning DeterministicStochastic Value Iteration Approx. Inference Full Resolution SAT Logics Belief Nets Markov Decision.

Markov Nets - Carnegie Mellon University

Markov Nets

Intro to Machine Learning Parameter Estimation for Bayes Nets Naïve Bayes.

Parameter Estimation for Hidden Markov Models with Intractable ...

Planning: Heuristics and CSP Planningcarenini/TEACHING/CPSC322-17S/...Belief Nets Vars + Constraints Decision Nets Markov Processes Var. Elimination Static Sequential Representation

Alpha Nets - A Recurrent Neural Network Architecture With a Hidden Markov Model (HMM) Interpretation

Parameter Estimation for Hidden Markov models · Outline Introduction Hidden Markov Models Approximate Bayesian Computation Noisy ABC Summary Introduction II will describe a theoretical

Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015.

DSoS - Newcastle Universityresearch.cs.ncl.ac.uk/cabernet/formal modeling techniques (e.g., process algebra, Markov chains, Petri nets, queuing nets) from developers. As detailed in

Markov chain Monte Carlo algorithms for SDE parameter ...

ML Parameter Estimation for Markov Random Fields, with ...bouman/publications/pdf/tip10.pdf · ML Parameter Estimation for Markov Random Fields, with Applications to Bayesian Tomography