P(hi) is called the hypothesis prior – Nothing special about "learning", just vanilla probabilistic inference

Page 1:

P(hi) is called the hypothesis prior.

Nothing special about "learning" – just vanilla probabilistic inference.

Page 2:

Where is the hypothesis prior?

Page 3:

How did this prediction come about? Which hypothesis did we use?

Page 4:

The analogy with diagnosis

Medical diagnosis
• Given the symptoms of a patient, predict whether she will have other symptoms (such as death…)
• Can try predicting directly from symptoms (this is what we did before the advent of medicine)
• But we normally assume that diseases cause symptoms. Thus we want to first figure out the disease and then predict the other symptoms
• Diseases have prior probabilities (in fact, the "ignored prior" fallacy is the main reason for internet-induced hypochondria)
• Given the symptoms, we compute the posterior over the diseases, and then use that to predict other symptoms

Full Bayesian learning
• Given training data, predict test data
• Can try predicting test data directly from training data (e.g. k-NN)
• But we normally assume that hypotheses explain the data. Thus we want to first figure out the hypotheses causing the data and then use them to predict test data
• Hypotheses have prior probabilities (as to how likely they are, independent of the data being seen right now)
• Given the data, we compute the posterior over the hypotheses, and then use that to predict test data (a minimal sketch of this prediction follows below)
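To make the "vanilla probabilistic inference" view concrete, here is a minimal sketch (not from the slides) of full Bayesian prediction over a small discrete hypothesis space. The candy-bag hypotheses and their priors below are illustrative assumptions in the spirit of the classic AIMA example.

```python
# Minimal sketch of full Bayesian prediction over a discrete hypothesis space.
# Hypotheses h_i are possible proportions of cherry candies in a bag; the
# priors below are illustrative assumptions, not values from the slides.
hypotheses = {          # h_i : (P(h_i), P(cherry | h_i))
    "h1": (0.1, 1.00),
    "h2": (0.2, 0.75),
    "h3": (0.4, 0.50),
    "h4": (0.2, 0.25),
    "h5": (0.1, 0.00),
}

def posterior(data):
    """P(h_i | d) is proportional to P(d | h_i) P(h_i) for i.i.d. observations."""
    unnorm = {}
    for h, (prior, p_cherry) in hypotheses.items():
        like = 1.0
        for obs in data:                       # obs is "cherry" or "lime"
            like *= p_cherry if obs == "cherry" else (1.0 - p_cherry)
        unnorm[h] = prior * like
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

def predict_cherry(data):
    """P(next = cherry | d) = sum_i P(cherry | h_i) P(h_i | d)."""
    post = posterior(data)
    return sum(post[h] * hypotheses[h][1] for h in hypotheses)

print(predict_cherry(["lime"] * 5))   # prediction after seeing 5 limes
```

Note that the prediction averages over all hypotheses weighted by their posteriors; no single hypothesis is ever "chosen", which is exactly the point of the next two slides.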

Page 5:


Page 6:

Density Estimation (as the general objective of Statistical Learning)

• Given data D whose instances are made up of attributes x that are distributed according to P*(x), we want to learn an estimate P' of P* such that the distance between P* and P' is minimized
• Once we have P', we can use it for (i) prediction, (ii) completion, (iii) generation
• We need to decide how to represent P'
  – We shall assume graphical models: Bayes networks or Markov networks
• Often x can be partitioned into X (the "input attributes") and Y (the "output attributes")
  – In such a case, rather than learn P*(x), we might want to learn P*(Y|X)
  – If we do this, then we are doing discriminative learning (as against generative learning)

Page 7:

P(hi) is called the hypothesis prior.

Nothing special about "learning" – just vanilla probabilistic inference.

Note that for density estimation, H is the density and P(H) is a density over densities! For the parametric case, P(H) is a distribution over the parameters; e.g., if we believe the hypothesis is a normal distribution, then P(H) is a distribution over the mean and variance of the normal. For the non-parametric case things are much harder, since you need P(H) to be a distribution over infinitely many parameters (the recent advances on Gaussian processes are aimed at this).

Two problems: 1. we need to represent P(H) during learning; 2. we need to reason with P(H) during inference. MAP does away with 2, while MLE does away with both 1 and 2.

Page 8:

How many

Why should P(hi) be low for complex hypotheses? – connection to the MDL principle

Equivalently, minimize −log P(d|hi) − log P(hi), where −log P(hi) is the number of bits required to specify hi, and −log P(d|hi) is the number of additional bits required to specify d given hi (the equivalence is expanded below).

We need to represent both P(x|h) and P(h|D).

We can no longer quantify the variance of our prediction: what if hMAP is just a little more likely than the next best hypothesis, and they predict different results?
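A short expansion of the equivalence stated above (standard algebra, not transcribed from the slide):

```latex
h_{MAP} = \arg\max_{h_i} P(h_i \mid d)
        = \arg\max_{h_i} P(d \mid h_i)\, P(h_i)
        = \arg\min_{h_i} \big[ -\log_2 P(d \mid h_i) - \log_2 P(h_i) \big]
```

The two minimized terms are exactly "bits to encode the data given the hypothesis" plus "bits to encode the hypothesis", which is the MDL criterion: a complex hypothesis is only worth it if it saves at least as many data-encoding bits as it costs to describe.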

Page 9:

– because "statisticians" distrust priors (and want the data to speak for itself)

When will ML hit a roadblock? Small data.

Not only can we not quantify the variance of our prediction, but we can also fall prey to overfitting. For example, the likelihood of the data will never decrease if we add more links into the Bayes network, so we will wind up learning fully connected networks!

Page 10:

Bayesians vs. Frequentists (The Religious Wars)

Frequentist Learning
• Probabilities are "asymptotic frequencies"
• There is a TRUE hypothesis
  – So the hypothesis is not a random variable, and can't have a distribution
• Having a prior is like "cheating"; being prejudiced
  – Let the data speak for itself!

Bayesian Learning
• Probabilities are "degrees of belief"
• The hypothesis is a random variable
  – So it can have a prior and a posterior
• The hypothesis prior is the agent's belief about which hypotheses are more vs. less likely

Probability that Einstein drank a cup of tea at 4:13pm on Feb 26, 1920? The Bayesian: "I believe it is 0.4." The frequentist: "Can't say!"

"God doesn't play dice with the universe." "Stop telling God what to do!"

Both camps agree on MLE (but for different reasons). The Bayesian: "MLE is just a stripped-down special case of Bayesian learning for large data." The frequentist: "Good that you got rid of the hypothesis prior. Just don't say P(D|h); it is P(D; h), you know…"

Page 11:

Should AI also distrust priors? Priors can encode background knowledge. (There is evidence that the human brain uses priors: http://web.mit.edu/cocosci/Papers/significance.pdf)

Page 12:

Density Estimation (as the general objective of Statistical Learning)

• Given data D whose instances are made up of attributes x that are distributed according to P*(x), we want to learn an estimate P' of P* such that the distance between P* and P' is minimized
• Once we have P', we can use it for (i) prediction, (ii) completion, (iii) generation
• We need to decide how to represent P'
  – We shall assume graphical models: Bayes networks or Markov networks
• Often x can be partitioned into X (the "input attributes") and Y (the "output attributes")
  – In such a case, rather than learn P*(x), we might want to learn P*(Y|X)
  – If we do this, then we are doing discriminative learning (as against generative learning)

Page 13:

Generative vs. Discriminative

Generative Learning
• More general: after all, if you have P(Y,X) you can predict Y given X as well as do other inferences
  – You can predict jokes as well as make them up (or predict spam mails as well as generate them)
• In trying to learn P(Y,X), we are often forced to make many independence assumptions both in Y and X, and these may be wrong
  – Interestingly, this type of high bias can help generative techniques when there is too little data

Discriminative Learning
• More to the point: if what you want is P(Y|X), why bother with P(Y,X), which is after all P(Y|X)·P(X) and thus also models the dependencies among the X's?
• Since we don't need to model dependencies among X, we don't need to make any independence assumptions among them. So we can merrily use highly correlated features
  – Interestingly, this freedom can hurt discriminative learners when there is too little data (as overfitting is easy)

P(y)P(x|y) = P(y,x) = P(x)P(y|x)   (a small illustrative comparison follows below)
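The comparison below is an illustration only, not from the slides: it uses scikit-learn's Gaussian naive Bayes as a stand-in for a generative learner (it models P(X|Y)P(Y)) and logistic regression as a stand-in for a discriminative learner (it models P(Y|X) directly), on a synthetic dataset. The library and dataset choices are assumptions.

```python
# Generative (models P(X|Y)P(Y)) vs. discriminative (models P(Y|X)) classifiers
# on synthetic data; purely illustrative stand-ins for the two families.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gen = GaussianNB().fit(X_tr, y_tr)                         # learns P(X|Y) and P(Y)
disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # learns P(Y|X)

print("generative accuracy:    ", gen.score(X_te, y_te))
print("discriminative accuracy:", disc.score(X_te, y_te))
```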

Page 14:

Dimensions of Statistical Learning Tasks

Model constraints
• Type of network being learned
  – Bayes network vs. Markov network
• Topology given; CPTs to be learned
• Only relevant attributes are given; need to learn the topology as well as the CPTs
  – The tricky part for MLE is that increasing the connectivity of a network cannot reduce likelihood
• We don't know what the relevant attributes are

Observability of data
• Complete data
  – Each data instance gives the values of all of the attributes
• Incomplete data
  – Some of the data instances might be missing the values of some of the attributes
• Hidden attributes (variables)
  – None of the data instances have values for some of the attributes (which often correspond to "intermediate" concepts that help improve the sparsity of the network, e.g. "syndromes" which connect symptoms to diseases, or class variables in mixture models)

Sample complexity varies linearly with the number of parameters to be learned, and the number of parameters varies exponentially with the number of edges in the graphical model.

Philosophy of learning
• Bayesian: keep a distribution over hypotheses
• MAP: keep just the best hypothesis (the one that maximizes prior × likelihood)
• MLE: keep just the hypothesis that maximizes likelihood

Page 15:

Our Agenda

• We shall focus on density estimation tasks, and consider the generative case first
• We will focus on Bayes networks first (and, time permitting, Markov networks)
• We will focus on MLE learning first (and then full Bayesian learning)
• We will focus on complete data first, and then incomplete data and/or hidden variables

Page 16:
Page 17:

Steps in ML-based learning

1. Write down an expression for the likelihood of the data as a function of the parameter(s)
   – Assume an i.i.d. distribution
2. Write down the derivative of the log likelihood with respect to each parameter
3. Find the parameter values such that the derivatives are zero
   There are two ways this step can become complex:
   – Individual (partial) derivatives lead to non-linear functions (this depends on the type of distribution the parameters are controlling; binomial is a very easy case)
   – Individual (partial) derivatives involve more than one parameter (thus leading to simultaneous equations)

In general, we will need to use continuous function optimization techniques. One idea is to use gradient descent to find the point where the derivative goes to zero. But for gradient descent to find the global optimum, we need to know for sure that the function we are optimizing has a single optimum (this is why convex functions are important: if the negative log likelihood is convex, then gradient descent is guaranteed to find its global minimum). A worked single-parameter example follows below.
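A worked single-parameter example of steps 1-3 (the standard coin/thumbtack case, not transcribed from the slides): with h heads and t tails observed i.i.d.,

```latex
L(\theta) = \theta^{h}(1-\theta)^{t}, \qquad
\log L(\theta) = h\log\theta + t\log(1-\theta)

\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{t}{1-\theta} = 0
\;\Longrightarrow\; \hat{\theta}_{MLE} = \frac{h}{h+t}
```

This is the "very easy" binomial case mentioned above: a single parameter, a single equation, and a closed-form solution.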

Page 18:

Note that for us, data are 2-attribute tuples [Flavor, Wrapper]

Page 19:

There is no entanglement of parameters for complete data for Bayes nets with known topology and tabular CPTs. Specifically, each partial derivative will involve only one parameter, so you are solving single-variable equations rather than simultaneous equations. This doesn't hold for Markov nets; it also doesn't hold for Bayes nets whose CPDs induce direct parameter dependencies.

Page 20:

Celebrating the ease of learning for Bayes nets with complete data!

• So we just noted that if we know the topology of the Bayes net and we have complete data, then the parameters are un-entangled and can be learned separately, from just data counts (see the counting sketch below).
• Questions: How big a deal is this?
  – Can we have complete data?
  – Can we have known topology?
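A minimal sketch (not from the slides) of count-based MLE for tabular CPTs, given complete data and a known topology. The tiny two-node network and the data rows are made up; they only echo the [Flavor, Wrapper] tuples mentioned a couple of slides back.

```python
from collections import Counter

# Hypothetical complete data for a two-node network Flavor -> Wrapper.
data = [
    {"Flavor": "cherry", "Wrapper": "red"},
    {"Flavor": "cherry", "Wrapper": "red"},
    {"Flavor": "cherry", "Wrapper": "green"},
    {"Flavor": "lime",   "Wrapper": "green"},
    {"Flavor": "lime",   "Wrapper": "green"},
]
parents = {"Flavor": [], "Wrapper": ["Flavor"]}   # known topology

def mle_cpts(data, parents):
    """theta_{x|pa} = count(x, pa) / count(pa): each parameter depends only on its own counts."""
    cpts = {}
    for var, pa in parents.items():
        joint, marg = Counter(), Counter()
        for row in data:
            pa_val = tuple(row[p] for p in pa)
            joint[(row[var], pa_val)] += 1
            marg[pa_val] += 1
        cpts[var] = {k: v / marg[k[1]] for k, v in joint.items()}
    return cpts

print(mle_cpts(data, parents))
```

Because each conditional probability is just a ratio of counts, no optimization and no simultaneous equations are needed, which is exactly the un-entanglement being celebrated.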

Page 21:

Learning the parameters for Continuous Case (Gaussian Distribution)
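The slide's derivation is not in this transcript; the standard MLE result for N i.i.d. samples x_1, …, x_N from a Gaussian is:

```latex
\log L(\mu,\sigma) = -\frac{N}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{j=1}^{N}(x_j-\mu)^2

\hat{\mu}_{MLE} = \frac{1}{N}\sum_{j=1}^{N} x_j, \qquad
\hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{j=1}^{N}(x_j-\hat{\mu})^2
```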

Page 22:

Problems with ML and Bayesian Learning…

• ML-based learning is unable to take the size of the data into account (1/3 is treated the same as 1M/3M)
• We, however, tend to start with a prior, and are less willing to change the prior unless shown enough evidence
  – Bayesian learning can handle this

If a thumbtack came up heads once when you tossed it 3 times, what is the probability that it will come up heads the next time?

Now, a coin came up heads once when you tossed it three times. What do you think is the probability that it will come up heads next time?

How about if it came up heads 1 million times in 3 million trials?

Page 23:

Bayesian Learning (for coin toss…)

• Let q be the probability that the coin comes up heads
  – Each different value of q is a different hypothesis
  – So P(h), the hypothesis prior, can be specified by specifying P(q)
• Starting with a prior on q, we just need to compute the posterior
• Challenge: find a distribution over a continuous space that
  – can be represented compactly
  – and keeps its form upon update
• Example: uniform; but what if we have more information?
• Beta distributions
  – Think of a and b as the number of heads and tails you have seen prior to the start of this experiment
  – Update after seeing a head: Beta[a, b] becomes Beta[a+1, b] (the update is written out below)
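The update rule written out, using standard Beta/Bernoulli algebra consistent with the slide's a, b notation (α denotes the normalization constant):

```latex
\text{Beta}[a,b](q) = \alpha\, q^{a-1}(1-q)^{b-1}

P(q \mid \text{head}) \;\propto\; q \cdot q^{a-1}(1-q)^{b-1}
  \;=\; q^{a}(1-q)^{b-1} \;=\; \text{Beta}[a+1, b](q)
```

So the posterior stays in the Beta family, which is exactly the "keeps its form upon update" property demanded above.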

Page 24:

"Conjugate Prior"

• A prior distribution family Pc is considered a conjugate prior for a likelihood function family Pl if, starting with a hypothesis prior from Pc and seeing data with a likelihood from Pl, the posterior over the hypotheses will also be in Pc
  – Beta distributions are conjugate priors for Bernoulli (binomial) likelihood distributions
  – Dirichlet distributions are conjugate priors for multinomial likelihood distributions
  – Normal-Wishart distributions are conjugate priors for Gaussian likelihood distributions

Page 25:

Bayesian Prediction

• So suppose we started with Beta[a,b] as the prior
  – The probability of heads will be a/(a+b)
    (this is what you get if you evaluate ∫ P(heads|q) P(q) dq)
• Now, after seeing Dh heads and Dt tails, the posterior will be Beta[a+Dh, b+Dt]
  – The probability of heads now will be (a+Dh)/(a+Dh+b+Dt)
• So, compared to the ML estimate, you just add a + b "virtual samples"
  – This is what you did with Laplace smoothing

Laplace smoothing is a backdoor way of making ML predictions be in line with full Bayesian learning… (a small numerical comparison follows below)
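A small numerical comparison (the counts are made up; a = b = 1 plays the role of Laplace smoothing):

```python
# Illustrative comparison: MLE vs. Beta-posterior prediction of P(heads).
def mle(heads, tails):
    return heads / (heads + tails)

def beta_predict(heads, tails, a=1, b=1):
    # Posterior mean of Beta[a + heads, b + tails]; a = b = 1 is Laplace smoothing.
    return (a + heads) / (a + heads + b + tails)

for h, t in [(1, 2), (1_000_000, 2_000_000)]:
    print(h, t, "MLE:", round(mle(h, t), 4),
          "Bayes/Laplace:", round(beta_predict(h, t), 4))
# The MLE is 1/3 in both cases; the Bayesian estimate only approaches 1/3
# as the data grows, which is the size-of-data sensitivity discussed two slides back.
```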

Page 26:

Multi-parameter case (assume parameter independence)

Each example is "inserted" into the Bayes network as evidence, and the posterior over the parameters is queried.

The wrench in the works is that the size of the network grows with the number of examples(!), and we have continuous quantities.

For table CPDs, the prior should be a Dirichlet distribution.

Notice that we assumed parameter independence.

Page 27:

Priors and Background Knowledge

• Hypothesis priors can be seen as providing background knowledge
• Background knowledge is also helpful in "logical learning"
  – Sao Paulo airport example

Page 28:

Case Study: Learning Bayes Net models for relational database tables

• Consider a relational table in an RDBMS with n attributes
  – Say an employee table giving the age, position, salary, etc. of each employee
• Suppose we want to learn the generative model underlying it
• Suppose we were able to hypothesize the topology
  – We might be able to do so if (a) we know the domain or (b) we know some of the causal dependencies in the data
• If the relational table is "complete" – i.e., every tuple gives the value for every attribute (which is the standard RDBMS model) – then learning the parameters of this network is easy!
• Now, suppose the table is slightly "dirty", in that there are tuples with missing values for some of the attributes
  – Say, some of the employee tuples are missing age information, others are missing salary information, etc.
• If only a small percentage of the tuples are incomplete, then we can
  – 1. Learn the model using the complete tuples
  – 2. Predict the null values in the dirty tuples using the learned model
• But if a non-trivial percentage of the tuples are incomplete, then we might want to continue from step 2 above:
  – 3. Now that we have "completed" all the incomplete tuples, we have fully complete data. Learn the model with this completed data, and see if it is any better
    • A model is better if it provides a higher likelihood for the observed data
• But why stop here? Continue and use the new model to re-predict the missing values, and iterate
  – This is the basic idea of the EM (Expectation Maximization) algorithm (a minimal sketch of the loop follows below)
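A minimal sketch (not from the slides) of the iterate-and-complete loop just described, for a made-up two-attribute table Position -> Salary in which some rows are missing Position. The soft completion in the E-step uses the current model's posterior instead of a hard prediction.

```python
# Minimal EM sketch for a table with a missing attribute; values are made up.
rows = [("phd", "high"), ("phd", "high"), ("nonphd", "low"),
        ("nonphd", "low"), (None, "high"), (None, "low")]
positions, salaries = ["phd", "nonphd"], ["high", "low"]

# 0. Initialize parameters (here: uniform).
p_pos = {a: 0.5 for a in positions}
p_sal = {a: {s: 0.5 for s in salaries} for a in positions}

for _ in range(20):
    # E-step: expected counts, filling missing Position values with P(pos | salary).
    c_pos = {a: 1e-9 for a in positions}
    c_joint = {a: {s: 1e-9 for s in salaries} for a in positions}
    for pos, sal in rows:
        if pos is not None:
            weights = {a: 1.0 if a == pos else 0.0 for a in positions}
        else:
            unnorm = {a: p_pos[a] * p_sal[a][sal] for a in positions}
            z = sum(unnorm.values())
            weights = {a: v / z for a, v in unnorm.items()}
        for a, w in weights.items():
            c_pos[a] += w
            c_joint[a][sal] += w
    # M-step: re-estimate the parameters from the expected counts.
    total = sum(c_pos.values())
    p_pos = {a: c_pos[a] / total for a in positions}
    p_sal = {a: {s: c_joint[a][s] / c_pos[a] for s in salaries} for a in positions}

print(p_pos, p_sal)
```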

Page 29:

What if the best generative model contains attributes that are not mentioned in the table?

• In the previous relational table scenario, we assumed that some of the tuples are missing some of the attribute values.
• What if all tuples are missing some attribute values?
  – E.g., the educational level of the employee can be an attribute that is missing from the current table.
  – This is like having an attribute column whose value is not known for any of the tuples.
  – Why would we do it? Can we still use EM?
• Can we still use EM?
  – Surprisingly, it turns out yes. In the earlier scenario, we used the complete tuples for setting up the initial model, but then used it to complete the data, and looped.
  – There is no reason why we must initialize using complete data. We can initialize the model (parameters) randomly, and still do the EM looping!
• But why would we do it?
  – Given a complete relational table, such as the employee one, why would we start hypothesizing hidden attributes?
  – Because the right hypothesis about the hidden attribute can significantly reduce the number of parameters.
  – For example, the educational level of the employee might cluster employees into "PhD" folks (who presumably have high salaries, interesting positions, and mature ages) and "non-PhD" folks (who presumably have low salaries, green-behind-the-ears ages, and assembly-programming kinds of jobs), and within each cluster the distributions of the attribute values are different (as described above).
• So,
  – Hypothesizing hidden attributes reduces the parameters to be estimated, but makes their estimation hard.
  – Not hypothesizing them allows us to deal with complete data, but might require exponentially many parameters to be learned (from the same data), making the parameters, while easy to estimate, pretty worthless in terms of accuracy.

Page 30:

Why does EM work? Logs of sums don't have easy closed-form optima; using Jensen's inequality, we can instead work with a sum of logs, which is a lower bound.

Ft(J) is an arbitrary probability distribution over J. By Jensen's inequality, the log-likelihood is bounded below by an expectation of logs under Ft (the bound is written out below).

The "size of the step" is determined adaptively by where the max of the lower bound is.
  – In contrast, gradient descent requires a step-size parameter.
  – Newton-Raphson requires the second derivative.
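The bound referred to above, written out in standard form (the notation follows the slide's Ft(J), an arbitrary distribution over the hidden variables J):

```latex
\log P(D \mid \theta)
  = \log \sum_{J} P(D, J \mid \theta)
  = \log \sum_{J} F_t(J)\,\frac{P(D, J \mid \theta)}{F_t(J)}
  \;\ge\; \sum_{J} F_t(J)\,\log \frac{P(D, J \mid \theta)}{F_t(J)}
```

The inequality follows from the concavity of log (Jensen's inequality), and it holds with equality when Ft(J) = P(J | D, θ), which is exactly the distribution the E-step computes.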

Page 31:

The E-step involves Bayes net inference; we can get by with approximate inference.

The M-step involves maximization; we can get away with just improvement (i.e., a few steps of gradient ascent).

Page 32:
Page 33:

0. Initialize the parameters randomly
Loop:
    Inference

Page 34:

Candy Example

Start with 1000 samples

Initialize parameters as

Page 35:
Page 36:

Structure (Topology) Learning

• Search over different network topologies
• Question: How do we decide which topology is better?
  – Idea 1: Check if the independence relations posited by the topology actually hold
  – Idea 2: Consider which topology agrees with the data more (i.e., provides higher likelihood)
    • But we need to be careful: increasing the edges in a network cannot reduce likelihood
  – Idea 3: Penalize the complexity of the network (either using a prior on network topologies, or using syntactic complexity measures)

Page 37:

Structure learning with BIC/MDL scores

dim(G) is the number of free parameters in the model. The denser the connections, the higher dim(G); the more structured the CPTs, the lower dim(G). (The standard form of the score is given below.)
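The BIC/MDL score the slide refers to, in its standard form (a generally known fact, not transcribed from the slide): for a candidate structure G with MLE parameters and N data instances,

```latex
\text{score}_{BIC}(G : D) \;=\; \log P(D \mid \hat{\theta}_G, G) \;-\; \frac{\log N}{2}\,\dim(G)
```

The first term rewards fit (and can only increase as edges are added), while the second term charges for every free parameter, which is what prevents the fully connected network from winning.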

Page 38:
Page 39:

Relational Probabilistic Models

• Bayes nets are "propositional" models.
• We will now look at a generalization of Bayes nets to the "relational" case
  – … where the world is made up of objects and relations between them
  – … think predicate logic (not first-order, though)

Pages 40-43:

Note that we are assuming the same CPT holds for ALL authors/papers/reviews – a tremendous saving in parameters.

Pages 44-49:

Sort of like propositional semantics for predicate logic…

Page 50:

PRMs vs. Bayes Nets

• The semantics of PRMs are in terms of the underlying Bayes nets
• However,
  – The PRM defines the dependencies at the class level rather than at the object level
    • … and these dependencies and CPTs are used for all objects of that class
  – The PRM allows dependencies between the attributes of different objects (e.g. a review's mood affects the paper's decision)

Page 51:
Page 52:

Sort of like doing predicate logic inference by compiling to propositional logic

Page 53:

Need lifted inference techniques

Page 54:
Page 55:

Note that the author of the paper doesn't matter, since we assume the CPT is the same for all papers – significant sample efficiency.

Page 56:
Page 57:

--Slides beyond this not covered--

Page 58:

Undirected Probabilistic Graphical Models (Markov Nets)

(Slides from Sam Roweis' lecture)

Pages 59-63:

Connection to MCMC: MCMC requires sampling a node given its Markov blanket, i.e., we need to use P(x|MB(x)). For Bayes nets, MB(x) contains more nodes than are mentioned in the local distribution CPT(x). For Markov nets, the potentials that mention x involve exactly x's neighbors, so P(x|MB(x)) comes directly from them.

Page 64:

Because the neighbor relation is symmetric, nodes xi and xj are both neighbors of each other.

Pages 65-68:

Markov Networks

• Undirected graphical models (example variables: Smoking, Cancer, Asthma, Cough)
• Potential functions are defined over cliques:

  Smoking   Cancer   Φ(S,C)
  False     False    4.5
  False     True     4.5
  True      False    2.7
  True      True     4.5

  P(x) = \frac{1}{Z} \prod_{c} \Phi_c(x_c), \qquad Z = \sum_{x} \prod_{c} \Phi_c(x_c)
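A brute-force check of the formula above, using only the single Φ(S,C) potential from the table (the other variables in the figure are omitted, so this is an illustration, not the full example model):

```python
from itertools import product

# Potential over the (Smoking, Cancer) clique, values taken from the table above.
phi = {(False, False): 4.5, (False, True): 4.5,
       (True, False): 2.7, (True, True): 4.5}

# Partition function: Z = sum over all assignments of the product of potentials
# (here there is just one potential).
Z = sum(phi[(s, c)] for s, c in product([False, True], repeat=2))

def p(smoking, cancer):
    """P(s, c) = Phi(s, c) / Z."""
    return phi[(smoking, cancer)] / Z

print(Z, p(True, False))   # Z = 16.2, and P(True, False) = 2.7 / 16.2
```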

Page 69:
Page 70:

Markov Networks

• Undirected graphical models
• Log-linear model:

  P(x) = \frac{1}{Z} \exp\Big( \sum_i w_i f_i(x) \Big)

  where w_i is the weight of feature i and f_i(x) is feature i. For the Smoking/Cancer example:

  f_1(Smoking, Cancer) = 1 if ¬Smoking ∨ Cancer, and 0 otherwise;   w_1 = 1.5

Pages 71-78:

Markov Nets vs. Bayes Nets

  Property          Markov Nets             Bayes Nets
  Form              Product of potentials   Product of potentials
  Potentials        Arbitrary               Conditional probabilities
  Cycles            Allowed                 Forbidden
  Partition func.   Z = ?  (global)         Z = 1  (local)
  Indep. check      Graph separation        D-separation
  Indep. props.     Some                    Some
  Inference         MCMC, BP, etc.          Convert to Markov net

Page 79:
Page 80:

Inference in Markov Networks

• Goal: compute marginals and conditionals of

  P(X) = \frac{1}{Z} \exp\Big( \sum_i w_i f_i(X) \Big), \qquad Z = \sum_{X} \exp\Big( \sum_i w_i f_i(X) \Big)

• Exact inference is #P-complete
• Conditioning on the Markov blanket is easy (the partition function cancels out):

  P(x_i \mid MB(x_i)) = \frac{\exp\big(\sum_i w_i f_i(x)\big)}{\exp\big(\sum_i w_i f_i(x_{[x_i=0]})\big) + \exp\big(\sum_i w_i f_i(x_{[x_i=1]})\big)}

• Gibbs sampling exploits this

Page 81:

MCMC: Gibbs Sampling

  state ← random truth assignment
  for i ← 1 to num-samples do
      for each variable x
          sample x according to P(x | neighbors(x))
          state ← state with new value of x
  P(F) ← fraction of states in which F is true

(A runnable sketch follows below.)
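A runnable sketch of the pseudocode above for a tiny log-linear Markov network. The two features and their weights are made up for illustration; the per-variable conditional is computed exactly as in the Markov-blanket formula two slides back, with Z cancelling.

```python
import math, random

# Tiny made-up log-linear Markov net over binary variables Smoking (S), Cancer (C), Cough (K).
features = [
    (lambda s: 1.0 if (not s["S"]) or s["C"] else 0.0, 1.5),   # feature: not S or C
    (lambda s: 1.0 if (not s["C"]) or s["K"] else 0.0, 1.0),   # feature: not C or K
]

def score(state):
    """sum_i w_i f_i(x): the unnormalized log-probability."""
    return sum(w * f(state) for f, w in features)

def gibbs(num_samples, query):
    state = {v: random.random() < 0.5 for v in "SCK"}   # random truth assignment
    hits = 0
    for _ in range(num_samples):
        for v in "SCK":
            # P(v = True | rest) via the Markov-blanket formula; Z cancels out.
            s1 = dict(state, **{v: True})
            s0 = dict(state, **{v: False})
            p_true = math.exp(score(s1)) / (math.exp(score(s1)) + math.exp(score(s0)))
            state[v] = random.random() < p_true
        hits += query(state)
    return hits / num_samples          # fraction of sampled states in which the query is true

print(gibbs(10000, lambda s: s["C"]))  # estimate P(Cancer)
```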

Page 82:

Other Inference Methods

• Many variations of MCMC
• Belief propagation (sum-product)
• Variational approximation
• Exact methods

Page 83:

Learning Markov Networks

• Learning parameters (weights)
  – Generatively
  – Discriminatively
• Learning structure (features)
• In this tutorial: assume complete data (if not: EM versions of the algorithms)

Page 84:

Entanglement in the log likelihood… (example over nodes a, b, c)

Page 85:

Generative Weight Learning

• Maximize likelihood or posterior probability
• Numerical optimization (gradient or 2nd order)
• No local maxima
• Requires inference at each step (slow!)

  \frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]

  where n_i(x) is the number of times feature i is true in the data, and E_w[n_i(x)] is the expected number of times feature i is true according to the model

  P(X) = \frac{1}{Z} \exp\Big( \sum_i w_i f_i(X) \Big), \qquad Z = \sum_{X} \exp\Big( \sum_i w_i f_i(X) \Big)

Page 86:

Discriminative Weight Learning

• Maximize the conditional likelihood of the query (y) given the evidence (x)

  \frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]

  where n_i(x, y) is the number of true groundings of clause i in the data, and E_w[n_i(x, y)] is the expected number of true groundings according to the model

• Approximate the expected counts by the counts in the MAP state of y given x

Page 87:

Structure Learning

• How to learn the structure of a Markov network?
  – … not too different from learning the structure of a Bayes network: a discrete search through the space of possible graphs, trying to maximize the data probability…