Notes on Graphical Models Padhraic Smyth Department of Computer Science University of California,...

Notes on Graphical Models

Padhraic SmythDepartment of Computer Science

University of California, Irvine

ProbabilisticModel

Real WorldData

P(Data | Parameters)

ProbabilisticModel

Real WorldData

P(Parameters | Data)

ProbabilisticModel

Real WorldData

Generative Model, Probability

Inference, Statistics

Part 1: Review of Probability

Notation and Definitions

• X is a random variable– Lower-case x is some possible value for X– “X = x” is a logical proposition: that X takes value x– There is uncertainty about the value of X

• e.g., X is the Dow Jones index at 5pm tomorrow

• p(X = x) is the probability that proposition X=x is true– often shortened to p(x)

• If the set of possible x’s is finite, we have a probability distribution and p(x) = 1

• If the set of possible x’s is infinite, p(x) is a density function, and p(x) integrates to 1 over the range of X

Example

• Let X be the Dow Jones Index (DJI) at 5pm Monday August 22nd (tomorrow)

• X can take real values from 0 to some large number

• p(x) is a density representing our uncertainty about X– This density could be constructed from historical data, e.g.,

– After 5pm p(x) = 1 for some value of x (no uncertainty), once we hear from Wall Street what x is

Probability as Degree of Belief

• Different agents can have different p(x)’s– Your p(x) and the p(x) of a Wall Street expert might be

quite different– OR: if we were on vacation we might not have access to

stock market information• we would still be uncertain about p(x) after 5pm

• So we should really think of p(x) as p(x | BI)

– Where BI is background information available to agent I

– (will drop explicit conditioning on BI in notation)

• Thus, p(x) represents the degree of belief that agent I has in proposition x, conditioned on available background information

Comments on Degree of Belief

• Different agents can have different probability models– There is no necessarily “correct” p(x)– Why? Because p(x) is a model built on whatever assumptions or

background information we use– Naturally leads to the notion of updating

• p(x | BI) -> p(x | BI, CI)

• This is the subjective Bayesian interpretation of probability– Generalizes other interpretations (such as frequentist)– Can be used in cases where frequentist reasoning is not applicable– We will use “degree of belief” as our interpretation of p(x) in this

tutorial

• Note!– Degree of belief is just our semantic interpretation of p(x)– The mathematics of probability (e.g., Bayes rule) remain the same

regardless of our semantic interpretation

Multiple Variables

• p(x, y, z)– Probability that X=x AND Y=y AND Z =z– Possible values: cross-product of X Y Z

– e.g., X, Y, Z each take 10 possible values• x,y,z can take 103 possible values• p(x,y,z) is a 3-dimensional array/table

– Defines 103 probabilities• Note the exponential increase as we add more

variables

– e.g., X, Y, Z are all real-valued• x,y,z live in a 3-dimensional vector space• p(x,y,z) is a positive function defined over this space,

integrates to 1

Conditional Probability

• p(x | y, z)– Probability of x given that Y=y and Z = z– Could be

• hypothetical, e.g., “if Y=y and if Z = z”• observational, e.g., we observed values y and z

– can also have p(x, y | z), etc– “all probabilities are conditional probabilities”

• Computing conditional probabilities is the basis of many prediction and learning problems, e.g.,– p(DJI tomorrow | DJI index last week)– expected value of [DJI tomorrow | DJI index next week)– most likely value of parameter given observed data

Computing Conditional Probabilities

• Variables A, B, C, D– All distributions of interest related to A,B,C,D can be computed

from the full joint distribution p(a,b,c,d)

• Examples, using the Law of Total Probability

– p(a) = {b,c,d} p(a, b, c, d)

– p(c,d) = {a,b} p(a, b, c, d)

– p(a,c | d) = {b} p(a, b, c | d)

where p(a, b, c | d) = p(a,b,c,d)/p(d)

• These are standard probability manipulations: however, we will see how to use these to make inferences about parameters and unobserved variables, given data

Two Practical Problems

(Assume for simplicity each variable takes K values)

• Problem 1: Computational Complexity– Conditional probability computations scale as O(KN)

• where N is the number of variables being summed over

• Problem 2: Model Specification– To specify a joint distribution we need a table of O(KN) numbers

– Where do these numbers come from?

Two Key Ideas

• Problem 1: Computational Complexity– Idea: Graphical models

• Structured probability models lead to tractable inference

• Problem 2: Model Specification– Idea: Probabilistic learning

• General principles for learning from data

Part 2: Graphical Models

“…probability theory is more fundamentally concerned with the structure of reasoning and causation than with numbers.”

Glenn Shafer and Judea PearlIntroduction to Readings in Uncertain Reasoning,Morgan Kaufmann, 1990

Conditional Independence

• A is conditionally independent of B given C iff p(a | b, c) = p(a | c)

(also implies that B is conditionally independent of A given C)

• In words, B provides no information about A, if value of C is known

• Example:– a = “reading ability”– b = “height”– c = “age”

• Note that conditional independence does not imply marginal independence

Graphical Models

• Represent dependency structure with a directed graph– Node <-> random variable– Edges encode dependencies

• Absence of edge -> conditional independence– Directed and undirected versions

• Why is this useful?– A language for communication– A language for computation

• Origins: – Wright 1920’s– Independently developed by Spiegelhalter and Lauritzen in

statistics and Pearl in computer science in the late 1980’s

Examples of 3-way Graphical Models

A CB Marginal Independence:p(A,B,C) = p(A) p(B) p(C)

Conditionally independent effects:p(A,B,C) = p(B|A)p(C|A)p(A)

B and C are conditionally independentGiven A

e.g., A is a disease, and we model B and C as conditionally independentsymptoms given A

Independent Causes:p(A,B,C) = p(C|A,B)p(A)p(B)

A CB Markov dependence:p(A,B,C) = p(C|B) p(B|A)p(A)

Real-World Example

Monitoring Intensive-Care Patients• 37 variables• 509 parameters …instead of 237

(figure courtesy of Kevin Murphy/Nir Friedman)

PCWP CO

HREKG HRSAT

ERRCAUTERHRHISTORY

CATECHOL

SAO2 EXPCO2

ARTCO2

VENTALV

VENTLUNG VENITUBE

DISCONNECT

MINVOLSET

VENTMACHKINKEDTUBEINTUBATIONPULMEMBOLUS

PAP SHUNT

ANAPHYLAXIS

MINOVL

FIO2PRESS

INSUFFANESTHTPR

LVFAILURE

ERRBLOWOUTPUTSTROEVOLUMELVEDVOLUME

HYPOVOLEMIA

Directed Graphical Models

p(A,B,C) = p(C|A,B)p(A)p(B)

In general, p(X1, X2,....XN) = p(Xi | parents(Xi ) )

• Probability model has simple factored form

• Directed edges => direct dependence

• Absence of an edge => conditional independence

• Also known as belief networks, Bayesian networks, causal networks

In general, p(X1, X2,....XN) = p(Xi | parents(Xi ) )

Reminders from Probability….

• Law of Total Probability

P(a) = b P(a, b) = b P(a | b) P(b)

– Conditional version:

P(a|c) = b P(a, b|c) = b P(a | b , c) P(b|c)

Graphical Models for Computation

PCWP CO

HREKG HRSAT

ERRCAUTERHRHISTORY

CATECHOL

SAO2 EXPCO2

ARTCO2

VENTALV

VENTLUNG VENITUBE

DISCONNECT

MINVOLSET

VENTMACHKINKEDTUBEINTUBATIONPULMEMBOLUS

PAP SHUNT

ANAPHYLAXIS

MINOVL

FIO2PRESS

INSUFFANESTHTPR

LVFAILURE

ERRBLOWOUTPUTSTROEVOLUMELVEDVOLUME

HYPOVOLEMIA

• Say we want to compute P(BP|Press)

• Law of total probability: -> must sum over all other variables -> exponential in # variables

• Factorization: -> joint distribution factors into smaller tables

• Can now sum over smaller tables, can reduce complexity dramatically

Example

p(A, B, C, D, E, F, G) = p( variable | parents ) = p(A|B)p(C|B)p(B|D)p(F|E)p(G|E)p(E|D) p(D)

Example

Say we want to compute p(a | c, g)

Example

Direct calculation: p(a|c,g) = bdef p(a,b,d,e,f | c,g)

Complexity of the sum is O(K4)

Example

Reordering (using factorization):

b p(a|b) d p(b|d,c) e p(d|e) f p(e,f |g)

Example

Reordering:

bp(a|b) d p(b|d,c) e p(d|e) f p(e,f |g)

p(e|g)

Example

Reordering:

bp(a|b) d p(b|d,c) e p(d|e) p(e|g)

p(d|g)

Example

Reordering:

bp(a|b) d p(b|d,c) p(d|g)

p(b|c,g)

Example

Reordering:

bp(a|b) p(b|c,g)

p(a|c,g) Complexity is O(K), compared to O(K4)

Graphs with “loops”

Message passing algorithm does not work whenthere are multiple paths between 2 nodes

Graphs with “loops”

General approach: “cluster” variablestogether to convert graph to a tree

Reduce to a Tree

Probability Calculations on Graphs

• General algorithms exist - beyond trees – Complexity is typically O(m (number of parents ) )

(where m = arity of each node)– If single parents (e.g., tree), -> O(m)– The sparser the graph the lower the complexity

• Technique can be “automated”– i.e., a fully general algorithm for arbitrary graphs– For continuous variables:

• replace sum with integral– For identification of most likely values

• Replace sum with max operator

Part 3: Learning with Graphical Models

Notes on Graphical Models Padhraic Smyth Department of Computer Science University of California,...

Documents

Transcript of Notes on Graphical Models Padhraic Smyth Department of Computer Science University of California,...

CS 277: Data Mining Recommender Systems Padhraic Smyth Department of Computer Science University of California, Irvine.

Data Mining Lectures Lectures 6/7: Regression Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 6, 7: Regression Algorithms Padhraic Smyth Department.

Data Mining Lectures Lecture 18: Credit Scoring Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 18: Credit Scoring Padhraic Smyth Department of.

Data Mining Lectures Lecture 7: Classification Padhraic Smyth, UC Irvine ICS 278: Data Mining Lectures 7 and 8: Classification Algorithms Padhraic Smyth.

Conditional Chow-Liu Tree Structures for Modeling Discrete-Valued Vector Time Series Sergey Kirshner, UC Irvine Padhraic Smyth, UC Irvine Andrew Robertson,

Author-Topic Models for Large Text Corpora Padhraic Smyth Department of Computer Science University of California, Irvine In collaboration with: Mark Steyvers.

Lecture 16: Unsupervised Learning from Text Padhraic Smyth Department of Computer Science University of California, Irvine.

Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 1: Introduction to Data Mining Padhraic Smyth Department.

CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure.

Data Mining Lectures Lecture 17: Web Log Mining Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 17: Web Log Mining Padhraic Smyth Department of.

Data Mining and the OptIPuter Padhraic Smyth University of California, Irvine.

© Padhraic Smyth, UC Irvine A Review of Hidden Markov Models for Context-Based Classification ICML’01 Workshop on Temporal and Spatial Learning Williams.

Clustering Credit: Padhraic Smyth University of California, Irvine.

CS 277: Data Mining Analysis of Web User Data Padhraic Smyth Department of Computer Science University of California, Irvine.

1 HICSS Keynote Talk, Jan 2008 © Padhraic Smyth, UC Irvine: From Gauss to Google: Data Analysis in the Digital Age Padhraic Smyth Department of Computer.

Algorithms for Data Mining and Querying with Graphs Investigators: Padhraic Smyth, Sharad Mehrotra University of California, Irvine Students: Joshua O’

Data Mining Lectures Lecture 3: EDA and Visualization Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 3: Exploratory Data Analysis and Visualization.

Probabilistic Learning Tutorial: P. Smyth, UC Irvine, August 2005 Principles and Applications of Probabilistic Learning Padhraic Smyth Department of Computer.

CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine.

Inference in First-Order Logic CS 271: Fall 2007 Instructor: Padhraic Smyth.