
Introduction to Graphical Models for Data Mining

Arindam Banerjee (banerjee@cs.umn.edu)

Dept of Computer Science & Engineering, University of Minnesota, Twin Cities

16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

July 25, 2010


Introduction

• Graphical Models
  – Brief Overview

• Part I: Tree-Structured Graphical Models
  – Exact Inference

• Part II: Mixed Membership Models
  – Latent Dirichlet Allocation
  – Generalizations, Applications

• Part III: Graphical Models for Matrix Analysis
  – Probabilistic Matrix Factorization
  – Probabilistic Co-clustering
  – Stochastic Block Models


Graphical Models: What and Why

• Statistical Data Analysis
  – Build diagnostic/predictive models from data
  – Uncertainty quantification based on (minimal) assumptions

• The i.i.d. assumption
  – Data is independently and identically distributed
  – Example: words in a document drawn i.i.d. from the dictionary

• Graphical models
  – Assume (graphical) dependencies between (random) variables
  – Closer to reality; domain knowledge can be captured
  – Learning/inference is much more difficult


Flavors of Graphical Models

• Basic nomenclature
  – Node = random variable, which may be observed or hidden
  – Edge = statistical dependency

• Two popular flavors: 'directed' and 'undirected'

• Directed graphs
  – A directed graph between random variables; captures causal dependencies
  – Examples: Bayesian networks, Hidden Markov Models
  – Joint distribution is a product of P(child | parents)

• Undirected graphs
  – An undirected graph between random variables
  – Examples: Markov/conditional random fields
  – Joint distribution in terms of potential functions

[Figure: example graph over random variables X1, ..., X5]


Bayesian Networks

• Joint distribution in terms of P(X|Parents(X))

[Figure: example Bayesian network over X1, ..., X5]


Example I: Burglary Network

This and several other examples are from the Russell-Norvig AI book

Computing Probabilities of Events

• The probability of any complete assignment can be computed:
  P(B,E,A,J,M) = P(B) P(E|B) P(A|B,E) P(J|B,E,A) P(M|B,E,A,J)
               = P(B) P(E) P(A|B,E) P(J|A) P(M|A)

• Example: P(b,¬e,a,¬j,m) = P(b) P(¬e) P(a|b,¬e) P(¬j|a) P(m|a)


Example II: Rain Network


Example III: “Car Won’t Start” Diagnosis


Inference

• Some variables in the Bayes net are observed
  – The evidence/data; e.g., John has not called, Mary has called

• Inference
  – How to compute the value/probability of other variables
  – Example: what is the probability of burglary, i.e., P(b | ¬j, m)?


Inference Algorithms

• Graphs without loops: tree-structured graphs
  – Efficient exact inference algorithms are possible
  – Sum-product algorithm, and its special cases
    • Belief propagation in Bayes nets
    • Forward-Backward algorithm in Hidden Markov Models (HMMs)

• Graphs with loops
  – Junction tree algorithms
    • Convert into a graph without loops
    • May lead to an exponentially large graph
  – Sum-product/message-passing algorithm, 'disregarding loops' (loopy belief propagation)
    • Active research topic; correct convergence 'not guaranteed'
    • Works well in practice
  – Approximate inference


Approximate Inference

• Variational inference
  – Deterministic approximation
  – Approximate the complex true distribution over latent variables
  – Replace it with a family of simple/tractable distributions
    • Use the best approximation in the family
  – Examples: mean-field, Bethe, Kikuchi, expectation propagation

• Stochastic inference
  – Simple sampling approaches
  – Markov Chain Monte Carlo (MCMC) methods
    • Powerful family of methods
  – Gibbs sampling
    • Useful special case of MCMC methods

Part I: Tree Structured Graphical Models

• The Inference Problem

• Factor Graphs and the Sum-Product Algorithm

• Example: Hidden Markov Models

• Generalizations


The Inference Problem


Complexity of Naïve Inference


Bayes Nets to Factor Graphs


Factor Graphs: Product of Local Functions


Marginalize Product of Functions (MPF)

• Marginalize product of functions

• Computing marginal functions

• The “not-sum” notation


MPF using Distributive Law

• We focus on two examples: g1(x1) and g3(x3)

• Main idea: the distributive law, ab + ac = a(b + c)

• For g1(x1), we have

• For g3(x3), we have
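To make the saving concrete, a toy sketch with a hypothetical two-factor function g(x1, x2, x3) = fA(x1, x2) fB(x2, x3) over variables with 4 values each; the naive marginal touches O(|X|^3) terms, the factored one only O(|X|^2):

```python
import numpy as np

rng = np.random.default_rng(0)
fA = rng.random((4, 4))   # fA(x1, x2), hypothetical factor table
fB = rng.random((4, 4))   # fB(x2, x3)

# Naive: sum the full product over all (x2, x3) for each x1.
g1_naive = np.array([sum(fA[x1, x2] * fB[x2, x3]
                         for x2 in range(4) for x3 in range(4))
                     for x1 in range(4)])

# Distributive law: g1(x1) = sum_x2 fA(x1, x2) * (sum_x3 fB(x2, x3)).
g1_fast = fA @ fB.sum(axis=1)

assert np.allclose(g1_naive, g1_fast)
```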


Computing Single Marginals

• Main idea:
  – The target node becomes the root
  – Pass messages from the leaves up to the root


Compute the product of descendants with f, then do a not-sum over the parent

Message Passing


Compute product of descendants

Example: Computing g1(x1)


Example: Computing g3(x3)


The efficient algorithm is encoded in the structure of the factor graph

Hidden Markov Models (HMMs)


Latent variables: z0, z1, …, zt−1, zt, zt+1, …, zT
Observed variables: x1, …, xt−1, xt, xt+1, …, xT

Inference problems:
1. Compute p(x1:T)
2. Compute p(zt | x1:T)
3. Find max over z1:T of p(z1:T | x1:T)

Similar problems arise for chain-structured Conditional Random Fields (CRFs)

The Sum-Product Algorithm

• To compute gi(xi), form a tree rooted at xi

• Starting from the leaves, apply the following two rules
  – Product rule: at a variable node, take the product of the messages from its descendants
  – Sum-product rule: at a factor node, take the product of f with the messages from its descendants, then perform a not-sum over the parent node

• To compute all marginals
  – Can be done one at a time, but the repeated computations are not efficient
  – Instead, pass messages simultaneously following the sum-product algorithm
  – Examples: belief propagation, the Forward-Backward algorithm, etc.
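A minimal sketch of the two rules on a chain-structured factor graph x1 - fA - x2 - fB - x3, with hypothetical factor tables; the final assert checks a computed marginal against brute-force summation:

```python
import numpy as np

# Hypothetical factor tables for g(x1,x2,x3) = fA(x1,x2) * fB(x2,x3).
rng = np.random.default_rng(1)
fA, fB = rng.random((3, 3)), rng.random((3, 3))

m_x1_fA = np.ones(3)      # leaf variable nodes send unit messages
m_x3_fB = np.ones(3)
m_fA_x2 = fA.T @ m_x1_fA  # factor rule: multiply by fA, not-sum over x1
m_fB_x2 = fB @ m_x3_fB    # factor rule: not-sum over x3
m_fA_x1 = fA @ m_fB_x2    # x2 forwards m_fB_x2 to fA; fA not-sums over x2
m_fB_x3 = fB.T @ m_fA_x2  # x2 forwards m_fA_x2 to fB; fB not-sums over x2

g1 = m_fA_x1              # marginal of x1
g2 = m_fA_x2 * m_fB_x2    # variable rule: product of incoming messages
g3 = m_fB_x3              # marginal of x3

# Brute-force check: g2(x2) = sum_{x1,x3} fA(x1,x2) fB(x2,x3)
assert np.allclose(g2, fA.sum(axis=0) * fB.sum(axis=1))
```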


Sum-Product Updates


Example: Step 1


Example: Step 2


Example: Step 3


Example: Step 4


Example: Step 5


Example: Termination


HMMs Revisited


Latent variables: z0, z1, …, zT
Observed variables: x1, …, xT

Inference problems:
1. Compute p(x1:T)
2. Compute p(zt | x1:T)

The sum-product algorithm on this chain is known as the 'forward-backward' algorithm; the same recursions give smoothing in Kalman filtering.
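A minimal sketch of both recursions (unscaled, so only suitable for short sequences; pi, A, and B are assumed to be the initial distribution, transition matrix, and emission matrix of a toy discrete HMM):

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Sum-product on an HMM chain (unscaled; use logs for long sequences).
    pi[i] = p(z_1 = i), A[i, j] = p(z_t = j | z_{t-1} = i),
    B[i, x] = p(x | z = i). Returns p(x_{1:T}) and p(z_t | x_{1:T})."""
    T = len(obs)
    alpha = np.zeros((T, len(pi)))
    beta = np.ones((T, len(pi)))
    alpha[0] = pi * B[:, obs[0]]                   # forward (filtering) messages
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):                 # backward messages
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                   # p(x_{1:T})
    gamma = alpha * beta / likelihood              # smoothed posteriors
    return likelihood, gamma
```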

Distributive Law on Semi-Rings

• The idea can be applied to any commutative semi-ring

• Semi-ring 101
  – Two operations (+, ×): associative, commutative, with identity elements
  – Distributive law: a×b + a×c = a×(b+c)

• The same message-passing idea then yields:
  – Belief propagation in Bayes nets
  – MAP inference in HMMs
  – Max-product algorithm
  – Alternative to Viterbi decoding
  – Kalman filtering
  – Error-correcting codes, turbo codes
  – …
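For instance, on the semi-ring (max, ×) the forward recursion above computes MAP state sequences instead of marginals; a sketch reusing the (pi, A, B) conventions of the forward-backward code:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Max-product on the same HMM: swap sum for max in the forward
    recursion, keep back-pointers, and backtrack for the MAP sequence."""
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A         # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                  # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```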

Message Passing in General Graphs

• Tree-structured graphs
  – Message passing is guaranteed to give correct solutions
  – Examples: HMMs, Kalman filters

• General graphs
  – Active research topic; progress has been made in the past 10 years
  – Message passing
    • May not converge
    • May converge to a 'local minimum' of the 'Bethe variational free energy'
  – New approaches to convergent and correct message passing

• Applications
  – TrueSkill: ranking system for Xbox Live
  – Turbo codes: 3G/4G phones, satellite communication, WiMAX, Mars orbiter


Part II: Mixed Membership Models

• Mixture Models vs Mixed Membership Models

• Latent Dirichlet Allocation

• Inference
  – Mean-Field and Collapsed Variational Inference
  – MCMC/Gibbs Sampling

• Applications

• Generalizations


Background: Plate Diagrams

[Figure: a plate over b with count 3 is shorthand for the expanded network a → b1, b2, b3]

Plates give a compact representation of large Bayesian networks.

Model 1: Independent Features (d = 3, n = 1)

[Plate diagram: θ → x_dn, with plates over features d and observations n]
[Figure: one density per feature; example point x = (0.3, 1, −2)]

Model 2: Naïve Bayes (Mixture Models)

[Plate diagram: π → z → x_dn ← θ_kd, with plates over components k, features d, and observations n]

Naïve Bayes Model

[Plate diagram: π → z → x ← θ, with plates over n, k, d]
[Figure: per-feature Gaussian mixture densities; example point x = (−1, 3, −0.5), generated from a single mixture component]

[Figure: a second example point x = (0.1, 2.1, −1.5) drawn from the same naïve Bayes model]

Model 3: Mixed Membership Model

[Plate diagrams: evolution from Model 1 (θ → x_dn) to Model 2 (π → z → x_dn ← θ_kd) to Model 3 (α → π → z → x_dn ← θ_kd), where each data point's membership vector π is drawn from a Dirichlet(α) prior]

Mixed Membership Models

[Plate diagram: α → π → z → x_dn ← θ, with plates over k and d]
[Figure: example point x = (0.7, 3.1, −1), whose features come from different mixture components]

[Figure: a second example point x = (0.9, 2.1, −2), with a different mixed membership]

Mixture Model vs Mixed Membership Model

[Figure: mixture-model samples x = (−1, 3, −0.5) and x = (0.1, 2.1, −1.5) vs mixed-membership samples x = (0.7, 3.1, −1) and x = (0.9, 2.1, −2)]

Single-component membership vs. multi-component mixed membership

Latent Dirichlet Allocation (LDA)

[Plate diagram: plates over documents D, words Nd per document, and topics K]

θ(d) ~ Dirichlet(α)     (distribution over topics for each document; α is the Dirichlet prior)
zi ~ Discrete(θ(d))     (topic assignment for each word)
xi ~ Discrete(φ(zi))    (word generated from the assigned topic's word distribution φ(j))
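A toy simulation of this generative process (for concreteness, φ is itself drawn from a Dirichlet, as in the smoothed variant later in the tutorial; all sizes and hyperparameters here are made up):

```python
import numpy as np

def generate_lda_corpus(D, K, V, n_words, alpha=0.5, beta=0.1, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(beta * np.ones(V), size=K)   # topic-word distributions
    docs, thetas = [], []
    for _ in range(D):
        theta = rng.dirichlet(alpha * np.ones(K))    # doc-topic distribution
        z = rng.choice(K, size=n_words, p=theta)     # topic for each word
        words = [int(rng.choice(V, p=phi[k])) for k in z]  # word from its topic
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), phi

docs, thetas, phi = generate_lda_corpus(D=5, K=3, V=20, n_words=50)
```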


LDA Generative Model

[Figure, shown over two slides: the LDA network unrolled for document1 and document2, with per-document topic proportions θ1, θ2 ~ Dirichlet(α), per-word topic assignments z11, z12, z13 and z21, z22 drawn from Discrete(θd), and words x11, ..., x22 generated from the assigned topics]

Learning: Inference and Estimation


Variational Inference


Variational EM for LDA


E-step: Variational Distribution and Updates


M-step: Parameter Estimation


Results: Topics Inferred


Results: Perplexity Comparison


Aviation Safety Reports (NASA)


Results: NASA Reports I

Topics (top words):
  Arrival/Departure: runway, approach, departure, altitude, turn, tower, air traffic control, heading, taxiway, flight
  Passenger: passenger, attendant, flight, seat, medical, captain, attendants, lavatory, told, police
  Maintenance: maintenance, engine, mel, zzz, aircraft, installed, check, inspection, fuel, work


Results: NASA Reports II

Topics (top words):
  Medical Emergency: medical, passenger, doctor, attendant, oxygen, emergency, paramedics, flight, nurse, aed
  Wheel Maintenance: tire, wheel, assembly, nut, spacer, main, axle, bolt, missing, tires
  Weather Condition: knots, turbulence, aircraft, degrees, ice, winds, wind, speed, air speed, conditions
  Departure: departure, sid, dme, altitude, climbing, mean sea level, heading, procedure, turn, degree


The pilot flies an owner's airplane with the owner as a passenger, and loses contact with the center during the flight.

During a sky-diving flight, a jet approaches at the same altitude, but an accident is avoided.

Red: Flight Crew   Blue: Passenger   Green: Maintenance

Two-Dimensional Visualization for Reports


The altimeter has a problem, but the pilot overcomes the difficulty during the flight.

During acceleration, a flap-retraction issue occurs. The pilot returns to base and lands, and the mechanic finds the problem.

Red: Flight Crew   Blue: Passenger   Green: Maintenance

Two-Dimensional Visualization for Reports


The pilot has a landing-gear problem; the maintenance crew joins the radio conversation to help.

The captain has a medical emergency.

Red: Flight Crew   Blue: Passenger   Green: Maintenance

Two-Dimensional Visualization for Reports


Red: Flight Crew Blue: Passenger Green: Maintenance

Mixed Membership of Reports

Flight Crew: 0.7039, Passenger: 0.0009, Maintenance: 0.2953
Flight Crew: 0.1405, Passenger: 0.0663, Maintenance: 0.7932
Flight Crew: 0.2563, Passenger: 0.6599, Maintenance: 0.0837
Flight Crew: 0.0013, Passenger: 0.0013, Maintenance: 0.9973


Smoothed Latent Dirichlet Allocation

[Plate diagram: plates over documents D, words Nd per document, and topics T]

θ(d) ~ Dirichlet(α)     (distribution over topics for each document)
zi ~ Discrete(θ(d))     (topic assignment for each word)
φ(j) ~ Dirichlet(β)     (distribution over words for each topic, now also given a Dirichlet prior)
xi ~ Discrete(φ(zi))    (word generated from the assigned topic)

Stochastic Inference using Markov Chains

• Powerful family of approximate inference methods
  – Markov Chain Monte Carlo (MCMC) methods, Gibbs sampling

• The basic idea
  – Need to marginalize over a complex latent variable distribution:
    p(x|Θ) = ∫ p(x,z|Θ) dz = ∫ p(x|z,Θ) p(z|Θ) dz
  – Posterior expectations E_{z~p(z|x,Θ)}[f(z)] are similarly intractable
  – Draw (approximately independent) samples from p(z|x,Θ)
  – Compute a sample-based average instead of the full integral

• Main issue: how to draw samples?
  – Difficult to draw samples directly from p(z|x,Θ)
  – Construct a Markov chain whose stationary distribution is p(z|x,Θ)
  – Run the chain until 'convergence'
  – Obtain samples from p(z|x,Θ)


The Metropolis-Hastings Algorithm
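A minimal random-walk Metropolis sketch; the proposal is a symmetric Gaussian, so the Hastings correction cancels, and log_target can be any unnormalized log density, e.g. log p(z, x | Θ) viewed as a function of z:

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis: propose x' = x + N(0, step^2), accept
    with probability min(1, target(x') / target(x))."""
    rng = np.random.default_rng(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        x_new = x + step * rng.standard_normal()
        logp_new = log_target(x_new)
        if np.log(rng.uniform()) < logp_new - logp:  # accept
            x, logp = x_new, logp_new
        samples.append(x)                            # else keep current state
    return np.array(samples)

# Example: sample from N(2, 1) given only its unnormalized log density.
samples = metropolis_hastings(lambda x: -0.5 * (x - 2.0) ** 2, 0.0, 5000)
print(samples[1000:].mean())   # close to 2 after burn-in
```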


The Gibbs Sampler


Collapsed Gibbs Sampling for LDA
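A sketch of the collapsed sampler, assuming the standard update in which θ and φ are integrated out and each assignment is resampled from p(z = k | rest) proportional to (n_dk + α)(n_kw + β) / (n_k + Vβ):

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: theta and phi are integrated out,
    and each topic assignment z is resampled from its full conditional."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total counts per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k           # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw
```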


Collapsed Variational Inference for LDA


Results: Comparison of Inference Methods


Generalizations

• Generalized topic models
  – Correlated Topic Models
  – Dynamic Topic Models, Topics over Time
  – Dynamic topics with birth/death

• Mixed membership models over non-text data, applications
  – Mixed-membership naïve Bayes
  – Discriminative models for classification
  – Cluster ensembles

• Nonparametric priors
  – Dirichlet Process priors: infer the number of topics
  – Hierarchical Dirichlet Processes: infer hierarchical structures
  – Several other priors: Pachinko allocation, Gaussian Processes, IBP, etc.


CTM Results


DTM Results


DTM Results II


Mixed Membership Naïve Bayes

• For each data point:
  – Choose π ~ Dirichlet(α)
  – For each of the observed features fn:
    • Choose a class zn ~ Discrete(π)
    • Choose a feature value xn from p(xn | zn, fn, Θ), which could be Gaussian, Poisson, Bernoulli, …
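A toy simulation of this process, assuming a Gaussian p(xn | zn, fn, Θ) with unit variance and made-up means:

```python
import numpy as np

def generate_mmnb(N, F, K, alpha=0.5, seed=0):
    """Toy sample from the mixed-membership naive Bayes process, with
    Gaussian feature models: theta[k, f] is the mean for class k, feature f."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0, 3, size=(K, F))       # per-class, per-feature means
    X = np.zeros((N, F))
    for i in range(N):
        pi = rng.dirichlet(alpha * np.ones(K))  # mixed membership for point i
        for f in range(F):
            z = rng.choice(K, p=pi)             # each feature picks its own class
            X[i, f] = rng.normal(theta[z, f], 1.0)
    return X
```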


MMNB vs NB: Perplexity Surfaces

• MMNB typically achieves a lower perplexity than NB
• On the test set, NB shows overfitting, while MMNB is stable and robust

[Figure: perplexity surfaces, NB vs. MMNB]

Discriminative Mixed Membership Models


Results: DLDA for text classification

Fast DLDA generally has higher accuracy on most of the datasets


Topics from DLDA


cabin           flight    ice          aircraft   flight
descent         hours     aircraft     gate       smoke
pressurization  time      flight       ramp       cabin
emergency       crew      wing         wing       passenger
flight          day       captain      taxi       aircraft
aircraft        duty      icing        stop       captain
pressure        rest      engine       ground     cockpit
oxygen          trip      anti         parking    attendant
atc             zzz       time         area       smell
masks           minutes   maintenance  line       emergency

(each column lists the top words of one DLDA topic)


Cluster Ensembles

• Combining multiple base clusterings of a dataset
• Robust and stable
• Distributed and scalable
• Knowledge reuse, privacy preserving

[Figure: base clustering1, base clustering2, base clustering3 combined into a consensus clustering]


Problem Formulation

• Input & Output

[Figure: data points and their base clusterings (input) mapped to a consensus clustering (output)]


Results: State-of-the-art vs Bayesian Ensembles


Part III: Graphical Models for Matrix Analysis

• Probabilistic Matrix Factorizations

• Probabilistic Co-clustering

• Stochastic Block Structures


Matrix Factorization

• Singular value decomposition (SVD)

• Problems
  – Large matrices, with millions of rows/columns
    • SVD can be rather slow
  – Sparse matrices, where most entries are missing
    • Traditional approaches cannot handle missing entries


Matrix Factorization: "Funk SVD"

• Model X ∈ R^(n×m) as UV^T, where
  – U ∈ R^(n×k), V ∈ R^(m×k)
  – Alternately optimize U and V


X̂ij = ui^T vj
error = (Xij − X̂ij)² = (Xij − ui^T vj)²

Matrix Factorization (Contd)

X̂ij = ui^T vj
error = (Xij − X̂ij)²

• Gradient descent updates:

uik(t+1) = uik(t) + η (Xij − X̂ij) vjk(t)
vjk(t+1) = vjk(t) + η (Xij − X̂ij) uik(t)
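A sketch of these updates as stochastic gradient descent over the observed entries; the regularization term is a common addition, not part of the update shown above:

```python
import numpy as np

def factorize(observed, n, m, k=10, eta=0.01, reg=0.02, epochs=50, seed=0):
    """SGD matrix factorization: X ~ U V^T, trained only on observed entries.
    observed is a list of (i, j, x_ij) triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(epochs):
        for i, j, x in observed:
            err = x - U[i] @ V[j]      # x_ij - u_i^T v_j
            u_old = U[i].copy()        # update both factors from the same point
            U[i] += eta * (err * V[j] - reg * u_old)
            V[j] += eta * (err * u_old - reg * V[j])
    return U, V
```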

Probabilistic Matrix Factorization (PMF)


ui ~ N(0, σu² I)
vj ~ N(0, σv² I)
Xij ~ N(ui^T vj, σ²)

Inference using gradient descent
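Taking logs of the Gaussian likelihood and priors above shows why gradient descent on PMF amounts to regularized squared-error factorization (the standard observation in the PMF paper):

```latex
\ln p(U, V \mid X)
  = -\frac{1}{2\sigma^2} \sum_{(i,j)\,\mathrm{observed}} \big(X_{ij} - u_i^{\top} v_j\big)^2
    - \frac{1}{2\sigma_u^2} \sum_i \lVert u_i \rVert^2
    - \frac{1}{2\sigma_v^2} \sum_j \lVert v_j \rVert^2
    + \mathrm{const}
```

So the ratios σ²/σu² and σ²/σv² play the role of the regularization weights in the "Funk SVD" updates.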

Bayesian Probabilistic Matrix Factorization


µu ~ N(µ0, Λu), Λu ~ W(ν0, W0)   (Gaussian and Wishart hyperpriors)
µv ~ N(µ0, Λv), Λv ~ W(ν0, W0)

ui ~ N(µu, Λu)
vj ~ N(µv, Λv)
Xij ~ N(ui^T vj, σ²)

Inference using MCMC

Results: PMF on the Netflix Dataset


Results: Bayesian PMF on Netflix



Co-clustering: Gene Expression Analysis

[Figure: gene expression matrix, original (left) vs. co-clustered (right)]


Co-clustering and Matrix Approximation


Probabilistic Co-clustering

[Figure: row clusters and column clusters of the matrix]


Generative Process

• Assume a mixed membership for each row and column
• Assume a Gaussian for each co-cluster

1. Pick row/column clusters
2. Generate each entry of the matrix
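A toy simulation of this generative process, with made-up cluster proportions and co-cluster means:

```python
import numpy as np

def generate_coclustered(n, m, K1, K2, seed=0):
    """Toy sample from the process above: pick a row cluster and a column
    cluster per entry, then draw the entry from that co-cluster's Gaussian."""
    rng = np.random.default_rng(seed)
    pi_row = rng.dirichlet(np.ones(K1), size=n)   # per-row mixed membership
    pi_col = rng.dirichlet(np.ones(K2), size=m)   # per-column mixed membership
    mu = rng.normal(0, 5, size=(K1, K2))          # one Gaussian mean per co-cluster
    X = np.zeros((n, m))
    for u in range(n):
        for v in range(m):
            z1 = rng.choice(K1, p=pi_row[u])       # step 1: pick clusters
            z2 = rng.choice(K2, p=pi_col[v])
            X[u, v] = rng.normal(mu[z1, z2], 1.0)  # step 2: generate the entry
    return X
```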


Reduction to Mixture Models



Bayesian Co-clustering (BCC)


• A Dirichlet distribution over all possible mixed memberships


Learning: Inference and Estimation

• Learning
  – Estimate model parameters (α1, α2, Θ)
  – Infer the 'mixed memberships' of individual rows and columns
  – Expectation Maximization (EM)

• Issues
  – The posterior probability cannot be obtained in closed form
  – Parameter estimation cannot be done directly

• Approach: approximate inference
  – Variational inference
  – Collapsed Gibbs sampling, collapsed variational inference


Variational EM

• Introduce a variational distribution q to approximate the true posterior

• Use Jensen's inequality to get a tractable lower bound

• Maximize the lower bound with respect to q
  – Equivalently, minimize the KL divergence between q and the true posterior

• Maximize the lower bound with respect to the model parameters
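Written out, the bound and its KL form (q is the variational distribution over latent variables Z, Θ the parameters):

```latex
\ln p(X \mid \Theta)
  = \underbrace{\mathbb{E}_q\big[\ln p(X, Z \mid \Theta)\big]
                - \mathbb{E}_q\big[\ln q(Z)\big]}_{\text{tractable lower bound } \mathcal{L}(q,\Theta)}
  \;+\; \mathrm{KL}\big(q(Z) \,\Vert\, p(Z \mid X, \Theta)\big)
```

Since the left side does not depend on q, maximizing L(q, Θ) over q is exactly minimizing the KL term (the E-step); maximizing over Θ with q fixed is the M-step.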


Variational Distribution

• One variational distribution for each row, and one for each column

Collapsed Inference

• The latent distribution can be exactly marginalized over (π1, π2)
  – Obtain p(X, z1, z2 | α1, α2, Θ) in closed form
  – The analysis assumes discrete/categorical entries
  – Can be generalized to exponential family distributions

• Collapsed Gibbs sampling
  – The conditional distribution of (z1uv, z2uv) is available in closed form:
    P(z1uv = i, z2uv = j | X, z1−uv, z2−uv, α1, α2, Θ)
  – Sample states, and run the sampler till convergence

• Collapsed variational Bayes
  – Variational distribution q(z1, z2 | γ) = ∏u,v q(z1uv, z2uv | γuv)
  – Gaussian and Taylor approximations to obtain updates for γuv


Residual Bayesian Co-clustering (RBC)

• (z1, z2) determines the distribution
• Users/movies may have bias
• (m1, m2): row/column means
• (bm1, bm2): row/column bias


Results: Datasets

• Movielens: movie recommendation data
  – 100,000 ratings (1-5) for 1682 movies from 943 users (6.3% dense)
  – Binarized: 0 (ratings 1-3), 1 (ratings 4-5)
  – Variants: Discrete (original), Bernoulli (binary), Real (z-scored)

• Foodmart: transaction data
  – 164,558 sales records for 7803 customers and 1559 products (1.35% dense)
  – Binarized: 0 (below the median), 1 (above the median)
  – Variants: Poisson (original), Bernoulli (binary), Real (z-scored)

• Jester: joke rating data
  – 100,000 ratings (−10.00 to +10.00) for 100 jokes from 1000 users (100% dense)
  – Binarized: 0 (below 0), 1 (above 0)
  – Variants: Gaussian (original), Bernoulli (binary), Real (z-scored)


Perplexity Comparison with 10 Clusters

On binary data:
            Training set                      Test set
            MMNB     BCC      LDA             MMNB     BCC      LDA
Jester      1.7883   1.8186   98.3742         4.0237   2.5498   98.9964
Movielens   1.6994   1.9831   439.6361        3.9320   2.8620   1557.0032
Foodmart    1.8691   1.9545   1461.7463       6.4751   2.1143   6542.9920

On original data:
            Training set             Test set
            MMNB      BCC            MMNB      BCC
Jester      15.4620   18.2495        39.9395   24.8239
Movielens   3.1495    0.8068         38.2377   1.0265
Foodmart    4.5901    4.5938         4.6681    4.5964


Co-embedding: Users


Co-embedding: Movies


RBC vs. other co-clustering algorithms

Jester

• RBC and RBC-FF perform better than BCC
• RBC and RBC-FF are also the best among the other algorithms


RBC vs. other co-clustering algorithms

Foodmart

Movielens


RBC vs. SVD, NNMF, and CORR

Jester

• RBC and RBC-FF are competitive with the other algorithms


RBC vs. SVD, NNMF, and CORR

Movielens

Foodmart


SVD vs. Parallel RBC

Parallel RBC scales well to large matrices

Inference Methods: VB, CVB, Gibbs


Mixed Membership Stochastic Block Models

• Network data analysis
  – Relational view: rows and columns are the same entities
  – Examples: social networks, biological networks
  – Graph view: (binary) adjacency matrix

• Model: the mixed membership stochastic blockmodel (next slide)


MMB Graphical Model


Variational Inference

• Variational lower bound

• Fully factorized variational distribution

• Variational EM
  – E-step: update the variational parameters (γ, φ)
  – M-step: update the model parameters (α, B)


Results: Inferring Communities


[Figure: original friendship matrix vs. friendships inferred from the posterior, based on thresholding the estimated π_p^T B π_q]

Results: Protein Interaction Analysis


“Ground truth”: MIPS collection of protein interactions (yellow diamond)

Comparison with other models based on protein interactions and microarray expression analysis

Non-parametric Bayes


Dirichlet Process Mixtures

Chinese Restaurant Processes

Indian Buffet Processes

Pitman-Yor Processes

Gaussian Processes

Hierarchical Dirichlet Processes

Mondrian Processes

References: Graphical Models

• S. Russell & P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2009.

• D. Koller & N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.

• C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.

• D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2010.

• M. I. Jordan (Ed), Learning in Graphical Models, MIT Press, 1998.

• S. L. Lauritzen, Graphical Models, Oxford University Press, 1996.

• J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.


References: Inference

• F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, vol. 47, no. 2, 498–519, 2001.

• S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE Transactions on Information Theory, 46, 325–343, 2000.

• M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1-2, 1-305, December 2008.

• C. Andrieu, N. De Freitas, A. Doucet, M. I. Jordan, “An Introduction to MCMC for Machine Learning,” Machine Learning, 50, 5-43, 2003.

• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, 2005.


References: Mixed-Membership Models

• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. “Indexing by latent semantic analysis,” Journal of the Society for Information Science, 41(6):391–407, 1990.

• T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, 42(1):177–196, 2001.

• D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research (JMLR), 3:993–1022, 2003.

• T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, 101(Suppl 1): 5228–5235, 2004.

• Y. W. Teh, D. Newman, and M. Welling. “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, ” Neural Information Processing Systems (NIPS), 2007.

• A. Asuncion, P. Smyth, M. Welling, Y.W. Teh, “On Smoothing and Inference for Topic Models,” Uncertainty in Artificial Intelligence (UAI), 2009.

• H. Shan, A. Banerjee, and N. Oza, "Discriminative Mixed-membership Models," IEEE Conference on Data Mining (ICDM), 2009.


References: Matrix Factorization

• S. Funk, “Netflix update: Try this at home,” http://sifter.org/~simon/journal/20061211.html

• R. Salakhutdinov and A. Mnih. “Probabilistic matrix factorization,” Neural Information Processing Systems (NIPS), 2008.

• R. Salakhutdinov and A. Mnih. “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo,” International Conference on Machine Learning (ICML), 2008.

• I. Porteous, A. Asuncion, and M. Welling, “Bayesian matrix factorization with side information and Dirichlet process mixtures,” Conference on Artificial Intelligence (AAAI), 2010.

• I. Sutskever, R. Salakhutdinov, and J. Tenenbaum, "Modelling relational data using Bayesian clustered tensor factorization," Neural Information Processing Systems (NIPS), 2009.

• A. Singh and G. Gordon,  “A Bayesian matrix factorization model for relational data,” Uncertainty in Artificial Intelligence (UAI), 2010.


References: Co-clustering, Block Structures

• A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, D. Modha., “A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation,” Journal of Machine Learning Research (JMLR), 2007.

• M. M. Shafiei and E. E. Milios, “Latent Dirichlet Co-Clustering,” IEEE Conference on Data Mining (ICDM), 2006.

• H. Shan and A. Banerjee, “Bayesian co-clustering,” IEEE International Conference on Data Mining (ICDM), 2008.

• P. Wang, C. Domeniconi, and K. B. Laskey, “Latent Dirichlet Bayesian Co-Clustering,” European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2009.

• H. Shan and A. Banerjee, “Residual Bayesian Co-clustering for Matrix Approximation,” SIAM International Conference on Data Mining (SDM), 2010.

• T. A. B. Snijders and K. Nowicki, “Estimation and prediction for stochastic blockmodels for graphs with latent block structure,” Journal of Classification, 14:75–100, 1997.

• E.M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed-membership stochastic blockmodels,”  Journal of Machine Learning Research (JMLR),  9, 1981-2014, 2008.


Acknowledgements

Hanhuai Shan, Amrudin Agovic
