L11: Uses of Bayesian Networks


Page 1: L11: Uses of Bayesian Networks

L11: Uses of Bayesian Networks

Nevin L. Zhang

Department of Computer Science & Engineering

The Hong Kong University of Science & Technology

http://www.cse.ust.hk/~lzhang/

Page 2: L11: Uses of Bayesian Networks

Page 2

Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project

Page 3: L11: Uses of Bayesian Networks

Page 3

Traditional Uses

Probabilistic expert systems: diagnosis and prediction

Example: a BN for diagnosing “blue baby” syndrome over the phone at a London hospital

Performance comparable to a specialist, and better than non-specialists

Page 4: L11: Uses of Bayesian Networks

Page 4

Traditional Uses

Language for describing probabilistic models in science and engineering

Example: BN for turbo codes

Page 5: L11: Uses of Bayesian Networks

Page 5

Traditional Uses

Language for describing probabilistic models in science and engineering

Example: BN from bioinformatics

Page 6: L11: Uses of Bayesian Networks

Page 6

Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project

Page 7: L11: Uses of Bayesian Networks

Page 7

BN for Structure Discovery

Given: Data set D on variables X1, X2, …, Xn

Discover dependence, independence, and even causal relationships among the variables.

Example: Evolution trees

Page 8: L11: Uses of Bayesian Networks

Page 8

Phylogenetic Trees

Assumption: all organisms on Earth have a common ancestor. This implies that any set of species is related.

Phylogeny: the relationship between any set of species.

Phylogenetic tree: usually the relationship can be represented by a tree, called a phylogenetic (evolution) tree, although this is not always true.

Page 9: L11: Uses of Bayesian Networks

Page 9

Phylogenetic Trees

Phylogenetic trees

[Figure: phylogenetic tree with current-day species (giant panda, lesser panda, moose, goshawk, vulture, duck, alligator) at the leaves; the vertical axis represents time, with current-day species at the bottom.]

Page 10: L11: Uses of Bayesian Networks

Page 10

Phylogenetic Trees

TAXA (sequences) identify species

Edge lengths represent evolution time

Assumption: bifurcating tree topology

[Figure: bifurcating tree with taxa sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT at the leaves and ancestral sequences AAGACTT, AGCACTT, AAGGCCT, AAGGCAT at internal nodes; the vertical axis represents time.]

Page 11: L11: Uses of Bayesian Networks

Page 11

Probabilistic Models of Evolution

Characterize the relationship between taxa using substitution probabilities. P(x | y, t): the probability that ancestral sequence y evolves into sequence x along an edge of length t.

The model is specified by P(X7), P(X5 | X7, t5), P(X6 | X7, t6), P(S1 | X5, t1), P(S2 | X5, t2), ….

[Figure: tree with leaves s1, s2, s3, s4, internal nodes x5, x6, root x7, and edge lengths t1, …, t6.]

Page 12: L11: Uses of Bayesian Networks

Page 12

Probabilistic Models of Evolution

What should P(x | y, t) be? Two assumptions of commonly used models:

There are only substitutions, no insertions/deletions (the sequences are aligned): one-to-one correspondence between sites in different sequences

Each site evolves independently and identically:

P(x | y, t) = ∏_{i=1 to m} P(x(i) | y(i), t), where m is the sequence length

[Figure: the phylogenetic tree with taxa and ancestral sequences from the previous slides.]

Page 13: L11: Uses of Bayesian Networks

Page 13

Probabilistic Models of Evolution

What should P(x(i) | y(i), t) be? The Jukes-Cantor (character evolution) model [1969]:

Rate of substitution a (constant or parameter?)

Multiplicativity (lack of memory)

Substitution matrix (rows: ancestral base y(i); columns: descendant base x(i)):

     A    C    G    T
A    rt   st   st   st
C    st   rt   st   st
G    st   st   rt   st
T    st   st   st   rt

rt = (1/4)(1 + 3e^(-4t)),  st = (1/4)(1 - e^(-4t))

Limit values when t = 0 or t = infinity?

Multiplicativity: P(c | a, t1 + t2) = Σ_b P(b | a, t1) P(c | b, t2)
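To make these modelling choices concrete, here is a minimal NumPy sketch (not from the lecture) of the Jukes-Cantor substitution matrix and the site-by-site product P(x | y, t). It assumes the unit-rate form of rt and st shown above, and the function names are illustrative only.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def jukes_cantor_matrix(t):
    """4x4 substitution matrix P(x(i) | y(i), t) under the Jukes-Cantor model.

    Diagonal entries are r_t = 1/4 (1 + 3 e^{-4t});
    off-diagonal entries are s_t = 1/4 (1 - e^{-4t}).
    """
    r = 0.25 * (1.0 + 3.0 * np.exp(-4.0 * t))
    s = 0.25 * (1.0 - np.exp(-4.0 * t))
    return np.full((4, 4), s) + np.eye(4) * (r - s)

def edge_probability(x, y, t):
    """P(x | y, t) for aligned sequences x, y: product over independent sites."""
    P = jukes_cantor_matrix(t)
    idx = {c: i for i, c in enumerate(NUCLEOTIDES)}
    prob = 1.0
    for xi, yi in zip(x, y):          # one factor per site
        prob *= P[idx[yi], idx[xi]]   # row = ancestral state y(i), column = x(i)
    return prob

# Example: probability that AAGACTT evolves into AGCACTT along an edge of length 0.1
print(edge_probability("AGCACTT", "AAGACTT", 0.1))

# Multiplicativity check: evolving for t1 then t2 equals evolving for t1 + t2
print(np.allclose(jukes_cantor_matrix(0.3),
                  jukes_cantor_matrix(0.1) @ jukes_cantor_matrix(0.2)))
```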

Page 14: L11: Uses of Bayesian Networks

Page 14

Tree Reconstruction

Given: collection of current-day taxa

Find: a tree, i.e., a tree topology T and edge lengths t

Maximum likelihood: find the tree that maximizes P(data | tree)

Data: the current-day taxa AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Page 15: L11: Uses of Bayesian Networks

Page 15

Tree Reconstruction

When restricted to one particular site, a phylogenetic tree is an LT model in which:

The structure is a binary tree and all variables share the same state space.

The conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual parameterization.

The model is the same for different sites.

[Figure: the phylogenetic tree with taxa and ancestral sequences from the earlier slides.]

Page 16: L11: Uses of Bayesian Networks

Page 16

Tree Reconstruction

Current-day Taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Samples for the LT model: one sample per site; the samples are i.i.d.

1st site: (A, T, T, A, A)

2nd site: (G, A, A, G, G)

3rd site: (G, G, G, C, C)

…

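A tiny illustration (my own, not the lecture's code) of how the aligned taxa are turned into one i.i.d. sample per site, matching the tuples listed above:

```python
# Turn aligned current-day taxa into i.i.d. samples for the LT model: one per site.
taxa = ["AGGGCAT", "TAGCCCA", "TAGACTT", "AGCACAA", "AGCGCTT"]

# Transpose the alignment: site i yields the tuple of observed states, one per taxon.
site_samples = list(zip(*taxa))

for i, sample in enumerate(site_samples[:3], start=1):
    print(f"site {i}: {sample}")
# site 1: ('A', 'T', 'T', 'A', 'A')
# site 2: ('G', 'A', 'A', 'G', 'G')
# site 3: ('G', 'G', 'G', 'C', 'C')
```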

Page 17: L11: Uses of Bayesian Networks

Page 17

Tree Reconstruction

Finding the ML phylogenetic tree == finding the ML LT model

Model space:

Model structures: binary trees where all variables share the same state space, which is known.

Parameterization: one parameter (the edge length) for each edge. (In general, P(x|y) has (|x| - 1)|y| free parameters.)

The objective is to find relationships among the variables.

Can new LTM algorithms be applied to phylogenetic tree reconstruction?

Page 18: L11: Uses of Bayesian Networks

Page 18

Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project

Page 19: L11: Uses of Bayesian Networks

Page 19

BN for Density Estimation

Given: Data set D on variables X1, X2, …, Xn

Estimate: P(X1, X2, …, Xn) under some constraints


Uses of the estimate: Inference

Classification

Page 20: L11: Uses of Bayesian Networks

Page 20

BN Methods for Density Estimation

Chow-Liu tree with X1, X2, …, Xn as nodes:

Easy to compute

Easy to use

Might not be a good estimate of the “true” distribution

General BN with X1, X2, …, Xn as nodes:

Can be a good estimate of the “true” distribution

Might be difficult to find

Might be complex to use
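A minimal sketch of the Chow-Liu construction mentioned above: estimate pairwise mutual information from data and take a maximum-weight spanning tree. This assumes discrete data in a NumPy array and uses networkx for the spanning tree; the names and toy data are illustrative only.

```python
import itertools
import numpy as np
import networkx as nx   # used only for the maximum-weight spanning tree

def mutual_information(col_x, col_y):
    """Empirical mutual information I(X; Y) between two discrete data columns."""
    mi = 0.0
    for x in np.unique(col_x):
        for y in np.unique(col_y):
            p_xy = np.mean((col_x == x) & (col_y == y))
            p_x, p_y = np.mean(col_x == x), np.mean(col_y == y)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def chow_liu_tree(data):
    """Edges of a Chow-Liu tree for data with one variable per column."""
    g = nx.Graph()
    for i, j in itertools.combinations(range(data.shape[1]), 2):
        g.add_edge(i, j, weight=mutual_information(data[:, i], data[:, j]))
    return list(nx.maximum_spanning_tree(g).edges())

# Toy usage: 4 binary variables, where column 1 copies column 0
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 100)
data = np.column_stack([x0, x0, rng.integers(0, 2, 100), rng.integers(0, 2, 100)])
print(chow_liu_tree(data))   # the (0, 1) edge should be in the tree
```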

Page 21: L11: Uses of Bayesian Networks

Page 21

BN Methods for Density Estimation

LC model with X1, X2, …, Xn as manifest variables (Lowd and Domingos 2005):

Determine the cardinality of the latent variable using hold-out validation

Optimize the parameters using EM

Easy to compute

Can be a good estimate of the “true” distribution

Might be complex to use (the cardinality of the latent variable might be very large)
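For concreteness, here is a compact EM sketch for an LC model over discrete manifest variables; the hold-out search over cardinalities is omitted, all variables are assumed to share the same number of states, and the function name and smoothing constant are my own choices, not the authors' implementation.

```python
import numpy as np

def lc_model_em(data, n_classes, n_states, n_iter=100, seed=0):
    """EM for a latent class model: latent Y with n_classes states, manifest
    variables X1..Xn (each with n_states values) mutually independent given Y.

    data: (n_samples, n_vars) integer array with values in {0, ..., n_states - 1}.
    Returns (pi, theta) with pi[k] = P(Y=k) and theta[k, i, v] = P(Xi=v | Y=k).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_vars = data.shape
    pi = np.full(n_classes, 1.0 / n_classes)
    theta = rng.dirichlet(np.ones(n_states), size=(n_classes, n_vars))

    for _ in range(n_iter):
        # E-step: responsibilities r[d, k] proportional to pi[k] * prod_i P(x_di | Y=k)
        log_r = np.empty((n_samples, n_classes))
        for k in range(n_classes):
            # theta[k, np.arange(n_vars), data] picks P(Xi = x_di | Y=k) entrywise
            log_r[:, k] = np.log(pi[k]) + np.log(
                theta[k, np.arange(n_vars), data]).sum(axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate P(Y) and P(Xi | Y) from expected counts
        pi = r.mean(axis=0)
        for v in range(n_states):
            theta[:, :, v] = r.T @ (data == v)
        theta += 1e-6                              # avoid zero probabilities
        theta /= theta.sum(axis=2, keepdims=True)
    return pi, theta

# Toy usage: two clusters over 5 binary manifest variables
rng = np.random.default_rng(1)
cluster = rng.integers(0, 2, 500)
data = (rng.random((500, 5)) < np.where(cluster[:, None] == 0, 0.2, 0.8)).astype(int)
pi, theta = lc_model_em(data, n_classes=2, n_states=2)
print(pi)
```

Hold-out validation as described above would simply run this for several values of n_classes and keep the one with the highest held-out log-likelihood.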

Page 22: L11: Uses of Bayesian Networks

Page 22

BN Methods for Density Estimation

LT models for density estimation (Pearl 1988): as models over the manifest variables, LTMs

Are computationally very simple to work with

Can represent complex relationships among the manifest variables

Page 23: L11: Uses of Bayesian Networks

Page 23

BN Methods for Density Estimation

New approximate inference algorithm for Bayesian networks (Wang, Zhang and Chen, AAAI 08; JAIR 32:879-900, 2008)

[Figure: “Sample” and “Learn” pipeline, with sparse and dense example networks.]

Page 24: L11: Uses of Bayesian Networks

Page 24

Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project

Page 25: L11: Uses of Bayesian Networks

Page 25

Bayesian Networks for Classification

The problem: given data such as the table below, find a mapping (A1, A2, …, An) |-> C

Possible solutions:

ANN

Decision trees (Quinlan)

(SVM, for continuous data)

A1 A2 … An C

0 1 … 0 T

1 0 … 1 F

.. .. .. .. ..

Page 26: L11: Uses of Bayesian Networks

Page 26

Bayesian Networks for Classification

Naïve Bayes model: from data, learn P(C) and P(Ai | C)

Classification: arg max_c P(C=c | A1=a1, …, An=an)

Very good in practice
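A small sketch of Naïve Bayes learning and the arg max classification rule above, assuming discrete attributes encoded as integers; the Laplace smoothing and all names are my additions, not the lecture's code.

```python
import numpy as np

def naive_bayes_fit(A, C, n_values, n_classes, alpha=1.0):
    """Estimate P(C) and P(Ai | C) from discrete data with Laplace smoothing.

    A: (n_samples, n_attrs) integer attribute matrix; C: (n_samples,) class labels.
    """
    n_attrs = A.shape[1]
    p_c = np.array([(C == c).sum() + alpha for c in range(n_classes)], dtype=float)
    p_c /= p_c.sum()
    # p_a_given_c[i][c, v] = P(Ai = v | C = c)
    p_a_given_c = []
    for i in range(n_attrs):
        counts = np.full((n_classes, n_values), alpha)
        for c in range(n_classes):
            vals, cnt = np.unique(A[C == c, i], return_counts=True)
            counts[c, vals] += cnt
        p_a_given_c.append(counts / counts.sum(axis=1, keepdims=True))
    return p_c, p_a_given_c

def naive_bayes_predict(a, p_c, p_a_given_c):
    """arg max_c P(C=c | A1=a1, ..., An=an) using P(C=c) * prod_i P(Ai=ai | C=c)."""
    log_post = np.log(p_c).copy()
    for i, ai in enumerate(a):
        log_post += np.log(p_a_given_c[i][:, ai])
    return int(np.argmax(log_post))

# Example: two binary attributes, binary class
A = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
C = np.array([1, 0, 0, 1])
p_c, p_a = naive_bayes_fit(A, C, n_values=2, n_classes=2)
print(naive_bayes_predict([0, 1], p_c, p_a))
```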

Page 27: L11: Uses of Bayesian Networks

Page 27

Bayesian Networks for Classification

Drawback of NB: attributes are assumed mutually independent given the class variable

Often violated, leading to double counting

Fixes:

General BN classifiers

Tree augmented Naïve Bayes (TAN) models

Hierarchical NB

Bayes rule + density estimation

Page 28: L11: Uses of Bayesian Networks

Page 28

Bayesian Networks for Classification

General BN classifier: treat the class variable just as another variable

Learn a BN.

Classify a new instance based on the values of the variables in the Markov blanket of the class variable.

Pretty bad: because of the Markov boundary, it does not utilize all the available information.

Page 29: L11: Uses of Bayesian Networks

Page 29

Bayesian Networks for Classification

TAN models: Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29:131-163.

Capture dependence among attributes using a tree structure.

During learning:

First learn a tree among the attributes using the Chow-Liu algorithm

Then add the class variable and estimate the parameters

Classification: arg max_c P(C=c | A1=a1, …, An=an)
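A sketch of the tree-learning step: TAN builds the attribute tree from class-conditional mutual information I(Ai; Aj | C) rather than plain mutual information. This reuses the spanning-tree idea from the earlier Chow-Liu sketch; the code is illustrative, not the authors' implementation.

```python
import itertools
import numpy as np
import networkx as nx

def conditional_mutual_information(a_i, a_j, c):
    """Empirical I(Ai; Aj | C) for discrete columns a_i, a_j and class labels c."""
    cmi = 0.0
    for cv in np.unique(c):
        mask = (c == cv)
        p_c = mask.mean()
        for x in np.unique(a_i):
            for y in np.unique(a_j):
                p_xyc = np.mean(mask & (a_i == x) & (a_j == y))
                p_xc = np.mean(mask & (a_i == x))
                p_yc = np.mean(mask & (a_j == y))
                if p_xyc > 0:
                    cmi += p_xyc * np.log(p_c * p_xyc / (p_xc * p_yc))
    return cmi

def tan_attribute_tree(A, C):
    """Maximum-weight spanning tree over attributes, weighted by I(Ai; Aj | C)."""
    g = nx.Graph()
    for i, j in itertools.combinations(range(A.shape[1]), 2):
        g.add_edge(i, j, weight=conditional_mutual_information(A[:, i], A[:, j], C))
    return list(nx.maximum_spanning_tree(g).edges())
```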

Page 30: L11: Uses of Bayesian Networks

Page 30

Bayesian Networks for Classification

Hierarchical Naïve Bayes models: N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear.

Capture dependence among attributes using latent variables

Detect interesting latent structures besides classification

Algorithm follows the steps of DHC.

Page 31: L11: Uses of Bayesian Networks

Page 31

Bayesian Networks for Classification

Bayes rule + density estimation:

Chow-Liu tree

LC model

LT model

Wang Yi: Bayes rule + LT model is far superior

Page 32: L11: Uses of Bayesian Networks

Page 32

Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project

Page 33: L11: Uses of Bayesian Networks

Page 33

BN for Clustering

Latent class (LC) model: one latent variable Y and a set of manifest variables X1, …, Xn

Conditional independence assumption: the Xi's are mutually independent given Y (also known as the local independence assumption)

Used for cluster analysis of categorical data:

Determine the cardinality of Y: the number of clusters

Determine P(Xi | Y): the characteristics of the clusters

Page 34: L11: Uses of Bayesian Networks

Page 34

BN for Clustering

Clustering Criteria

Distance-based clustering: minimize intra-cluster variation and/or maximize inter-cluster variation

LC model-based clustering: the criterion follows from the conditional independence assumption

Divide the data into clusters such that, in each cluster, the manifest variables are mutually independent under the empirical distribution.

Page 35: L11: Uses of Bayesian Networks

Page 35

BN for Clustering

Local independence assumption often not true

LT models generalize LC models:

Relax the independence assumption

Each latent variable gives a way to partition the data: multidimensional clustering

Page 36: L11: Uses of Bayesian Networks

Page 36

ICAC Data

// 31 variables, 1200 samples

C_City: s0 s1 s2 s3 // very common, quite common, uncommon, ..

C_Gov: s0 s1 s2 s3

C_Bus: s0 s1 s2 s3

Tolerance_C_Gov: s0 s1 s2 s3 //totally intolerable, intolerable, tolerable,...

Tolerance_C_Bus: s0 s1 s2 s3

WillingReport_C: s0 s1 s2 // yes, no, depends

LeaveContactInfo: s0 s1 // yes, no

I_EncourageReport: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...

I_Effectiveness: s0 s1 s2 s3 s4 // very effective, effective, average, ineffective, very ineffective

I_Deterrence: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...

…..

-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0

-1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0

-1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0

….

Page 37: L11: Uses of Bayesian Networks

Page 37

Latent Structure Discovery

Y2: Demographic info; Y3: Tolerance toward corruption

Y4: ICAC performance; Y7: ICAC accountability

Y5: Change in level of corruption; Y6: Level of corruption

(Zhang, Poon, Wang and Chen 2008)

Page 38: L11: Uses of Bayesian Networks

Page 38

Interpreting Partition

Y2 partitions the population into 4 clusters

What is the partition about? What is the “criterion”?

On which manifest variables do the clusters differ the most?

Mutual information: the larger I(Y2; X), the more the 4 clusters differ on X
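A sketch of how such an information curve can be computed from a learned model: I(Y2; X) from P(Y2) and P(X | Y2), used to rank the manifest variables. The numbers and variable names below are illustrative, not the ICAC model's actual parameters.

```python
import numpy as np

def mutual_information(p_y, p_x_given_y):
    """I(Y; X) from P(Y) and P(X | Y) tables of a learned latent tree model.

    p_y: (K,) array; p_x_given_y: (K, V) array with rows P(X | Y=k).
    """
    p_xy = p_y[:, None] * p_x_given_y          # joint P(Y=k, X=v)
    p_x = p_xy.sum(axis=0)                     # marginal P(X=v)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_y[:, None] * p_x)[nz])))

# Rank manifest variables by how much the Y2 clusters differ on them
# (toy numbers only).
p_y2 = np.array([0.3, 0.3, 0.2, 0.2])
tables = {
    "Income": np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6],
                        [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]),
    "Sex":    np.array([[0.5, 0.5]] * 4),      # identical rows -> I(Y2; X) = 0
}
for name, table in sorted(tables.items(),
                          key=lambda kv: -mutual_information(p_y2, kv[1])):
    print(name, round(mutual_information(p_y2, table), 3))
```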

Page 39: L11: Uses of Bayesian Networks

Page 39

Interpreting Partition

Information curves: Partition of Y2 is based on Income, Age, Education, Sex

Interpretation:

Y2 represents a partition of the population based on demographic information

Y3 represents a partition based on tolerance toward corruption

Page 40: L11: Uses of Bayesian Networks

Page 40

Interpreting Clusters

Y2=s0: Low income youngsters; Y2=s1: Women with no/low income

Y2=s2: people with good education and good income;

Y2=s3: people with poor education and average income

Page 41: L11: Uses of Bayesian Networks

Page 41

Interpreting Clustering

Y3=s0: people who find corruption totally intolerable; 57%

Y3=s1: people who find corruption intolerable; 27%

Y3=s2: people who find corruption tolerable; 15%

Interesting finding:

Y3=s2: 29+19=48% find C-Gov totally intolerable or intolerable; 5% for C-Bus

Y3=s1: 54% find C-Gov totally intolerable; 2% for C-Bus

Y3=s0: Same attitude toward C-Gov and C-Bus

People who are tough on corruption are equally tough toward C-Gov and C-Bus.

People who are relaxed about corruption are more relaxed toward C-Bus than toward C-Gov.

Page 42: L11: Uses of Bayesian Networks

Page 42

Relationship Between Dimensions

Interesting finding: relationship between background and tolerance toward corruption

Y2=s2: (good education and good income) the least tolerant. 4% tolerable

Y2=s3: (poor education and average income) the most tolerant. 32% tolerable

The other two classes are in between.

Page 43: L11: Uses of Bayesian Networks

Page 43

Result of LCA

Partition not meaningful

Reason: Local Independence not true

Another way to look at it: LCA assumes that all the manifest variables jointly define a meaningful way to cluster the data

Obviously not true for the ICAC data

Instead, one should look for subsets of variables that do define meaningful partitions and perform cluster analysis on them

This is what we do with LTA

Page 44: L11: Uses of Bayesian Networks

Page 44

Finite Mixture Models

Y: discrete latent variable; Xi: continuous

P(X1, X2, …, Xn | Y): usually multivariate Gaussian; no independence assumption

Assume the states of Y are 1, 2, …, k

P(X1, X2, …, Xn) = Σ_{i=1 to k} P(Y=i) P(X1, X2, …, Xn | Y=i):

a mixture of k Gaussian components
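A minimal sketch of evaluating such a mixture density, assuming the component weights, means, and covariances are given; it uses scipy.stats.multivariate_normal, and the toy numbers are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covariances):
    """Density of a finite Gaussian mixture at point x:
    P(x) = sum_i P(Y=i) * N(x; mean_i, cov_i)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covariances))

# Toy 2-component mixture over (X1, X2)
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]
print(mixture_density(np.array([1.0, 1.0]), weights, means, covs))
```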

Page 45: L11: Uses of Bayesian Networks

Page 45

Finite Mixture Models

Used to cluster continuous data

Learning: determine

k: the number of clusters

P(Y)

P(X1, …, Xn | Y)

Also assumes that all attributes define one coherent partition, which is not realistic

LT models are a natural framework for clustering high-dimensional data

Page 46: L11: Uses of Bayesian Networks

Page 46

Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project

Page 47: L11: Uses of Bayesian Networks

Page 47

Observations on How the Human Brain Thinks

Human beings often invoke latent variables to explain regularities that

we observe.

Example 1

Observed regularity: beer and diapers are often bought together in the early evening

Hypothesize a (latent) cause: there must be a common (latent) cause

Identify the cause and explain the regularity: shopping by fathers of babies on the way home from work

Based on our understanding of the world

Page 48: L11: Uses of Bayesian Networks

Page 48

Observations on How the Human Brain Thinks

Example 2

Background: at night, watch the lighting through the windows of apartments in big buildings

Observed regularity: the lighting from several apartments changed in brightness and color at the same times and in perfect synchrony

Hypothesize a common (latent) cause: there must be a common (latent) cause

Identify the cause and explain the phenomenon: people watching the same TV channel

Based on our understanding of the world

Page 49: L11: Uses of Bayesian Networks

Page 49

Back to Ancient Time

Observed regularity: several symptoms often occur together

‘intolerance to cold’, ‘cold limbs’, and ‘cold lumbus and back’

Hypothesize a common latent cause: there must be a common latent cause

Identify the cause: the answer is based on the understanding of the world at that time, which was primitive

Conclusion: Yang deficiency (阳虚). Explanation: Yang is like the sun; it warms your body. If you don’t have enough of it, you feel cold.

Page 50: L11: Uses of Bayesian Networks

Page 50

Back to Ancient Time

Observed regularity: several symptoms often occur together

Tidal fever (潮热), heat sensation in palms and feet (手足心热), palpitation (心慌心跳), thready and rapid pulse (脉细数)

Hypothesize a common latent cause: there must be a common latent cause

Identify the cause and explain the regularity: Yin deficiency causing internal heat (阴虚内热)

Yin and Yang should be in balance. If Yin is deficient, Yang will be relatively in excess, and hence causes heat.

Page 51: L11: Uses of Bayesian Networks

Page 51

Traditional Chinese Medicine (TCM)

Claim: TCM theories = statistical regularities + subjective interpretations

How can we justify the claim?

Page 52: L11: Uses of Bayesian Networks

Page 52

A Case Study

We collected a data set about kidney deficiency (肾虚)

35 symptom variables, 2600 records

Page 53: L11: Uses of Bayesian Networks

Page 53

Result of Data Analysis

Y0-Y34: manifest variables from the data

X0-X13: latent variables introduced by the data analysis

The structure is interesting and supports TCM’s theories about various symptoms.

Page 54: L11: Uses of Bayesian Networks

Page 54

Other TCM Data Sets

From Beijing University of TCM, 973 project: depression, hepatitis B, chronic renal failure, COPD, menopause

From the China Academy of TCM: subhealth, diabetes

In all cases, the results of LT analysis match the relevant TCM theories.

Page 55: L11: Uses of Bayesian Networks

Page 55

Result on the Depression Data

Page 56: L11: Uses of Bayesian Networks

Page 56

Significance

Conclusion: TCM theories = statistical regularities + subjective interpretations

Significance:

TCM theories are partially based on objective facts

Boosts user confidence

Can help to lay a modern statistical foundation for TCM:

Systematically identify statistical regularities about the occurrence of symptoms and find natural partitions

Establish objective and quantitative diagnosis standards

Assist in double-blind experiments to evaluate and improve the efficacy of TCM treatments

Page 57: L11: Uses of Bayesian Networks

Page 57

Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project