Non-parametric Bayesian Learning in Discrete Data

Page 1

Non-parametric Bayesian Learning in Discrete Data

Yueshen Xu, [email protected] / [email protected]

Middleware, CCNT, ZJU
5/10/2016

Statistics & Computational Linguistics


Page 2

Outline

Bayes' Rule

Parametric Bayesian Learning

Concept & Example

Discrete & Continuous Data

Text Clustering & Topic Modeling

Pros and Cons

Some Important Concepts

Non-parametric Bayesian Learning

Dirichlet Process and Process Construction

Dirichlet Process Mixture

Hierarchical Dirichlet Process

Chinese Restaurant Process

Example: Hierarchical Topic Modeling

Markov Chain Monte Carlo

Reference

Discussion

Page 3

Bayes' Rule

Posterior ∝ Prior × Likelihood


$$p(\text{Hypothesis} \mid \text{Data}) = \frac{p(\text{Data} \mid \text{Hypothesis}) \, p(\text{Hypothesis})}{p(\text{Data})}$$

The left side is the posterior; the numerator is the likelihood times the prior; the denominator is the evidence.

Update beliefs in hypotheses in response to data.

Parametric or non-parametric: is the structure of the hypothesis space constrained or not? (Examples later.)

Your confidence in the prior.
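As a concrete illustration (mine, not from the slides), a minimal Python sketch of this posterior update over a two-hypothesis space:

```python
import numpy as np

# Two hypotheses about a coin: fair vs. biased towards heads.
prior = np.array([0.5, 0.5])            # p(Hypothesis)
p_heads = np.array([0.5, 0.9])          # p(heads | Hypothesis)

data = [1, 1, 0, 1]                     # observed flips: 1 = heads
likelihood = np.array([
    np.prod([p if x == 1 else 1 - p for x in data]) for p in p_heads
])                                       # p(Data | Hypothesis)

posterior = likelihood * prior
posterior /= posterior.sum()             # divide by evidence p(Data)
print(posterior)                         # belief shifts towards "biased"
```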

Page 4

Parametric Bayesian Learning


$$p(\text{Hypothesis} \mid \text{Data}) \propto p(\text{Data} \mid \text{Hypothesis}) \, p(\text{Hypothesis})$$

Parametric or non-parametric hypothesis

Evidence is the fact: a constant, carrying no probability, so dropping it is a trick commonly used.

Non-parametric != no parameters

Hyper-parameters
• Parameters of distributions
• Parameter vs. variable

๐ท๐‘–๐‘Ÿ ๐œƒ ๐œถ =ฮ“(๐›ผ0)

ฮ“ ๐›ผ1 โ€ฆฮ“ ๐›ผ๐พ

๐‘˜=1

๐พ

๐œƒ๐‘˜๐›ผ๐‘˜โˆ’1

Variable

Hyper-parameter Parameter

p(ฮธ|X) โˆ p(X|ฮธ)p(ฮธ)

Page 5

Parametric Bayesian Learning

Some Examples


Clustering: K-Means/Medoid, NMF
Topic Modeling: LSA, pLSA, LDA
Hierarchical Concept Building

Page 6

Parametric Bayesian Learning

Serious Problems

How could we know
• the number of clusters?
• the number of topics?
• the number of layers?


Heuristic pre-processing?

Guessing and Tuning

Page 7

Parametric Bayesian Learning

Some basics

Discrete Data & Continuous Data

Discrete data: text, modeled as natural numbers

Continuous data: stock prices, trading volumes, signals, quality scores, ratings, modeled as real numbers


Some important concepts (Also used in non-parametric case)

Discrete distribution: $X_i \mid \theta \sim \mathrm{Discrete}(\theta)$

$$p(X \mid \theta) = \prod_{i=1}^{n} \mathrm{Discrete}(X_i; \theta) = \prod_{j=1}^{m} \theta_j^{N_j}$$

($N_j$ is the number of times outcome $j$ appears among the $n$ draws.)

Multinomial distribution: $N \mid n, \theta \sim \mathrm{Multi}(\theta, n)$

$$p(N \mid n, \theta) = \frac{n!}{\prod_{j=1}^{m} N_j!} \prod_{j=1}^{m} \theta_j^{N_j}$$

Computer scientists often mix them up.
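To make the distinction concrete, here is a small sketch (my example; it assumes NumPy and SciPy) showing that the two likelihoods differ exactly by the multinomial coefficient:

```python
import numpy as np
from scipy.stats import multinomial

theta = np.array([0.2, 0.5, 0.3])   # parameter of the discrete/multinomial
X = np.array([2, 1, 1, 2, 0])       # n = 5 draws from Discrete(theta)
N = np.bincount(X, minlength=3)     # counts per outcome: [1, 2, 2]

# Discrete (categorical) likelihood of the ordered sequence
p_seq = np.prod(theta[X])           # = prod_j theta_j^{N_j}

# Multinomial pmf of the unordered counts
p_counts = multinomial.pmf(N, n=X.size, p=theta)

# They differ exactly by the multinomial coefficient n! / prod_j N_j!
print(p_seq, p_counts)
```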

Page 8

Parametric Bayesian Learning

Some important concepts (cont.)

Dirichlet distribution: $\theta \mid \boldsymbol{\alpha} \sim \mathrm{Dir}(\boldsymbol{\alpha})$

$$\mathrm{Dir}(\theta \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

Conjugate prior: if the posterior $p(\theta \mid X)$ is in the same family as the prior $p(\theta)$, the prior is called a conjugate prior of the likelihood $p(X \mid \theta)$.

Examples

Binomial Distribution ←→ Beta Distribution
Multinomial Distribution ←→ Dirichlet Distribution


$$p(\theta \mid \boldsymbol{N}, \boldsymbol{\alpha}) = \mathrm{Dir}(\theta \mid \boldsymbol{N} + \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + N_1)\cdots\Gamma(\alpha_K + N_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1} \propto p(\theta \mid \boldsymbol{\alpha}) \, p(\boldsymbol{N} \mid \theta)$$

Why is it better for the prior and posterior to be conjugate distributions?

…
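A minimal sketch of this conjugate update (my illustration, not the slides'): the posterior is obtained by simply adding the observed counts to the hyper-parameters.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet hyper-parameters (prior)
N = np.array([5, 0, 2])             # observed counts per category

# Conjugacy: posterior is again Dirichlet, with updated hyper-parameters
alpha_post = alpha + N              # Dir(theta | alpha + N)

# Posterior mean of theta, usable as a smoothed estimate
theta_mean = alpha_post / alpha_post.sum()
print(theta_mean)                   # [0.6, 0.1, 0.3]
```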

Page 9

Parametric Bayesian Learning

Some important concepts (cont.)

Probabilistic Graphical Model

Modeling a Bayesian network using plates (repetition) and circles (random variables)


Generative model vs. discriminative model

Generative model: $p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta)$
Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, …: often unsupervised learning

Discriminative model: models $p(\theta \mid X)$ directly
LR, KNN, SVM, Boosting, Decision Tree: typically supervised learning

These also have graphical model representations.

Page 10

Non-parametric Bayesian Learning

When we talk about non-parametric, what do we usually talk about?

Discrete Data: Dirichlet Distribution, Dirichlet Process, Chinese Restaurant Process, Polya Urn, Pitman-Yor Process, Hierarchical Dirichlet Process, Dirichlet Process Mixture, Dirichlet Process Multinomial Model, Clustering, …

Continuous Data: Gaussian Distribution, Gaussian Process, Regression, Classification, Factorization, Gradient Descent, Covariance Matrix, Brownian Motion, …


The common thread: infinite (∞)

Page 11

Non-parametric Bayesian Learning

Dirichlet Process [Yee Whye Teh, et al.]

$G_0$: a probability measure/distribution (the base distribution); $\alpha_0$: a positive real number; $(A_1, A_2, \ldots, A_r)$: any finite partition of the space; $G$: a probability distribution. If, for every such partition,

$$(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r))$$

then $G \sim \mathrm{DP}(\alpha_0, G_0)$.


$G_0$: which exact distribution is $G_0$? We don't know.
$G$: which exact distribution is $G$? We don't know.

Page 12

Non-parametric Bayesian Learning

Where does the infinite come in? Construction of a DP. We need to construct a DP, since it does not exist in a ready-made form.

Constructions: stick-breaking, Polya urn scheme, Chinese restaurant process.


Stick-breaking construction

$$\beta_k \mid \alpha_0 \sim \mathrm{Beta}(1, \alpha_0), \qquad \phi_k \mid G_0 \sim G_0$$

$$\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\phi_k}$$

$(\beta_k)_{k=1}^{\infty}$ and $(\phi_k)_{k=1}^{\infty}$ are iid sequences; $\sum_{k=1}^{\infty} \pi_k = 1$, where $\pi_k$ is the weight of the point mass $\delta_{\phi_k}$ at $\phi_k$, so $(\pi_k)$ is a distribution over the positive integers.

Why DP? …
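As an illustration (assumptions mine: a Uniform(0, 1) base distribution $G_0$ and truncation at $K$ atoms), a truncated stick-breaking sampler:

```python
import numpy as np

def sample_dp(alpha0, K=1000, rng=np.random.default_rng(0)):
    """Draw a truncated stick-breaking approximation of G ~ DP(alpha0, G0).

    G0 is taken to be Uniform(0, 1) purely for illustration.
    Returns atom locations phi_k and weights pi_k (summing to ~1).
    """
    beta = rng.beta(1.0, alpha0, size=K)                      # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1 - beta)[:-1]))
    pi = beta * remaining                                     # pi_k = beta_k * prod(1 - beta_l)
    phi = rng.uniform(0, 1, size=K)                           # phi_k ~ G0
    return phi, pi

phi, pi = sample_dp(alpha0=5.0)
print(pi[:5], pi.sum())   # weights decay geometrically; sum approaches 1 as K grows
```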

Page 13

Non-parametric Bayesian Learning

Chinese Restaurant Process

A restaurant has an infinite number of tables, and customers (words, generated from $\theta_i$, one-to-one) enter the restaurant sequentially. The $i$-th customer ($\theta_i$) sits at a table ($\phi_k$) according to the probability:

$$p(\text{table } k) = \frac{n_k}{i - 1 + \alpha_0}, \qquad p(\text{new table}) = \frac{\alpha_0}{i - 1 + \alpha_0}$$

where $n_k$ is the number of customers already seated at table $k$.

$\phi_k$: clustering. Clustering underlies roughly 2/3 of unsupervised learning: clustering itself, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
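A short simulation of the CRP seating process (my sketch; `alpha0` and the customer count are arbitrary):

```python
import numpy as np

def crp(num_customers, alpha0, rng=np.random.default_rng(0)):
    """Simulate table assignments under a Chinese restaurant process."""
    counts = []                                    # n_k: customers per table
    assignments = []
    for i in range(1, num_customers + 1):
        # p(existing table k) = n_k / (i - 1 + alpha0); p(new) = alpha0 / (...)
        probs = np.array(counts + [alpha0]) / (i - 1 + alpha0)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                       # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

tables, sizes = crp(100, alpha0=2.0)
print(len(sizes), sizes)   # rich-get-richer: a few large tables dominate
```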

Page 14

Non-parametric Bayesian Learning

Dirichlet Process Mixture (DPM)

You can draw the graphical model yourself. A DP alone is not enough: we need similarity instead of cloning → mixture models.


Mixture models: an element is generated from a mixture/group of variables (usually latent variables): GMM, LDA, pLSA, …

DPM: $\theta_i \mid G \sim G$, $x_i \mid \theta_i \sim F(\theta_i)$. For text data, $F(\theta_i)$ is Discrete/Multinomial.

Intuitive, but not helpful by itself; again we use a construction:

$$\beta_k \mid \alpha_0 \sim \mathrm{Beta}(1, \alpha_0), \qquad \phi_k \mid G_0 \sim G_0, \qquad \pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\phi_k}$$

Page 15

Non-parametric Bayesian Learning

Dirichlet Process Mixture (DPM)


The finite counterpart: the Dirichlet Multinomial Mixture Model (DMMM).

What can the DMMM do? Given sparse bag-of-words vectors such as

(0, 0, 0, Caption, 0, 0, 0, 0, 0, 0, USA, 0, 0, 0, 0, 0, 0, 0, 0, 0, Action, 0, 0, 0, 0, 0, 0, 0, Hero, 0, 0, 0, 0, 0, 0, …)

it performs clustering.

Page 16

Non-parametric Bayesian Learning

Hierarchical Dirichlet Process (HDP)


Construction

HDP: $\theta_{ji} \mid G_j \sim G_j$, $x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})$, where each group-level measure $G_j \sim \mathrm{DP}(\alpha_0, G_0)$ shares a common base measure $G_0$, itself drawn from a DP.

A very natural model for the statistics folks, but for us computer folks… well…

When finite and with $F = \mathrm{Mult}$, it reduces to LDA: LDA is a hierarchical Dirichlet multinomial mixture model.

Page 17

Non-parametric Bayesian Learning

Hierarchical Topic Modeling

What we can get from reviews, blogs, question answers, twitter,

newsโ€ฆโ€ฆ? Only topics? Far not enough

What we really need is a hierarchy to illustrate what exactly the

text tells people, like


Page 18

Non-parametric Bayesian Learning

Hierarchical Topic Modeling

Prior: Nested CRP/DP (nCRP) [Blei and Jordan, NIPS, 04]

nCRP: In a restaurant, at the 1st level, there is one table, which is linked with an infinite number of tables at the 2nd level. Each table at the second level is likewise linked with an infinite number of tables at the 3rd level. This structure repeats…

The CRP is the prior for choosing a table at each level, which forms a path.


One document, one path (nested like a Matryoshka doll).

Page 19

Non-parametric Bayesian Learning

Hierarchical Topic Modeling

Generative Process

1. Let $c_1$ be the root restaurant (only one table).
2. For each level $l \in \{2, \ldots, L\}$: draw a table from restaurant $c_{l-1}$ using the CRP, and set $c_l$ to be the restaurant referred to by that table.
3. Draw an $L$-dimensional topic proportion vector $\theta \sim \mathrm{Dir}(\alpha)$.
4. For each word $w_n$: draw $z \in \{1, \ldots, L\} \sim \mathrm{Mult}(\theta)$, then draw $w_n$ from the topic associated with restaurant $c_z$.


[Plate diagram: per-document path variables $c_1, c_2, \ldots, c_L$ drawn with nCRP parameter $\gamma$ over $T$ trees; topic proportions $\theta_m$ with prior $\alpha$; per-word topic assignments $z_{m,n}$ and words $w_{m,n}$ ($n = 1 \ldots N$, $m = 1 \ldots M$); topics $\beta_k$.]

$L$ can be infinite, but it need not be.

Page 20

Non-parametric Bayesian Learning

What we can get


Page 21

Markov Chain Monte Carlo

Markov Chain

Initialization probability: $\pi_0 = \{\pi_0(1), \pi_0(2), \ldots, \pi_0(|S|)\}$

$$\pi_n = \pi_{n-1} P = \pi_{n-2} P^2 = \cdots = \pi_0 P^n \quad \text{(Chapman-Kolmogorov equation)}$$

Fundamental limit (ergodic) theorem: provided $P$ is irreducible (connected) and aperiodic,

$$\lim_{n\to\infty} P^n_{ij} = \pi(j), \qquad \pi(j) = \sum_{i=1}^{|S|} \pi(i) P_{ij}$$

$$\lim_{n\to\infty} P^n = \begin{pmatrix} \pi(1) & \cdots & \pi(|S|) \\ \vdots & & \vdots \\ \pi(1) & \cdots & \pi(|S|) \end{pmatrix}, \qquad \pi = \{\pi(1), \pi(2), \ldots, \pi(j), \ldots, \pi(|S|)\}$$


Stationary Distribution

๐‘‹0~๐œ‹0 ๐‘ฅ โˆ’โ†’ ๐‘‹1~๐œ‹1 ๐‘ฅ โˆ’โ†’ โ‹ฏโˆ’โ†’ ๐‘‹๐‘›~๐œ‹ ๐‘ฅ โˆ’โ†’ ๐‘‹๐‘›+1~๐œ‹ ๐‘ฅ โˆ’โ†’ ๐‘‹๐‘›+2~๐œ‹ ๐‘ฅ โˆ’โ†’

sampleConvergence

Stationary Distribution

Yueshen Xu

|)||(|...)2|(|)1|(|

)12(p...)22(p)12(p

|)|1(...)21()11(p

SSpSpSp

Spp

P

Xm

Xm+1
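A quick numeric check (example transition matrix mine) that powers of $P$ converge to a matrix of identical rows, each equal to $\pi$:

```python
import numpy as np

# A small irreducible, aperiodic transition matrix P (rows sum to 1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])

Pn = np.linalg.matrix_power(P, 50)
print(Pn)                          # all rows are (nearly) identical: each row is pi

pi0 = np.array([1.0, 0.0, 0.0])    # any initialization converges
print(pi0 @ Pn)                    # = pi, the stationary distribution
print((pi0 @ Pn) @ P)              # pi P = pi: it is a fixed point
```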

Page 22

Markov Chain Monte Carlo

Gibbs Sampling


Step 1: Initialize $X^{(0)} = x^{(0)} = \{x_i : i = 1, 2, \ldots, n\}$

Step 2: for $t = 0, 1, 2, \ldots$

1. $x_1^{(t+1)} \sim p(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_n^{(t)})$
2. $x_2^{(t+1)} \sim p(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_n^{(t)})$
3. …
4. $x_j^{(t+1)} \sim p(x_j \mid x_1^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_n^{(t)})$
5. …
6. $x_n^{(t+1)} \sim p(x_n \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{n-1}^{(t+1)})$

In short, each coordinate is resampled from its full conditional: $x_i \sim p(x \mid x_{-i})$.

[Figure: axis-aligned Gibbs moves between points A(x₁, y₁), B(x₁, y₂), C(x₂, y₁), D in the plane, changing one coordinate at a time.]

Metropolis-Hastings Sampling

If you want to understand Gibbs sampling for HDP/DPM/nCRP, you had better first understand Gibbs sampling for LDA and the DMMM.
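A toy Gibbs sampler (my example, not the slides'): for a standard bivariate Gaussian with correlation $\rho$, both full conditionals are Gaussian, $p(x \mid y) = \mathcal{N}(\rho y, 1 - \rho^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                     # correlation of the target bivariate Gaussian
sd = np.sqrt(1 - rho**2)      # std of each full conditional

x, y = 0.0, 0.0               # Step 1: initialize
samples = []
for t in range(10000):        # Step 2: sweep the full conditionals
    x = rng.normal(rho * y, sd)   # x ~ p(x | y) = N(rho * y, 1 - rho^2)
    y = rng.normal(rho * x, sd)   # y ~ p(y | x) = N(rho * x, 1 - rho^2)
    samples.append((x, y))

samples = np.array(samples[1000:])        # drop burn-in
print(np.corrcoef(samples.T)[0, 1])       # close to rho = 0.8
```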

Page 23

Reference

• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007.
• Yee Whye Teh, M. I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006.
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012.
• David Blei, et al. Latent Dirichlet Allocation. JMLR, 2003.
• David Blei, et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010.
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008.
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973.
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference.
• Rick Durrett. Probability: Theory and Examples, 2010.
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007.
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014.
• David P. Williams. Gaussian Processes, Duke University, 2006.


Page 24

Q&A
