Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical...

35
Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling

Transcript of Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical...

Page 1: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Latent Dirichlet Allocation (LDA)

Also Known As

Topic Modeling

Page 2: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

The Domain: Natural Language TextCollection of documents

Each document consists of a set of word tokens drawn (with replacement) from a set of word types

e.g., “The big dog ate the small dog.”

Goalsconstruct probabilistic generative model of domain

produces observed documents with high probability

obtain a compact representation of each document

unsupervised learning

Page 3: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Two Contrasting Approaches To ModelingEnvironments Of Words And Text

Latent Semantic Analysis (LSA)• mathematical model

• a bit hacky

Topic Model (LDA)• probabilistic model

• principled -> has produced many extensions and embellishments

Page 4: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

LSAThe set up

D documentsW distinct wordsF = WxD coocurrence matrixfwd = frequency of word w in document d

Page 5: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

LSA: Transforming The Co-occurence MatrixRelative entropy of a word across documents

fwd/fw: P(d|w)

Hw = value in [0, 1]0=word appears in only 1 doc1=word spread across all documents

Specificity: (1-Hw)0 = word tells you nothing about the document;1= word tells you a lot about the document

Page 6: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

LSA: Transforming The Co-occurence MatrixG = WxD normalized coocurrence matrix

log transform common for word freq analysis

+1 ensures no log(0)

weighted by specificity

Representation of word i: row i of Gproblem: high dimensional representation

problem: doesn’t capture similarity structure of documents

Page 7: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

LSA: Representing A WordDimensionality reduction via SVD

G = M1 M2 M3

[WxD] = [WxR] [RxR] [RxD]

if R = min(W,D) reconstruction is perfect

if R < min(W,D) least squares reconstruction, i.e., capture whatever structure there is in matrix with a reduced number of parameters

Reduced representation of word i: row i of (M1M2)

Reduced representation of document j: column j of (M2M3)

Advantages of a reduced representationCompactness

Hopefully captures statistical regularities and discards noise

Page 8: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

LSA Versus Topic ModelLSA representation vectors have elements (features) that

• can be negative

• are completely unconstrained

If we wish to operate in a currency of probability, then the elements

• must be nonnegative

• must sum to 1

Terminology• LSA = LSI = latent semantic indexing

• pLSI = probabilistic latent semantic indexing

• LDAtopic model

Page 9: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

pLSI (Hoffman, 1999)Probabilistic model of language production

Generative modelSelect a document with probability P(D)

Select a (latent) topic with probability P(Z|D)

Generate a word with probability P(W|Z)

Produce pair <di, wi> on draw i

P(D, W, Z) = P(D) P(Z|D) P(W|Z)

P(D, W) = Σz P(D) P(z|D) P(W|z)

P(W | D) = Σz P(z|D) P(W|z)

D

Z

W

Page 10: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Inferring Latent VariableP(Z|D,W)

P(D, W, Z) = P(D) P(Z|D) P(W|Z)

P(D, W) = Σz P(D) P(z|D) P(W|z)

P(Z|D,W) = P(D, W, Z) / P(D, W)

= P(Z|D) P(W|Z) / [Σz P(z|D) P(W|z)]

D

Z

W

Page 11: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Plate NotationWay of representing

• multiple documentsN total

• multiple words per documentLi words in document i

D

Z

WN

Li

Page 12: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Plate NotationWay of representing

• multiple documentsN total

• multiple words per documentLi words in document i

D

Z

WN

Li

D

Z

W

Z

W

Z

W

Z

W

D

Z

W

Z

W

Z

W

Z

W

Page 13: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Translating NotationBarber Typical Topic Modeling

Notation

total # documents N Ntotal # topics K T

total # word types D (dictionary) Windex over documents n

i: index over document-word pairs

{wi, di}

index over words in document w

index over words in dictionary i

topic assignment zi: topic of word-document pair i

distribution over topics { } { }

distribution over words { } { }

index over topics k j

zwn

πkn θj

di

θik φwi

j

Page 14: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Two Approaches To Learning Conditional Probabilities

P(Z=j | D=di) or

P(W=wi | Z=j) or

Hoffmann (1999)Search for the single best θ and φ via gradient descent in cross entropy (difference between distribution) of data and model

– Σw,d n(d,w) log P(d,w)

Griffiths & Steyvers (2002, 2005); Blei, Ng, & Jordan (2003)Hierarchical Bayesian inference: Treat θ and φ as random variables

θjdi

φwij

Page 15: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Treating θθθθ And φφφφ As Random Variables

Can marginalize over uncertainty, i.e.,

Model

P Z D( ) P Z D θ,( )P θ( )θ∫=

P W Z( ) P W Z φ,( )P φ( )φ∫=

D1

Z1

W1

D2

Z2

W2

D3

Z3

W3

θ

φ

Page 16: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Treating θθθθ And φφφφ As Random VariablesThe two conditional distributions are defined over discrete alternatives.

P(Z=j | D=di) or

P(W=wi | Z=j) or

If n alternatives, distribution can be represented by categorical RV with n–1 degrees of freedom.

To represent θθθθ and φφφφ as random variables, need to encode a distribution over distributions...

θjdi

φwij

Page 17: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Dirichlet Distribution• generalization of beta distribution from 2 alternatives to n

alternatives

• probability distribution over categorical distributions

• for categorical RV with n alternatives, Dirichlet has n parameters,

Each parameter can be thought of as a count of the number of occurrences.

Why n and not n–1 since there are n–1 degrees of freedom?

αααα α1 α2 … αn, , ,{ }=

p X αααα( ) 1B αααα( )------------- xi

αi 1–

i 1=

n

∏=

Page 18: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Visualizing Dirichlet DistributionYou can think of the uncertainty space over n probabilities constrained such that P(x) = 0 if (ΣΣΣΣi xi) != 1 or if xi < 0...

...or the representational space over n–1 probabilities constrained such that P(x)=0 if (ΣΣΣΣi xi) > 1 or if xi < 0.

1

1

Page 19: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Dirichlet Distribution (n=3)

1.00 1.00 1.00

p2

p1

0 0.5 10

0.5

1

2.00 1.00 1.00

p2

p1

0 0.5 10

0.5

1

1.00 5.00 1.00

p2

p1

0 0.5 10

0.5

1

1.00 1.00 10.00

p2

p1

0 0.5 10

0.5

1

2.00 3.00 5.00

p2

p1

0 0.5 10

0.5

1

20.00 30.00 50.00

p2

p1

0 0.5 10

0.5

1

5.00 4.00 1.00

p2

p1

0 0.5 10

0.5

1

0.90 1.00 0.80

p2

p1

0 0.5 10

0.5

1

0.50 0.50 0.50

p2

p1

0 0.5 10

0.5

1

Page 20: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Dirichlet Distribution (n=3)

Page 21: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Dirichlet Is Conjugate Prior Of Categorical and Multinomial Distributions

Simple exampleφ φ φ φ ~ Dirichlet(1, 3, 4)

O = {w1, w1, w2, w3, w2, w1}

φφφφ | O ~ Dirichlet(4, 5, 5)

Weak assumption about priorφφφφ ~ Dirichlet(β, β, ββ, β, ββ, β, ββ, β, β)

Page 22: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Full Model

D

Z

WN

Li

φφφφ ββββK

θθθθααααN

Page 23: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Collapsed Gibbs Sampling Approach

1. Define Dirichlet priors on and

2. Perform sampling over latent variables Z, integrating out or collapsing over θθθθ and φφφφ

This can be done analytically due to Dirichlet-Categorical relationship

Note: no explicit representation of posterior P(θθθθ, φ| φ| φ| φ| Z, D, W)

D1

Z1

W1

D2

Z2

W2

D3

Z3

W3

θ

φ

θdi φj

P Zi Z i– D W, ,( ) P W Z φ,( )P Z D θ,( )P φ( )P θ( ) φd θdθ φ,∫∼

Page 24: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Collapsed Gibbs Sampling

Ignore α and β for the momentFirst term: proportion of topic j draws in which wi picked

Second term: proportion of words in document di assigned to topic j

This formula integrates out the Dirichlet uncertainty over the categorical probabilities!

What are α and β?Effectively, they function as smoothing parameters

Small -> bias toward documents containing just a few topics (and topics containing a few words)

i: index overword-doc pairs

Page 25: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Detailed Procedure For Sampling From P(Z|D,W)1. Randomly assign each <di, wi> pair a zi value.

2. For each i, resample according to equation on previous slide (one iteration)

3. Repeat for a burn in of, say, 1000 iterations

4. Use current assignment as a sample and estimateP(Z|D)

P(W|Z)

Typically with Gibbs sampling, the results of multiple chains (restarts) are used. Why wouldn’t that work here?

Page 26: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Results

Arts Budgets Children Education

new million children schoolfilm tax women studentsshow program people schoolsmusic budget child educationmovie billion years teachersplay federal families highmusical year work publicbest spending parents teacheractor new says bennettfirst state family manigatyork plan welfare namphyopera money men statetheater programs percent presidentactress government care elementarylove congress life haiti

(a)

The William Randolph Hearst Foundation will give $ 1.25 million to Lin-coln Center, Metropolitan Opera Co., New York Philharmonic and JuilliardSchool. Our board felt that we had a real opportunity to make a mark onthe future of the performing arts with these grants an act every bit asimportant as our traditional areas of support in health, medical research,education and the social services, Hearst Foundation President RandolphA. Hearst said Monday in announcing the grants. Lincoln Centers sharewill be $200,000 for its new building, which will house young artists andprovide new public facilities. The Metropolitan Opera Co. and New YorkPhilharmonic will receive $400,000 each. The Juilliard School, where musicand the performing arts are taught, will get $250,000. The Hearst Foun-dation, a leading supporter of the Lincoln Center Consolidated CorporateFund, will make its usual annual $100,000 donation, too.

(b)

Page 27: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Results

Page 28: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Results

(This and following slides from David Blei tutorial)

Page 29: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Results

1880 1900 1920 1940 1960 1980 2000

o o o o o oooo

o

o

o

o

o

oooo o o o o

o o oo o o

o o oo o

oooo o

oo

o

o

o

o

o

o

o

o

o o

o oo o

oo

o

oo o

o

o

o

o

o

o

o

o

o

oo o

oo o

1880 1900 1920 1940 1960 1980 2000

o o o

o

oo

oo

o

oo o

o

o

oo o o o o

oo o

o o

o oooo

o

o

oo

ooo

o

o

o

o

o

ooooooo o

o o o o o o o oo o o

oooo

o

o

o

o

o o o o oo

RELATIVITY

LASER

FORCE

NERVE

OXYGEN

NEURON

"Theoretical Physics" "Neuroscience"

How might these graphs have been obtained?

Page 30: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Results

wild typemutant

mutationsmutantsmutation

plantsplantgenegenes

arabidopsis

p53cell cycleactivitycyclin

regulation

amino acidscdna

sequenceisolatedprotein

genedisease

mutationsfamiliesmutation

rnadna

rna polymerasecleavage

site

cellscell

expressioncell lines

bone marrow

united stateswomen

universitiesstudents

education

sciencescientists

saysresearchpeople

researchfundingsupport

nihprogram

surfacetip

imagesampledevice

laseropticallight

electronsquantum

materialsorganicpolymerpolymersmolecules

volcanicdepositsmagmaeruption

volcanism

mantlecrust

upper mantlemeteorites

ratios

earthquakeearthquakes

faultimages

data

ancientfoundimpact

million years agoafrica

climateocean

icechanges

climate change

cellsproteins

researchersproteinfound

patientsdisease

treatmentdrugsclinical

geneticpopulationpopulationsdifferencesvariation

fossil recordbirds

fossilsdinosaurs

fossil

sequencesequences

genomedna

sequencing

bacteriabacterial

hostresistanceparasite

developmentembryos

drosophilagenes

expression

speciesforestforests

populationsecosystems

synapsesltp

glutamatesynapticneurons

neuronsstimulusmotorvisual

cortical

ozoneatmospheric

measurementsstratosphere

concentrations

sunsolar wind

earthplanetsplanet

co2carbon

carbon dioxidemethane

water

receptorreceptors

ligandligands

apoptosis

proteinsproteinbindingdomaindomains

activatedtyrosine phosphorylation

activationphosphorylation

kinase

magneticmagnetic field

spinsuperconductivitysuperconducting

physicistsparticlesphysicsparticle

experimentsurfaceliquid

surfacesfluid

model reactionreactionsmoleculemolecules

transition state

enzymeenzymes

ironactive sitereduction

pressurehigh pressure

pressurescore

inner core

brainmemorysubjects

lefttask

computerproblem

informationcomputersproblems

starsastronomers

universegalaxiesgalaxy

virushiv

aidsinfectionviruses

miceantigent cells

antigensimmune response

How might this graph have been obtained?

Page 31: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Predicting word association norms“the” -> ?

“dog” -> ?

Page 32: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Median Rank of k’th Associate

Note: multiple resamples can be used here

Page 33: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

The Topic Modeling IndustryVery popular methodology because it can be mapped to many problem domains

E.g., Netflix taskdocument -> user

word -> film viewed

topic -> grouping of users by preferences, grouping of films by similarity

E.g., microbial source trackingdocument -> sample

word -> bacterial DNA

topic -> bacteria source

Page 34: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Combining Syntax and SemanticsLSA and Topic Model are “bag o’ words” models

Model sequential structure with 3d order HMMhidden state is category of word; 50 states

1 state for start or end of a sentence

48 states for document-independent words (syntax)

1 state for document-dependent words (semantics)

Semantics generated by topic model

Page 35: Latent Dirichlet Allocation (LDA) Also Known As Topic Modeling · 2018-03-22 · •mathematical model •a bit hacky Topic Model (LDA) •probabilistic model •principled -> has

Categorical distributions (most probable words in state)