Topic Models for Signal Processing - Carnegie Mellon University


Page 1: Topic Models for Signal Processing - Carnegie Mellon University

Topic Models for Signal Processing

Bhiksha Raj, Carnegie Mellon University
Paris Smaragdis, University of Illinois, Urbana-Champaign
http://topicmodelsforaudio.com

ICASSP 2011 Tutorial: Applications of Topic Models for Signal Processing

Page 2: Topic Models for Signal Processing - Carnegie Mellon University

The Engineer and the Musician

Once upon a time a rich potentate discovered a previously unknown recording of a beautiful piece of music. Unfortunately, it was badly damaged.

He greatly wanted to find out what it would sound like if it were not.

So he hired an engineer and a musician to solve the problem.

Page 3: Topic Models for Signal Processing - Carnegie Mellon University

The Engineer and the Musician

The engineer worked for many years. He spent much money and published many papers.

Finally he had a somewhat scratchy restoration of the music.

The musician listened to the music carefully for a day, transcribed it, broke out his trusty keyboard and replicated the music.

Page 4: Topic Models for Signal Processing - Carnegie Mellon University

The Prize

Who do you think won the princess?


Page 5: Topic Models for Signal Processing - Carnegie Mellon University

Sounds – an example

• A sequence of notes
• Chords from the same notes
• A piece of music from the same (and a few additional) notes

Page 6: Topic Models for Signal Processing - Carnegie Mellon University

Sounds – another example

• A sequence of sounds
• A proper speech utterance from the same sounds

Page 7: Topic Models for Signal Processing - Carnegie Mellon University

Template Sounds Combine to Form a Signal

• The component sounds "combine" to form complex sounds
  – Notes form music
  – Phoneme-like structures combine in utterances
• Sound in general is composed of such "building blocks" or themes
  – Which can be simple (e.g. notes) or complex (e.g. phonemes)
  – Our definition of a building block: the entire structure occurs repeatedly in the process of forming the signal
• Goal: Learn these building blocks automatically, from data
  – Learn to be the musician
  – But through automated analysis of data

Page 8: Topic Models for Signal Processing - Carnegie Mellon University

The representation

[Figure: a waveform (amplitude vs. time) and its spectrogram (frequency vs. time)]

• Requirement: Building blocks must combine additively
  – Linear: The presence of two notes does not distort either note
  – Constructive: Notes do not cancel – adding one note will not diminish another
• Spectral representations: the magnitude spectra of uncorrelated signals add
  – In theory, power spectra add; in practice, this holds better for magnitude spectra
• We represent signals spectrographically, as a sequence of magnitude spectral vectors estimated from (overlapping) segments of the signal
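
As a concrete illustration, here is a minimal numpy sketch of this representation: frame the signal into overlapping windows and take magnitude spectra. The frame length, hop size, and Hann window here are illustrative assumptions consistent with the 75% overlap mentioned later, not values fixed by the tutorial.

    import numpy as np

    def magnitude_spectrogram(x, frame_len=1024, hop=256):
        """Sequence of magnitude spectra from overlapping, windowed frames."""
        window = np.hanning(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[t * hop : t * hop + frame_len] * window
                           for t in range(n_frames)])
        # rfft keeps the non-negative frequencies; magnitude, not power
        return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq, time)

    # e.g. a 1 s two-tone test signal at 16 kHz
    sr = 16000
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
    S = magnitude_spectrogram(x)
    print(S.shape)  # (513, 59)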

Page 9: Topic Models for Signal Processing - Carnegie Mellon University

Discovering Building Blocks

• The magnitude spectrogram is a matrix of numbers
  – The f-th frequency component in frame t is the (t,f)-th entry in the matrix
• Standard matrix decomposition methods may be applied to discover additive building blocks:
  – PCA, ICA...
• Constraint: PCA, ICA etc. will result in bases with negative components
  – Not physically meaningful (what is a negative power spectrum?)

[Figure: spectrogram of two intermittent tones, and the PCA bases learned from the sound – the bases contain negative values]
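
A small synthetic sketch of the point above: PCA bases of a toy two-tone "spectrogram" contain negative entries. The toy matrix and bin choices are assumptions made for illustration.

    import numpy as np

    # Toy "spectrogram": two intermittent tones, 8 frequency bins x 200 frames
    rng = np.random.default_rng(0)
    tone1 = np.array([0, 1, 0, 0, 0, 0, 0, 0.0])
    tone2 = np.array([0, 0, 0, 0, 1, 0, 0, 0.0])
    gains = rng.random((2, 200)) * (rng.random((2, 200)) > 0.5)  # on/off gating
    V = np.outer(tone1, gains[0]) + np.outer(tone2, gains[1])

    # PCA via SVD of the mean-removed matrix
    U, s, _ = np.linalg.svd(V - V.mean(axis=1, keepdims=True))
    print(U[:, :2].round(2))  # the leading "bases" contain negative entries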


Page 11: Topic Models for Signal Processing - Carnegie Mellon University

Magnitude Spectra are like Histograms

• Magnitude spectrograms comprise sequences of magnitude spectra
• Magnitude spectra are sequences of non-negative numbers
  – Which can be viewed as scaled histograms
• Histograms may be viewed as the outcome of draws from a multinomial distribution
• Analysis techniques related to multinomial processes apply

Page 12: Topic Models for Signal Processing - Carnegie Mellon University

What follows

• A tutorial in two parts
• Part A:
  – Model definition
  – Equations, update rules and other squiggles
• Part B:
  – Applications and extensions and the other fun stuff
• What is covered:
  – Application of multinomial and topic models to audio
• What is not covered:
  – Text and text-related models

Page 13: Topic Models for Signal Processing - Carnegie Mellon University

Part A: Overview

• Multinomial distributions and topic models
• Signal representations
• Application of PLCA to signals
  – Bag of frequencies model
  – Bag of spectrograms model
• Latent variable models with priors
  – Sparsity and entropic priors
  – Cross-entropic priors

Page 14: Topic Models for Signal Processing - Carnegie Mellon University

Part A: Overview

• Convolutional models
• High-dimensional data
  – Tensors and higher-dimensional data

Page 15: Topic Models for Signal Processing - Carnegie Mellon University

Part B: Overview

• Low-Rank Models of Signals
• Separation of Monophonic Mixtures
• Recognition in Mixtures
• Temporal demixing
• Pitch Tracking
• User Interfaces
• Multimodality

Page 16: Topic Models for Signal Processing - Carnegie Mellon University

The Multinomial Distribution

• The outcome of a draw can take one of a finite number of discrete values
  – A man drawing coloured balls from an urn
• The probability of retrieving a ball of a given color depends on the fraction of balls in the urn that are of that color
  – P(red) = no. red balls / total balls

Page 17: Topic Models for Signal Processing - Carnegie Mellon University

A probability distribution over words

[Figure: an urn of balls labeled "lion", "cat", "dog", and the resulting distribution P(word)]

• The balls have words written on them
  – I.e. draws now obtain words, not colors
• P(word) = no. of balls with the given word on them / total balls

Page 18: Topic Models for Signal Processing - Carnegie Mellon University

The probability of a collection

[Figure: the drawn collection regrouped as num(dog) x dog, num(cat) x cat, num(lion) x lion]

• Probability of drawing a specific collection of words:

  P(n(word_1) \text{ of } word_1, n(word_2) \text{ of } word_2, \ldots) = C \prod_{word} P(word)^{n(word)}

  \log P(n(word_1) \text{ of } word_1, n(word_2) \text{ of } word_2, \ldots) = \sum_{word} n(word) \log P(word) + C'

• The constant accounts for permutations

Page 19: Topic Models for Signal Processing - Carnegie Mellon University

Estimating probabilities

[Figure: given a collection of drawn words, find the underlying distribution P(word)]

• Given: a collection of words, find the underlying probability distribution
  – A problem we encounter repeatedly
• The maximum likelihood estimate:

  \{\hat{P}(word)\} = \arg\max_{\{P(word)\}} \sum_{word} n(word) \log P(word)

  \hat{P}(word) = \frac{n(word)}{\sum_{word'} n(word')}
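
A minimal sketch of this estimate: the ML multinomial is just the normalized count vector. The example draws are made up.

    from collections import Counter

    draws = ["dog", "dog", "lion", "cat", "dog", "lion"]
    counts = Counter(draws)
    total = sum(counts.values())
    P = {word: n / total for word, n in counts.items()}
    print(P)  # {'dog': 0.5, 'lion': 0.333..., 'cat': 0.166...}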

Page 20: Topic Models for Signal Processing - Carnegie Mellon University

The expected counts

[Figure: an urn of lion/cat/dog balls and the distribution P(word)]

• Given:
  – The probabilities of all words, P(word)
  – The total number of draws, N
• What is:
  – The expected number of draws of any word?
• Answer: E[n(word)] = N \cdot P(word)

Page 21: Topic Models for Signal Processing - Carnegie Mellon University

The inverse problem

[Figure: an urn of lion/cat/dog balls and the distribution P(word)]

• Given: the number of times "lion" was drawn, N(lion)
• How many times was "dog" drawn?

  N(dog) = N(lion) \frac{P(dog)}{P(lion)}

Page 22: Topic Models for Signal Processing - Carnegie Mellon University

The inverse multinomial

• Given P(word) for all words
• Observed n(word_1), n(word_2), ..., n(word_k)
• What is n(word_{k+1}), n(word_{k+2}), ...?

  E[n(word_i)] = \frac{N_o}{P_o} P(word_i), \quad i > k

• N_o is the total number of observed counts
  – n(word_1) + n(word_2) + ...
• P_o is the total probability of observed events
  – P(word_1) + P(word_2) + ...
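
A tiny sketch of this expected-count reconstruction; the distribution and observed counts are made-up numbers.

    # Expected counts of unobserved words, given observed counts and P(word)
    P = {"lion": 0.25, "dog": 0.5, "cat": 0.25}
    observed = {"lion": 10, "dog": 22}            # n(word_1 .. word_k)
    N_o = sum(observed.values())                  # total observed counts
    P_o = sum(P[w] for w in observed)             # total observed probability
    expected = {w: N_o * p / P_o for w, p in P.items() if w not in observed}
    print(expected)  # {'cat': 32 * 0.25 / 0.75 = 10.67}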

Page 23: Topic Models for Signal Processing - Carnegie Mellon University

A Modified Experiment

[Figure: a sequence of outcomes "dog dog lion cat dog lion ..." drawn from several urns, each with its own mix of lion/cat/dog balls]

• Multiple pots with different distributions
  – The picker first selects a pot with probability P(pot)
  – Then he draws a ball from the pot
• We only see the outcomes!
• Can we estimate the pots?

Page 24: Topic Models for Signal Processing - Carnegie Mellon University

A mixture multinomial

[Figure: the outcome sequence "dog dog lion cat dog lion ..." and several urns of lion/cat/dog balls]

• The outcomes are a mixture of draws from many pots
  – I.e. a mixture of draws from several multinomials
• The process is a mixture multinomial
• Probability of a word:

  P(X) = \sum_Z P(Z) P(X|Z)

• Z is the pot
• X are the words

Page 25: Topic Models for Signal Processing - Carnegie Mellon University

Mixture multinomials are multinomials too

[Figure: several urns of lion/cat/dog balls]

• From the outside viewer's perspective, it is just a multinomial
  – The outcome is one of a countable, finite set of values
• Constraint: probabilities are composed from component multinomials

  P(X) = \sum_Z P(Z) P(X|Z)

Page 26: Topic Models for Signal Processing - Carnegie Mellon University

Multinomials and Simplexes

• Multinomials are probability vectors
  – If the vocab size is V, they live in a (V-1)-dimensional subspace of V-space
  – Strictly in the positive orthant
• Mixture multinomials reside within simplexes specified by the component multinomials

Page 27: Topic Models for Signal Processing - Carnegie Mellon University

Estimating Mixture Multinomials

  \log P(n(X_1) \text{ of } X_1, n(X_2) \text{ of } X_2, \ldots) = \sum_X n(X) \log \sum_Z P(Z) P(X|Z)

• Direct optimization is not possible
• Expectation Maximization: introduce Q(X,Z) and apply Jensen's inequality

  \sum_X n(X) \log \sum_Z P(Z) P(X|Z) = \sum_X n(X) \log \sum_Z Q(X,Z) \frac{P(Z) P(X|Z)}{Q(X,Z)}

  \geq \sum_X n(X) \sum_Z Q(X,Z) \log \frac{P(Z) P(X|Z)}{Q(X,Z)}

Page 28: Topic Models for Signal Processing - Carnegie Mellon University

Estimating Mixture Multinomials

• Introducing Q(X,Z):

  \sum_X n(X) \log \sum_Z P(Z) P(X|Z) \geq \sum_X n(X) \sum_Z Q(X,Z) \log \frac{P(Z) P(X|Z)}{Q(X,Z)}

• Optimizing with respect to Q(X,Z):

  Q(X,Z) = \frac{P(Z) P(X|Z)}{\sum_{Z'} P(Z') P(X|Z')}

• Optimizing P(Z) and P(X|Z):

  P(Z) = \frac{\sum_X n(X) Q(X,Z)}{\sum_{Z'} \sum_X n(X) Q(X,Z')}

  P(X|Z) = \frac{n(X) Q(X,Z)}{\sum_{X'} n(X') Q(X',Z)}
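
A minimal numpy sketch of these EM updates for a mixture of multinomials. The count vector, K, and iteration count are illustrative; note that a single bag of counts does not identify the mixture uniquely, so this only illustrates the mechanics of the updates.

    import numpy as np

    def em_mixture_multinomial(n, K, iters=100, seed=0):
        """EM for a mixture of K multinomials given a count vector n (length V).
        Returns P(Z) of shape (K,) and P(X|Z) of shape (V, K)."""
        rng = np.random.default_rng(seed)
        V = len(n)
        Pz = np.full(K, 1.0 / K)
        Pxz = rng.random((V, K))
        Pxz /= Pxz.sum(axis=0)
        for _ in range(iters):
            # E-step: Q(X,Z) -- every instance of X is fragmented the same way
            Q = Pxz * Pz                       # (V, K), proportional
            Q /= Q.sum(axis=1, keepdims=True)
            # M-step: re-count the fragments
            frag = n[:, None] * Q              # n(X) Q(X,Z)
            Pz = frag.sum(axis=0) / frag.sum()
            Pxz = frag / frag.sum(axis=0, keepdims=True)
        return Pz, Pxz

    n = np.array([30.0, 10.0, 20.0])  # counts of e.g. dog, cat, lion
    Pz, Pxz = em_mixture_multinomial(n, K=2)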

Page 29: Topic Models for Signal Processing - Carnegie Mellon University

EM as counting

[Figure: the outcome sequence "dog dog lion cat dog lion ..." and several urns of lion/cat/dog balls]

• If the exact urn for each drawn instance was known:
  – Compute the distribution of each urn using urn-specific counts

Page 30: Topic Models for Signal Processing - Carnegie Mellon University

EM as counting

[Figure: the outcome sequence and urns, as before]

• When the exact urn is unknown:
  – Fragment each instance according to Q(X,Z)
    • Based on a current estimate of the probabilities
    • Every instance of X will be fragmented the same way
  – Compute the distribution of each urn by counting fragments

Page 31: Topic Models for Signal Processing - Carnegie Mellon University

EM as fragmentation and counting

• Fragmentation:

  Q(X,Z) = \frac{P(Z) P(X|Z)}{\sum_{Z'} P(Z') P(X|Z')}

• Counting:

  P(Z) = \frac{\sum_X n(X) Q(X,Z)}{\sum_{Z'} \sum_X n(X) Q(X,Z')}

  P(X|Z) = \frac{n(X) Q(X,Z)}{\sum_{X'} n(X') Q(X',Z)}

• We will use the "fragmentation and counting" explanation for the most part in the remaining slides
  – Instead of deriving detailed solutions
  – Which are probably easy to derive anyway

Page 32: Topic Models for Signal Processing - Carnegie Mellon University

Modification: Given P(X|Z)

• Fragmentation:

  Q(X,Z) = \frac{P(Z) P(X|Z)}{\sum_{Z'} P(Z') P(X|Z')}

• Counting:

  P(Z) = \frac{\sum_X n(X) Q(X,Z)}{\sum_{Z'} \sum_X n(X) Q(X,Z')}

• Given P(X|Z) (the urns) and a collection of counts n(X):
  – In what manner were the urns selected?
  – What is P(Z)?

Page 33: Topic Models for Signal Processing - Carnegie Mellon University

Topics and Bags of Words

• The bag of words model:
  – Put all words in a "document" into a "bag"
  – Only considers the occurrence of words
  – Sequence information is not used
• An alternate vector representation:

  The Quick Brown Fox Jumped Over Lazy Dog Had Cat For Supper
  [ 4    1     1    1     1     1    1    2   1   1   1    1  ]

  – Each number represents the number of instances of a word
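
A minimal sketch of building that count vector against a fixed vocabulary:

    from collections import Counter

    doc = ("the quick brown fox jumped over the lazy dog "
           "the dog had the cat for supper").split()
    vocab = ["the", "quick", "brown", "fox", "jumped", "over",
             "lazy", "dog", "had", "cat", "for", "supper"]
    counts = Counter(doc)
    bow = [counts[w] for w in vocab]
    print(bow)  # [4, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]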

Page 34: Topic Models for Signal Processing - Carnegie Mellon University

Topics and Documents

[Figure: three bags of words – sports, politics, and alien abduction – each drawn from its own urn with its own word distribution]

• Bags of words for different topics have different distributions
  – Drawn from different pots
  – Different words with different frequencies

Page 35: Topic Models for Signal Processing - Carnegie Mellon University

Documents as mixtures of topics

[Figure: the sports, politics, and alien-abduction urns mixing into a single newspaper bag]

• The bag of words model for a newspaper mixes topics
• Given word distributions for the topics and a bag of words for the newspaper, find the composition?
  – Estimate mixture weights

Page 36: Topic Models for Signal Processing - Carnegie Mellon University

Documents as mixtures of topics

[Figure: a newspaper document inside the simplex spanned by the sports, politics, and alien-abduction topics]

• A document lies in the topic simplex
  – Mixture weights are inversely related to the distance to each topic
  – There may be multiple ways of mixing topics to obtain the same document

Page 37: Topic Models for Signal Processing - Carnegie Mellon University

Documents as mixtures of topics

[Figure: mapping from the topic simplex, with corners (1,0,0) = topic1, (0,1,0) = topic2, (0,0,1) = topic3, to the mixture weight simplex]

• The mixture multinomial model maps a document from a point within the topic simplex to a point in the mixture weight simplex
  – Different locations in the mixture weight simplex may result in the same document vector
    • Depending on the arrangement of the topic vectors

Page 38: Topic Models for Signal Processing - Carnegie Mellon University

Documents differ in composition

[Figure: SI, NYT, and Weekly World News placed at different points of the simplex with corners (1,0,0) = sports, (0,0,1) = politics, (0,1,0) = alien abduction]

• The relative proportion of different topics varies
  – I.e. P(Z) has different structure for different documents
  – I.e. the location within the mixture weight and/or topic simplex differs for different documents

Page 39: Topic Models for Signal Processing - Carnegie Mellon University

A priori probabilities on mixtures

[Figure: a probability distribution P(Θ) over the mixture weight simplex, where Θ = {P(Z)}]

• The location of a document in the mixture weight simplex may vary
• The distribution of locations of documents from a given category can be captured by a priori probability distributions

Page 40: Topic Models for Signal Processing - Carnegie Mellon University

Dirichlet Distributions

  P(\Theta; \alpha) = \frac{1}{B(\alpha)} \prod_i \theta_i^{\alpha_i - 1}, \qquad B(\alpha) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\!\left(\sum_i \alpha_i\right)}

• The most common model for distributions of a priori probabilities is the Dirichlet distribution
  – Mathematical tractability
  – Conjugate prior to the multinomial

Page 41: Topic Models for Signal Processing - Carnegie Mellon University

Estimation with prior

  \hat{\Theta} = \arg\max_\Theta \log P(X|\Theta) + \log P(\Theta) = \arg\max_\Theta \sum_{word} n(word) \log P(word) + \log P(\Theta)

• Utilize the a priori distribution to better estimate mixture weights to explain new documents
  – Resolves the issue of multiple solutions
  – Improves the estimate in all cases
• This is the maximum a posteriori estimator
• Compare to the maximum likelihood estimator below:

  \hat{\Theta} = \arg\max_\Theta \sum_{word} n(word) \log P(word)
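
A small sketch of the MAP estimator for the multinomial itself. The closed form (counts + α − 1, renormalized) is the standard result for a Dirichlet prior and is not spelled out on the slide; it assumes α_i >= 1.

    import numpy as np

    def map_multinomial(counts, alpha):
        """MAP estimate of a multinomial under a Dirichlet(alpha) prior.
        Reduces to the ML estimate when alpha == 1 everywhere."""
        counts = np.asarray(counts, dtype=float)
        alpha = np.asarray(alpha, dtype=float)
        p = counts + alpha - 1.0   # assumes alpha_i >= 1
        return p / p.sum()

    counts = [30, 10, 20]
    print(map_multinomial(counts, alpha=[1, 1, 1]))  # ML: [0.5, 0.167, 0.333]
    print(map_multinomial(counts, alpha=[5, 5, 5]))  # smoothed toward uniform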

Page 42: Topic Models for Signal Processing - Carnegie Mellon University

Estimating the Dirichlets (and others)

• Estimate mixture weights on training data, then estimate the Dirichlet parameters
• Jointly estimate mixture weights and Dirichlets on training data
• Jointly estimate topics, mixture weights and Dirichlet parameters from training data
• Requires complex estimation procedures that are not relevant to us

Page 43: Topic Models for Signal Processing - Carnegie Mellon University

More complex models

• Other forms of priors
  – E.g. correlated topics
• Priors on priors
• Capturing temporal sequences
  – E.g. Markov chains on topics
• Etc.
• Not directly relevant

Page 44: Topic Models for Signal Processing - Carnegie Mellon University

Mixture multinomials and audio

• Audio data representations are similar to histograms
• Topic model frameworks apply.
• We will generically refer to topic models and extensions applied to audio as PLCA
  – Probabilistic Latent Component Analysis

Page 45: Topic Models for Signal Processing - Carnegie Mellon University

Representing Audio

• Basic representation: Spectrogram
  – Computed using a short-time Fourier transform
  – Segment the audio into "frames" of 20-64 ms
    • Adjacent frames overlap by 75%
  – "Window" the segments
  – Compute a Fourier spectrum from each frame

Page 46: Topic Models for Signal Processing - Carnegie Mellon University

Spectrogram

S1(1) S2(1) S3(1) S4(1) S5(1) S6(1) …

S1(2) S2(2) S3(2) S4(2) S5(2) S6(2) …

S1(3) S2(3) S3(3) S4(3) S5(3) S6(3) …

… … … … … … …

• The spectrogram is a series of complex vectors.
  – It can be decomposed into magnitude and phase
• We will, however, work primarily on the magnitude
• Why: magnitude spectra of uncorrelated signals combine additively
  – In theory power spectra are additive, but in practice this holds better for magnitudes

Page 47: Topic Models for Signal Processing - Carnegie Mellon University

The Spectrogram as a Histogram

• A generative model for one frame of a spectrogram
• A magnitude spectrum represents energy against discrete frequencies
• This may be viewed as a histogram of draws from a multinomial

[Figure: the magnitude spectrum of frame t viewed as a histogram over frequencies f, with P_t(f) the probability distribution underlying the t-th spectral vector]
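
A minimal sketch of this generative picture, assuming we give the frame a count budget N: the frame's spectrum, normalized, is the multinomial from which the "draws" come.

    import numpy as np

    rng = np.random.default_rng(0)

    # An (assumed) magnitude spectrum for one frame, over 6 frequency bins
    spectrum = np.array([0.2, 3.0, 0.5, 2.2, 0.4, 0.1])

    # The underlying multinomial over frequencies for this frame
    Pt = spectrum / spectrum.sum()

    # Treat the frame as N draws from that multinomial
    N = 1000
    histogram = rng.multinomial(N, Pt)
    print(Pt.round(3), histogram)  # histogram / N approximates Pt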

Page 48: Topic Models for Signal Processing - Carnegie Mellon University

A Mixture Multinomial Model

• The "picker" has multiple urns
  – For each draw he selects an urn and then draws a ball from it
• The overall probability of drawing f is a mixture multinomial

[Figure: multiple draws accumulating into a histogram over frequencies]

Page 49: Topic Models for Signal Processing - Carnegie Mellon University

The Picker Generates a Spectrogram

• The picker has a fixed set of urns
  – Each urn has a different probability distribution over f
• He draws the spectrum for the first frame
  – In which he selects urns according to some probability P_0(z)
• Then draws the spectrum for the second frame
  – In which he selects urns according to some probability P_1(z)
• And so on, until he has constructed the entire spectrogram
  – The number of draws in each frame represents the energy in that frame
    • Actually the total magnitude spectral value

Page 55: Topic Models for Signal Processing - Carnegie Mellon University

The PLCA Model

• The urns are the same for every frame
  – These are the component multinomials, or bases, for the source that generated the signal
• The only difference between frames is the probability with which he selects the urns

  P_t(f) = \sum_z P_t(z) P(f|z)

  – P_t(f): frame-specific spectral distribution
  – P(f|z): SOURCE-specific bases
  – P_t(z): frame (time) specific mixture weight

Page 56: Topic Models for Signal Processing - Carnegie Mellon University

Spectral View of Component Multinomials

• For audio, each component multinomial (urn) is actually a normalized histogram over frequencies P(f|z)
  – I.e. a spectrum
• Component multinomials represent spectral building blocks for the given sound source
• The spectrum for every analysis frame is explained as an additive combination of these basic spectral patterns

Page 57: Topic Models for Signal Processing - Carnegie Mellon University

Learning Bases

• By "learning" the mixture multinomial model for any sound source we "discover" these "basis" spectral structures for the source
• The model can be learnt from spectrograms of training audio from the source using the EM algorithm
  – May not even require large amounts of audio

Page 58: Topic Models for Signal Processing - Carnegie Mellon University

Likelihood of a Spectrogram

  P(\{S_t(f)\ \forall t,f\}) = \prod_t \prod_f \left( \sum_z P_t(z) P(f|z) \right)^{S_t(f)}

  \log P(\{S_t(f)\ \forall t,f\}) = \sum_t \sum_f S_t(f) \log \sum_z P_t(z) P(f|z)

• S_t(f) is the f-th frequency of the magnitude spectrogram in the t-th frame
• Once again, direct maximum likelihood estimation is not possible
• EM estimation is required

Page 59: Topic Models for Signal Processing - Carnegie Mellon University

Learning the Bases

• Simple EM solution
  – Except that the bases are learned from all frames

• Fragmentation:

  P_t(z|f) = \frac{P_t(z) P(f|z)}{\sum_{z'} P_t(z') P(f|z')}

• Counting:

  P_t(z) = \frac{\sum_f P_t(z|f) S_t(f)}{\sum_{z'} \sum_f P_t(z'|f) S_t(f)}

  P(f|z) = \frac{\sum_t P_t(z|f) S_t(f)}{\sum_{f'} \sum_t P_t(z|f') S_t(f')}    (the "basis" distribution)
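
A minimal numpy sketch of these updates, assuming a non-negative magnitude spectrogram S of shape (F, T) and K bases; the initialization and iteration count are illustrative.

    import numpy as np

    def plca(S, K, iters=200, seed=0):
        """PLCA via EM. S: (F x T) non-negative magnitude spectrogram.
        Returns P(f|z) of shape (F, K) and Pt(z) of shape (K, T)."""
        rng = np.random.default_rng(seed)
        F, T = S.shape
        Pfz = rng.random((F, K)); Pfz /= Pfz.sum(axis=0)
        Ptz = rng.random((K, T)); Ptz /= Ptz.sum(axis=0)
        for _ in range(iters):
            # Fragmentation: Pt(z|f) for every (f, t), shape (F, K, T)
            joint = Pfz[:, :, None] * Ptz[None, :, :]
            post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
            # Counting: weight the posteriors by the observed "counts" S
            frag = post * S[:, None, :]
            Ptz = frag.sum(axis=0)
            Ptz /= Ptz.sum(axis=0, keepdims=True) + 1e-12
            Pfz = frag.sum(axis=2)
            Pfz /= Pfz.sum(axis=0, keepdims=True) + 1e-12
        return Pfz, Ptz

The columns of Pfz are the learned spectral bases; the columns of Ptz are the per-frame mixture weights.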

Page 62: Topic Models for Signal Processing - Carnegie Mellon University

Learning Building Blocks

[Figure: a speech signal decomposed into bases and basis-specific spectrograms P_t(z) P(f|z); the bases P(f|z) shown are learned from Bach's Fugue in G minor, with frequency vs. time plots and the weights P_t(z)]

Page 63: Topic Models for Signal Processing - Carnegie Mellon University

What about other data

• Faces
  – Trained 49 multinomial components on 2500 faces
    • Each face unwrapped into a 361-dimensional vector
  – Discovers parts of faces

Page 64: Topic Models for Signal Processing - Carnegie Mellon University

Given Bases, Find Composition

  P_t(f) = \sum_z P_t(z) P(f|z)

[Figure: a spectrum S_t(f) decomposed over the given bases]

• Iterative process:
  – Compute the a posteriori probability of the z-th topic for each frequency f in the t-th spectrum:

    P_t(z|f) = \frac{P_t(z) P(f|z)}{\sum_{z'} P_t(z') P(f|z')}

  – Compute the mixture weight of the z-th basis:

    P_t(z) = \frac{\sum_f P_t(z|f) S_t(f)}{\sum_{z'} \sum_f P_t(z'|f) S_t(f)}
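
A sketch of this weights-only estimation: the same fragmentation-and-counting loop as before, but with the bases Pfz held fixed (e.g. taken from the plca() sketch above).

    import numpy as np

    def plca_weights(S, Pfz, iters=100):
        """Estimate Pt(z) for spectrogram S (F x T) given fixed bases Pfz (F x K)."""
        F, K = Pfz.shape
        T = S.shape[1]
        Ptz = np.full((K, T), 1.0 / K)
        for _ in range(iters):
            joint = Pfz[:, :, None] * Ptz[None, :, :]           # (F, K, T)
            post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
            Ptz = (post * S[:, None, :]).sum(axis=0)            # count fragments
            Ptz /= Ptz.sum(axis=0, keepdims=True) + 1e-12
        return Ptz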

Page 65: Topic Models for Signal Processing - Carnegie Mellon University

Bag of Frequencies vs. Bag of Spectrograms

• The PLCA model described is a "bag of frequencies" model
  – Similar to "bag of words"
• It composes the spectrogram one frame at a time
  – The contribution of the bases to a frame does not affect other frames
• Random variables:
  – Frequency
  – Possibly also the total number of draws in a frame, N_t

Page 66: Topic Models for Signal Processing - Carnegie Mellon University

Bag of Frequencies PLCA model

[Figure: over time, frames T=0, 1, ..., k use mixture weights P_0(Z), P_1(Z), ..., P_k(Z) over the same bases P(f|z), Z = 0 ... M]

• Bases are simple distributions over frequencies
• The manner of selection of urns/components varies from analysis frame to analysis frame

Page 67: Topic Models for Signal Processing - Carnegie Mellon University

Bag of Spectrograms PLCA Model

[Figure: "super-pots" Z = 1 ... M, each containing a time pot P(T|Z) and a frequency pot P(F|Z)]

• Compose the entire spectrogram all at once
• Complex "super-pots" include two sub-pots
  – One pot has a distribution over frequencies: these are our bases
  – The second has a distribution over time
• Each draw:
  – Select a super-pot
  – Draw "F" from the frequency pot
  – Draw "T" from the time pot
  – Increment the histogram at (T,F)

  P(t,f) = \sum_Z P(Z) P(t|Z) P(f|Z)

Page 68: Topic Models for Signal Processing - Carnegie Mellon University

The bag of spectrograms

[Figure: a draw selects a super-pot Z, then T from P(T|Z) and F from P(F|Z), and increments the histogram at (T,F); repeated N times this builds the spectrogram]

• Drawing procedure:

  P(t,f) = \sum_Z P(Z) P(t|Z) P(f|Z)

• Fundamentally equivalent to the bag of frequencies model
  – With some minor differences in estimation

ICASSP 2011 Tutorial: Applications of Topic Models for Signal Processing – Smaragdis, Raj

Page 69: Topic Models for Signal Processing - Carnegie Mellon University

Estimating the bag of spectrograms

• EM update rules:

  P(z|t,f) = \frac{P(z) P(f|z) P(t|z)}{\sum_{z'} P(z') P(f|z') P(t|z')}

  P(z) = \frac{\sum_t \sum_f P(z|t,f) S_t(f)}{\sum_{z'} \sum_t \sum_f P(z'|t,f) S_t(f)}

  P(f|z) = \frac{\sum_t P(z|t,f) S_t(f)}{\sum_{f'} \sum_t P(z|t,f') S_t(f')}

  P(t|z) = \frac{\sum_f P(z|t,f) S_t(f)}{\sum_{t'} \sum_f P(z|t',f) S_{t'}(f)}

• Can learn all parameters
• Can learn P(T|Z) and P(Z) only, given P(f|Z)
• Can learn only P(Z)

Page 70: Topic Models for Signal Processing - Carnegie Mellon University

Bag of frequencies vs. bag of spectrograms

• Fundamentally equivalent
• Difference in estimation:
  – Bag of spectrograms: for a given total N and P(Z), the total "energy" assigned to a basis is determined
    • Increasing its energy at one time will necessarily decrease its energy elsewhere
  – No such constraint for bag of frequencies
    • More unconstrained
• Bag of frequencies is more amenable to the imposition of a priori distributions
• Bag of spectrograms is a more natural fit for other models

Page 71: Topic Models for Signal Processing - Carnegie Mellon University

The PLCA Tensor Model

[Figure: super-pots Z, each with sub-pots P(A|Z), P(B|Z), P(C|Z)]

• The bag of spectrograms can be extended to multivariate data:

  P(a,b,c,\ldots) = \sum_Z P(Z) P(a|Z) P(b|Z) P(c|Z) \cdots

• The EM update rules are essentially identical to the bivariate case

Page 72: Topic Models for Signal Processing - Carnegie Mellon University

How many bases

• The previous examples assumed knowledge of the number of bases
• How do we know the correct number of bases?
  – In general we do not
  – There is no good mechanism to learn this automatically
• Must determine as many bases as the data/math will allow
  – With appropriate constraints for best discovery of bases

Page 73: Topic Models for Signal Processing - Carnegie Mellon University

A Geometric View

[Figure: probability simplex with corners (1,0,0), (0,1,0), (0,0,1); x = normalized training spectrum, o = learned component multinomial]

• Normalized spectrograms/spectral vectors live on the probability simplex
• ML estimation approximates these as linear combinations of components
  – Spectral vectors lying outside the region enclosed by the components are modeled with error
    • As measured by the KL distance between the approximation and the true vector
• ML estimation learns parameters to minimize this error

Page 74: Topic Models for Signal Processing - Carnegie Mellon University

PLCA as Matrix Decompositions

• Bag of frequencies model:

  Spectrogram (D x T)  =  P(F|Z) components (D x K)  ·  P_t(Z) weights (K x T)  ·  diag(n(t)) energy (T x T)

• Bag of spectrograms model:

  Spectrogram (D x T)  =  P(F|Z) components (D x K)  ·  P(Z) (K x K diagonal)  ·  P(T|Z) weights (K x T)

• PLCA decomposes the spectrogram matrix as a product of matrices
  – The first matrix represents the multinomial components
  – The second represents activations
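
A self-contained sketch of the bag-of-frequencies factorization read generatively: build a spectrogram from normalized components, normalized weights, and per-frame energies. All shapes and numbers are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    D, K, T = 20, 5, 50
    Pfz = rng.random((D, K)); Pfz /= Pfz.sum(axis=0)   # components, D x K
    Ptz = rng.random((K, T)); Ptz /= Ptz.sum(axis=0)   # weights, K x T
    n_t = rng.random(T) * 100                          # per-frame energy, diag(n(t))
    S = (Pfz @ Ptz) * n_t[None, :]                     # D x T "spectrogram"
    # each column of Pfz @ Ptz sums to 1, so the frame energies are exactly n(t)
    print(np.allclose(S.sum(axis=0), n_t))             # True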

Page 75: Topic Models for Signal Processing - Carnegie Mellon University

A Mathematical Constraint

  Spectrogram (D x T)  =  P(F|Z) components (D x K)  ·  weights (K x T), where the weights are P_t(Z) or P(Z)P(T|Z)

• Estimate the RHS to minimize the KL divergence between the LHS and the RHS
• Unique estimates are possible if K < D
  – Dimensionality reduction
  – But with non-zero error
• K >= D will result in non-unique solutions
  – The learned bases will be mixed up

Page 76: Topic Models for Signal Processing - Carnegie Mellon University

No limitation in nature

• Music: hundreds of instrument types, dozens of notes
  – Thousands of bases
• Speech: hundreds of sound patterns
• The arithmetic limitation on "K" is artificial
• Reason for the limitation: there is no requirement for the bases to be "complex"
  – For K >= D, bases may be individual frequencies
    • The model imposes no constraint
  – "Natural" bases, on the other hand, are complex sounds
• Requirement: a learning procedure that permits learning more bases than the number of dimensions

Page 77: Topic Models for Signal Processing - Carnegie Mellon University

An Alternate View

• Allow any arbitrary number of bases (urns)
• Specify that for any specific frame, only a small number of bases may be used
  – Although there are many spectral structures, any given frame only has a few of these
• Only a small number of urns are used in any frame
  – The mixture weights must be sparse

Page 78: Topic Models for Signal Processing - Carnegie Mellon University

Sparse overcomplete estimation

  (D x T)  =  (D x K) · (K x T)

• Learning more bases than dimensions results in overcomplete representations
• The activation (second) matrix must be sparse to accommodate this
  – I.e. sparse, overcomplete representations
• Sparse activation matrix => the number of non-zero entries in the matrix is small
  – I.e. the columns of the matrix have small L0 norm

Page 79: Topic Models for Signal Processing - Carnegie Mellon University

A Brief Segue: Compressive Sensing and Sparse Estimation

• Given: Y, where Y = AX
  – Y is D x 1, A is D x K, X is K x 1
• dim(X) >> dim(Y)
  – K >> D
• Given A, estimate X
• Underspecified: no unique solution for X
  – Unique solutions if X is sparse

Page 80: Topic Models for Signal Processing - Carnegie Mellon University

Sparse Estimation and CS

• Y = AX,  Y = D x 1,  X = K x 1,  D << K
  – Given: X is sparse with no more than "S" components
• True solution:
  – Solve for every "S-sparse" subvector of X
  – Find the one that results in zero error in Y
  – Requires D >= 2S for a unique solution
  – NP complete

Page 81: Topic Models for Signal Processing - Carnegie Mellon University

Sparse Estimation and CS

• Approximate solution 1:
  – Minimize ||Y − AX||₂² + |X|₁
  – The X value that minimizes both the error and the L1 norm of X
  – Guaranteed to result in the optimal sparse estimate of X under strict conditions
• Approximate solution 2:
  – Minimize ||Y − AX||₂² constrained to |X|₀ = S
  – IHT, COSAMP, ...
• These solutions are also applicable to a larger class of problems than quadratic error minimization

Page 82: Topic Models for Signal Processing - Carnegie Mellon University

Sparse Estimation of Topic Models

• Similar to CS and other sparse estimation problems
  – KL distance, rather than squared error
• L1 minimization does not apply
  – Our data are probability vectors with L1 norm = 1
• L0 minimization techniques do apply
  – With appropriate generalization
  – But they are often ineffective
    • Poor solutions

Page 83: Topic Models for Signal Processing - Carnegie Mellon University

Sparsity through priors

[Figure: probability simplex; corners such as (1,0,0) are very sparse, edge points such as (a,0,b) are somewhat sparse, and interior points (a,b,c) are not sparse]

• Mixture weights reside in the probability simplex
• Corners and edges of the probability simplex represent "sparse" regions
• An a priori probability distribution that favors edges and corners will result in sparse estimates

Page 84: Topic Models for Signal Processing - Carnegie Mellon University

Sparse Priors

• Dirichlet distributions with appropriate parameters can force sparse solutions
  – As shown
• But the objective (probability) surface is shallow in the middle and steep at the sparse points
  – Unstable
• Instead, we will use an entropic prior

Page 85: Topic Models for Signal Processing - Carnegie Mellon University

Entropy as a measure of sparsity

[Figure: a one-spike histogram over six values (entropy = 0) vs. a uniform histogram (Shannon entropy = log(6) = max)]

• Entropy is also a measure of sparsity
  – Fewer possibilities == greater predictability == lower entropy
• Different entropy measures:
  – Shannon entropy of a distribution Θ = {P_i}:

    H(\Theta) = -\sum_i P_i \log P_i

  – Renyi entropy (tends to the Shannon entropy as α → 1):

    H_\alpha(\Theta) = \frac{1}{1-\alpha} \log \sum_i P_i^\alpha
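
A minimal sketch computing both measures on a very sparse and a maximally dense distribution:

    import numpy as np

    def shannon_entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # 0 log 0 = 0 by convention
        return -np.sum(p * np.log(p))

    def renyi_entropy(p, alpha):
        p = np.asarray(p, dtype=float)
        return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

    sparse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    dense = [1 / 6.0] * 6
    print(shannon_entropy(sparse), shannon_entropy(dense))  # 0.0, log(6)
    print(renyi_entropy(dense, alpha=0.5))                  # also log(6)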

Page 86: Topic Models for Signal Processing - Carnegie Mellon University

The Entropic Prior

  P(\Theta) = \frac{1}{J} \exp(-\beta H(\Theta))

• J is a normalization factor
• For positive β, distributions with low entropy (favoring sparsity) have higher probability
  – Larger β promotes sparsity more aggressively
• Any definition of entropy is acceptable

Page 87: Topic Models for Signal Processing - Carnegie Mellon University

Entropic Prior vs. Dirichlet Prior

• Left: Dirichlet prior. Right: (Shannon) entropic prior
• Entropic prior: a smoother surface, promotes sparsity from all points
  – More stable estimates

Page 88: Topic Models for Signal Processing - Carnegie Mellon University

Estimation with sparsity

• Maximum a posteriori estimation:

  \hat{\Theta} = \arg\max_\Theta P(X|\Theta) P(\Theta) = \arg\max_\Theta \log P(X|\Theta) + \log P(\Theta)

• X is the spectrogram {S_t(f)}; the likelihood of the spectrogram:

  \log P(\{S_t(f)\ \forall t,f\}) = \sum_t \sum_f S_t(f) \log \sum_z P_t(z) P(f|z)

• Θ = the set of multinomial parameters: {P_t(Z)} and {P(f|Z)}
• Sparsity can be enforced on the bases P(f|Z) too:

  P(\Theta) \propto \exp\Big(-\alpha_z \sum_t H(\{P_t(Z)\})\Big) \exp\Big(-\alpha_f \sum_z H(\{P(f|Z)\})\Big)

• α_z, α_f are binary 1/0 indicators that specify whether sparsity is needed

Page 89: Topic Models for Signal Processing - Carnegie Mellon University

MAP estimation with Sparsity

  \{\hat{P}_t(Z)\}, \{\hat{P}(f|Z)\} = \arg\max_{\{P_t(Z)\},\{P(f|Z)\}} \sum_t \sum_f S_t(f) \log \sum_z P_t(z) P(f|z) - \alpha_z \sum_t H(\{P_t(Z)\}) - \alpha_f \sum_z H(\{P(f|Z)\}) + C

• Exactly the same as ML estimation, with additional penalty terms derived from entropy
• Either the mixture weights P_t(Z) or the bases P(f|Z) (or both) can be estimated with sparsity
  – By setting α_z and α_f to 1 if sparsity is needed and 0 otherwise
• The solutions are very similar to the ML estimates with EM

Page 90: Topic Models for Signal Processing - Carnegie Mellon University

Sparse Estimate with Entropic Priors

• Fragment and count:

  P_t(z|f) = \frac{P_t(z) P(f|z)}{\sum_{z'} P_t(z') P(f|z')}

  count_t(z) = \sum_f S_t(f) P_t(z|f) \qquad count(f|z) = \sum_t S_t(f) P_t(z|f)

• ML (non-sparse) estimates:

  P_t(Z) = \frac{count_t(Z)}{\sum_{Z'} count_t(Z')} \qquad P(f|Z) = \frac{count(f|Z)}{\sum_{f'} count(f'|Z)}

• Sparse estimates:

  P_t(Z) = g_{sparse}(\{count_t(Z)\}, \alpha_z) \qquad P(f|Z) = g_{sparse}(\{count(f|Z)\}, \alpha_f)

Page 91: Topic Models for Signal Processing - Carnegie Mellon University

What is gsparse()

• For H() = Shannon entropy, g_{sparse}(\{\omega(i)\}, \beta) runs a number of iterations of the fixed point

  P(i) = \frac{-\omega(i)/\beta}{W\!\left(-\omega(i)\, e^{1+\lambda/\beta}/\beta\right)}

  – λ is the Lagrange multiplier chosen so that \sum_i P(i) = 1
  – W() is Lambert's W function

• For H() = Renyi entropy, g_{sparse}(\{\omega(i)\}, \beta) runs a number of iterations of

  Q(i) = \omega(i) - \frac{\alpha\beta}{1-\alpha} \frac{P(i)^\alpha}{\sum_i P(i)^\alpha}

  P(i) = \frac{Q(i)}{\sum_i Q(i)}

Page 92: Topic Models for Signal Processing - Carnegie Mellon University

Synthetic Example: Estimation with Sparsity

• Top and middle panels: non-sparse estimator
  – As the number of bases increases, the bases migrate towards the corners of the unit simplex
• Bottom panel: sparse estimator
  – The simplex formed by the bases shrinks to fit the data

Page 93: Topic Models for Signal Processing - Carnegie Mellon University

The Vowels and Music Examples

• Sparsity is applied only to the mixture weights (though the solution is not overcomplete)
• Left panel, non-sparse learning: most bases have significant energy in all frames
• Right panel, sparse learning: fewer bases are active within any frame

Page 94: Topic Models for Signal Processing - Carnegie Mellon University

Sparsity on components vs. weights

• Increasing the sparsity of the mixture weights makes the bases denser
  – And vice versa
  – Preserving net information

[Figure: sparse bases with high-entropy mixture weights vs. high-entropy bases with sparse mixture weights]

Page 95: Topic Models for Signal Processing - Carnegie Mellon University

Cross-Entropy as a prior

• So far we have considered sparsity of individual parameters
  – E.g. P(Z), P(f|Z) for each Z, etc.
• We can also impose constraints on relations between different bases
  – E.g. make P(f|Z) for different Zs maximally different
    • Make individual bases maximally dissimilar

Page 96: Topic Models for Signal Processing - Carnegie Mellon University

Cross-Entropy Prior

• The "symmetric" cross entropy:

  H(P,Q) = -\sum_i P_i \log Q_i - \sum_i Q_i \log P_i

• The Shannon cross-entropic prior:

  P(\{P(f|Z_1)\}, \{P(f|Z_2)\}, \ldots) \propto \exp\Big(\nu \sum_{Z,Z'} H(\{P(f|Z)\}, \{P(f|Z')\})\Big)

  – The solution can be manipulated by varying ν

• The objective function to optimize:

  \{\hat{P}_t(Z)\}, \{\hat{P}(f|Z)\} = \arg\max \sum_t \sum_f S_t(f) \log \sum_z P_t(z) P(f|z) + \nu \sum_{Z,Z'} H(\{P(f|Z)\}, \{P(f|Z')\}) + C

Page 97: Topic Models for Signal Processing - Carnegie Mellon University

Estimation with Cross-Entropic Priors

  P(f|Z) = \frac{count(f|Z) - \nu \sum_{Z' \neq Z} P(f|Z')}{\sum_{f'} \left( count(f'|Z) - \nu \sum_{Z' \neq Z} P(f'|Z') \right)}

• Can be extended to minimize the cross-entropy between groups of bases:

  P(f|Z_a) = \frac{count(f|Z_a) - \nu \sum_{Z_b} P(f|Z_b)}{\sum_{f'} \left( count(f'|Z_a) - \nu \sum_{Z_b} P(f'|Z_b) \right)}

Page 98: Topic Models for Signal Processing - Carnegie Mellon University

Temporal Priors

[Figure: a spectral trajectory through the simplex formed by the bases P(f|Z_1), P(f|Z_2), P(f|Z_3)]

• Other mechanisms may impose "temporal" priors
  – Imposing temporal constraints on the trajectory through the simplex
  – More from Paris

Page 99: Topic Models for Signal Processing - Carnegie Mellon University

The limits of Sparsity

  (D x T)  =  (D x K) · (K x T)

• Sparse estimation permits K >> D
• The largest value with unique solutions is K = T
  – The training data themselves become the bases
  – The weights matrix becomes an identity matrix
• For K > T the solution becomes indeterminate

Page 100: Topic Models for Signal Processing - Carnegie Mellon University

Example based representation

• Use the training vectors themselves as bases
• A justification:
  – "Learning": learned bases are linear combinations of normalized data
    • They can lie in regions not visited by the data
  – A data-based representation has no such restrictions
• Need not store all training vectors
  – Random sampling is sufficiently effective
• The "building blocks" metaphor is no longer valid, however

Page 101: Topic Models for Signal Processing - Carnegie Mellon University

Patterns Beyond a Record

• The techniques discussed so far are effective at extracting structure
• However, structure is discovered at the level of entire records
  – Each basis spans the entire frequency axis
• What about structures that extend beyond a record?
  – Or substructures within a record?

Page 102: Topic Models for Signal Processing - Carnegie Mellon University

Patterns extend beyond a single frame

• Four bars from a music example
• The spectral patterns are actually patches
  – Not all frequencies fall off in time at the same rate
• The basic unit is a spectral patch, not a spectrum
• Extend the model to consider this phenomenon

Page 103: Topic Models for Signal Processing - Carnegie Mellon University

Shift-Invariant Model

[Figure: super-pots Z = 1 ... M, each with sub-pots P(T|Z) and P(t,f|Z)]

• Employs the bag of spectrograms model
• Each "super-pot" has two sub-pots
  – One sub-pot now stores bi-variate distributions
    • Each ball has a (t,f) pair marked on it – the bases
  – Balls in the other sub-pot merely have a time "T" marked on them – the "location"

Page 104: Topic Models for Signal Processing - Carnegie Mellon University

The shift-invariant model

[Figure: a draw selects a super-pot Z, then T from P(T|Z) and (t,f) from P(t,f|Z), and increments the histogram at (T+t, f); repeat N times]

  P(t,f) = \sum_Z P(Z) \sum_T P(T|Z) P(t-T, f|Z)
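
A minimal numpy sketch of this generative equation: each patch basis is placed at every time location, weighted by P(z) P(T|z), which is a convolution in time. The toy shapes are assumptions.

    import numpy as np

    def shift_invariant_reconstruct(Pz, PTz, patches):
        """P(t,f) = sum_z P(z) sum_T P(T|z) P(t-T, f|z).
        Pz: (K,), PTz: (K, T_loc), patches: (K, t_patch, F)."""
        K, T_loc = PTz.shape
        _, t_patch, F = patches.shape
        out = np.zeros((T_loc + t_patch - 1, F))
        for z in range(K):
            for T in range(T_loc):
                # place patch z at location T, weighted by P(z) P(T|z)
                out[T : T + t_patch] += Pz[z] * PTz[z, T] * patches[z]
        return out

    # Toy example: 2 patch bases of size 3 x 4 (time x freq), 10 locations
    rng = np.random.default_rng(0)
    patches = rng.random((2, 3, 4))
    patches /= patches.sum(axis=(1, 2), keepdims=True)
    PTz = rng.random((2, 10)); PTz /= PTz.sum(axis=1, keepdims=True)
    Pz = np.array([0.6, 0.4])
    P_tf = shift_invariant_reconstruct(Pz, PTz, patches)
    print(P_tf.sum())  # ~1: still a distribution over (t, f)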

Page 105: Topic Models for Signal Processing - Carnegie Mellon University

Estimating Parameters

• The maximum likelihood estimate follows the fragmentation and counting strategy
• Two-step fragmentation
  – Each instance is fragmented into the super-pots
  – The fragment in each super-pot is further fragmented into each time-shift
    • Since one can arrive at a given (t,f) by selecting any T from P(T|Z) and the appropriate shift t−T from P(t,f|Z)

Page 106: Topic Models for Signal Processing - Carnegie Mellon University

Shift-invariant model: Update Rules

• Given data (spectrogram) S(t,f)
• Initialize P(Z), P(T|Z), P(t,f|Z)
• Iterate:

Model:

$$P(t,f,Z) = P(Z)\sum_T P(T \mid Z)\,P(t-T,\, f \mid Z)$$

Fragment:

$$P(Z \mid t,f) = \frac{P(t,f,Z)}{\sum_{Z'} P(t,f,Z')}
\qquad
P(T \mid Z,t,f) = \frac{P(T \mid Z)\,P(t-T,\, f \mid Z)}{\sum_{T'} P(T' \mid Z)\,P(t-T',\, f \mid Z)}$$

Count:

$$P(Z) = \frac{\sum_{t,f} P(Z \mid t,f)\,S(t,f)}{\sum_{Z'}\sum_{t,f} P(Z' \mid t,f)\,S(t,f)}
\qquad
P(T \mid Z) = \frac{\sum_{t,f} P(Z \mid t,f)\,P(T \mid Z,t,f)\,S(t,f)}{\sum_{T'}\sum_{t,f} P(Z \mid t,f)\,P(T' \mid Z,t,f)\,S(t,f)}$$

$$P(t,f \mid Z) = \frac{\sum_{T} P(Z \mid T,f)\,P(T-t \mid Z,T,f)\,S(T,f)}{\sum_{t',f'}\sum_{T} P(Z \mid T,f')\,P(T-t' \mid Z,T,f')\,S(T,f')}$$
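As a concrete reference, here is one way to implement the fragment-and-count iteration above in NumPy. This is a sketch under our own naming and shape conventions (patches of kt frames spanning the full frequency axis, time shifts only); it is not the tutorial's code.

```python
import numpy as np

def siplca_time(S, M, kt, n_iter=50, eps=1e-12, seed=0):
    """Sketch of the fragment-and-count (EM) updates above for the
    time-shift-invariant model. S: nonnegative spectrogram (nt, nf)."""
    nt, nf = S.shape
    nT = nt - kt + 1                                  # admissible shifts T
    rng = np.random.default_rng(seed)
    Pz = np.full(M, 1.0 / M)                          # P(Z)
    PT = rng.random((M, nT)); PT /= PT.sum(1, keepdims=True)           # P(T|Z)
    Pk = rng.random((M, kt, nf)); Pk /= Pk.sum((1, 2), keepdims=True)  # P(t,f|Z)
    for _ in range(n_iter):
        # Model: Lam(t,f) = sum_Z P(Z) sum_T P(T|Z) P(t-T, f|Z)
        Lam = np.zeros((nt, nf))
        for z in range(M):
            for T in range(nT):
                Lam[T:T + kt] += Pz[z] * PT[z, T] * Pk[z]
        R = S / (Lam + eps)
        # Fragment + count in one pass: c holds the expected counts
        # attributed to component z at shift T (posterior times S)
        nPT = np.zeros_like(PT)
        nPk = np.zeros_like(Pk)
        for z in range(M):
            for T in range(nT):
                c = Pz[z] * PT[z, T] * Pk[z] * R[T:T + kt]
                nPT[z, T] = c.sum()
                nPk[z] += c
        Pz = nPT.sum(1) / (nPT.sum() + eps)                # new P(Z)
        PT = nPT / (nPT.sum(1, keepdims=True) + eps)       # new P(T|Z)
        Pk = nPk / (nPk.sum((1, 2), keepdims=True) + eps)  # new P(t,f|Z)
    return Pz, PT, Pk
```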

Page 107: Topic Models for Signal Processing - Carnegie Mellon University

An Example

• Two distinct sounds occurring with different repetition rates within a signal

[Figure panels: input spectrogram; discovered “patch” bases; contribution of individual bases to the recording]

Page 108: Topic Models for Signal Processing - Carnegie Mellon University

Shift-Invariance in Two Dimensions

• Patterns may be substructures
– Repeating patterns that may occur anywhere

– Not just in the same frequency or time location
• More apparent in image data


Page 109: Topic Models for Signal Processing - Carnegie Mellon University

The two‐D Shift‐Invariant Model

[Figure: M super-pots, Z = 1 … M, each holding two sub-pots, P(T,F|Z) and P(t,f|Z)]

• Both sub-pots are distributions over (time, frequency) pairs
– One sub-pot represents the basic pattern
• The basis
– The other sub-pot represents the location


Page 110: Topic Models for Signal Processing - Carnegie Mellon University

The shift‐invariant model

[Figure: the generative draw – pick a super-pot Z; draw a location (T,F) from P(T,F|Z) and a (t,f) pair from P(t,f|Z); the ball lands at (T+t, f+F); repeat N times]

$$P(t,f) = \sum_Z \sum_{T,F} P(Z)\,P(T,F \mid Z)\,P(t-T,\, f-F \mid Z)$$
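Note that the inner sum over (T,F) in the equation above is exactly a 2-D convolution of the location distribution with the patch, so the model can be evaluated with standard signal-processing tools. A minimal sketch under our own names, assuming SciPy is available:

```python
import numpy as np
from scipy.signal import fftconvolve

# Each component of the mixture is location * patch (2-D convolution);
# array names are ours, not the tutorial's.
def model_2d(Pz, Ploc, Pk):
    # Pz: (M,) mixture weights; Ploc[z]: P(T,F|z); Pk[z]: P(t,f|z)
    return sum(Pz[z] * fftconvolve(Ploc[z], Pk[z], mode='full')
               for z in range(len(Pz)))
```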

Page 111: Topic Models for Signal Processing - Carnegie Mellon University

Two-D Shift Invariance: Estimation

• Fragment-and-count strategy
• Fragment into super-pots, but also into each T and F

– Since a given (t,f) can be obtained from any (T,F) 

Model:

$$P(t,f,Z) = P(Z)\sum_{T,F} P(T,F \mid Z)\,P(t-T,\, f-F \mid Z)$$

Fragment:

$$P(Z \mid t,f) = \frac{P(t,f,Z)}{\sum_{Z'} P(t,f,Z')}
\qquad
P(T,F \mid Z,t,f) = \frac{P(T,F \mid Z)\,P(t-T,\, f-F \mid Z)}{\sum_{T',F'} P(T',F' \mid Z)\,P(t-T',\, f-F' \mid Z)}$$

Count:

$$P(Z) = \frac{\sum_{t,f} P(Z \mid t,f)\,S(t,f)}{\sum_{Z'}\sum_{t,f} P(Z' \mid t,f)\,S(t,f)}
\qquad
P(T,F \mid Z) = \frac{\sum_{t,f} P(Z \mid t,f)\,P(T,F \mid Z,t,f)\,S(t,f)}{\sum_{T',F'}\sum_{t,f} P(Z \mid t,f)\,P(T',F' \mid Z,t,f)\,S(t,f)}$$

$$P(t,f \mid Z) = \frac{\sum_{T,F} P(Z \mid T,F)\,P(T-t,\, F-f \mid Z,T,F)\,S(T,F)}{\sum_{t',f'}\sum_{T,F} P(Z \mid T,F)\,P(T-t',\, F-f' \mid Z,T,F)\,S(T,F)}$$
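Because both the model and the posterior sums are convolutions and correlations, the 2-D fragment-and-count updates can be written compactly with SciPy. The following is a sketch under our own conventions (names, shapes, initialization), not the tutorial's implementation:

```python
import numpy as np
from scipy.signal import fftconvolve, correlate

def siplca_2d(S, M, kshape, n_iter=50, eps=1e-12, seed=0):
    """Sketch of the 2-D fragment-and-count updates above.
    S: nonnegative array (na, nb); kshape: patch size (kt, kf)."""
    na, nb = S.shape
    kt, kf = kshape
    rng = np.random.default_rng(seed)
    Pz = np.full(M, 1.0 / M)                           # P(Z)
    Ploc = rng.random((M, na - kt + 1, nb - kf + 1))   # P(T,F|Z)
    Ploc /= Ploc.sum((1, 2), keepdims=True)
    Pk = rng.random((M, kt, kf))                       # P(t,f|Z)
    Pk /= Pk.sum((1, 2), keepdims=True)
    for _ in range(n_iter):
        # Model: each component is location * patch (2-D convolution)
        Lam = sum(Pz[z] * fftconvolve(Ploc[z], Pk[z], mode='full')
                  for z in range(M))
        R = S / (Lam + eps)
        for z in range(M):
            # 'valid' correlations realize the sums over (t,f) and (T,F)
            # in the count equations above
            nloc = Ploc[z] * correlate(R, Pk[z], mode='valid')
            nk = Pk[z] * correlate(R, Ploc[z], mode='valid')
            Pz[z] *= nloc.sum()
            Ploc[z] = nloc / (nloc.sum() + eps)
            Pk[z] = nk / (nk.sum() + eps)
        Pz /= Pz.sum()
    return Pz, Ploc, Pk
```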

Page 112: Topic Models for Signal Processing - Carnegie Mellon University

Shift-Invariance: Comments

• P(T,F|Z) and P(t,f|Z) are symmetric

– Cannot control which of them learns the patterns and which the locations

• Answer: constraints
– Constrain the size of P(t,f|Z)

• I.e. the size of the basic patch

– Impose sparsity on the location distribution P(T,F|Z)
• Patches occur only occasionally, and their locations are inherently sparse

– Sparsity is obtained simply by applying the sparsity operation to the counts, as before
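The exact sparsity operation is defined earlier in the tutorial; as a stand-in, one simple and commonly used trick is to raise the counts to a power greater than one and renormalize, which sharpens peaks and pushes small values toward zero. A hedged sketch, with our own function name:

```python
import numpy as np

def sparsify(counts, beta=1.5, eps=1e-12):
    # Stand-in for the tutorial's sparsity operation: raising the counts
    # to a power beta > 1 and renormalizing sharpens peaks and suppresses
    # small values, biasing the estimate of P(T,F|Z) toward sparse
    # location maps. Applied to the location counts in each EM iteration.
    p = np.maximum(counts, 0.0) ** beta
    return p / (p.sum() + eps)
```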


Page 113: Topic Models for Signal Processing - Carnegie Mellon University

Shift‐Invariance in Many Dimensions

• The generic notion of “shift-invariance” can be extended to multivariate data
– Not just two-D data like images and spectrograms

• Shift invariance can be applied to any subset of variables


Page 114: Topic Models for Signal Processing - Carnegie Mellon University

Example: 2‐D shift invariance

• Sparse decomposition is employed in this example
– Otherwise the locations of faces (bottom right panel) are not precisely determined

Page 115: Topic Models for Signal Processing - Carnegie Mellon University

Example: 3‐D shift invariance

• The original figure has multiple handwritten renderings of three characters
– In different colours

• The algorithm learns the three characters and identifies their locations in the figure

[Figure panels: input data; discovered patches; patch locations]


Page 116: Topic Models for Signal Processing - Carnegie Mellon University

Beyond shift‐invariance: transform invariance

• The draws from the urns may not only be shifted, but also transformed
• The arithmetic remains very similar to the shift-invariant model
– We must now impose one of an enumerated set of transforms to (t,f), after shifting them by (T,F)
– In the estimation, the precise transform applied is an unseen variable (see the sketch below)
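A minimal sketch of what “an enumerated set of transforms” could look like, assuming (purely for illustration; the tutorial does not fix the set) that the transforms are the four 90-degree rotations of a square patch. The model simply gains one extra sum over the transform index r:

```python
import numpy as np
from scipy.signal import fftconvolve

# Illustrative only: four 90-degree rotations of a square patch as the
# enumerated transform set; names and shapes are our own convention.
def transform_invariant_model(Pz, Pr, Ploc, Pk):
    # Pz: (M,); Pr[z]: (4,) P(r|Z); Ploc[z]: P(T,F|Z); Pk[z]: square patch
    lam = 0.0
    for z in range(len(Pz)):
        for r in range(4):
            patch = np.rot90(Pk[z], k=r)       # apply transform r to (t,f)
            lam = lam + Pz[z] * Pr[z][r] * fftconvolve(Ploc[z], patch,
                                                       mode='full')
    return lam
# In estimation, r is the unseen variable: the posterior is computed over
# (Z, r, T, F) jointly, then fragmented and counted as before.
```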


Page 117: Topic Models for Signal Processing - Carnegie Mellon University

Example: Transform Invariance

• Top left: original figure
• Bottom left: the two bases discovered
• Bottom right:

– Left panel: positions of “a”
– Right panel: positions of “l”

• Top right: estimated distribution underlying the original figure

Page 118: Topic Models for Signal Processing - Carnegie Mellon University

Relationship to Other Techniques

• PCA/ICA: none
– The topic model does not impose orthogonality or independence constraints

• Although they can be imposed via other priors

• PARAFAC:
– Tensor models provide multivariate decompositions similarly to PARAFAC

• Minimizing KL divergence, rather than L2


Page 119: Topic Models for Signal Processing - Carnegie Mellon University

Relationship to other Techniques

• NMF:
– Basic PLCA is provably identical to NMF, within a scaling constant

– However, PLCA provides a handle for an additional statistical framework

• Entropic and cross-entropic priors
• Anti-priors

– Nevertheless, fundamentally similar
• With greater mathematical elegance and ease of algorithm development (a small sketch of the correspondence follows)
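The claimed equivalence can be made explicit. If V ≈ WH is a KL-NMF of a nonnegative matrix normalized to sum to one, the factors renormalize into the PLCA quantities P(f|Z), P(Z), and P(t|Z); the function name below is ours:

```python
import numpy as np

def nmf_to_plca(W, H):
    # If V ~ W @ H (KL-NMF of a matrix normalized to sum to 1), the
    # factors renormalize into the basic (non-shifted) PLCA quantities:
    Pf_z = W / W.sum(axis=0, keepdims=True)        # P(f|Z): normalized bases
    g = W.sum(axis=0) * H.sum(axis=1)              # unnormalized P(Z)
    Pz = g / g.sum()
    Pt_z = H / H.sum(axis=1, keepdims=True)        # P(t|Z): normalized activations
    return Pf_z, Pz, Pt_z
```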


Page 120: Topic Models for Signal Processing - Carnegie Mellon University

Over to Paris
