Data Association for Topic Intensity Tracking


Page 1: Data Association for Topic Intensity Tracking

Data Association for Topic Intensity Tracking

Andreas Krause, Jure Leskovec

Carlos Guestrin

School of Computer Science, Carnegie Mellon University

Page 2: Data Association for Topic Intensity Tracking

Document classification. Two topics: Conference and Hiking.

Email 1: "Will you go to ICML too?" P(Conference | words) = 0.9 (Conference)

Email 2: "Let's go hiking on Friday!" P(Conference | words) = 0.1 (Hiking)

Page 3: Data Association for Topic Intensity Tracking

A more difficult example. Two topics: Conference and Hiking.

What if we had temporal information? How about modeling the emails with an HMM?

Email 1 (2:00 pm): "Let's have dinner after the talk." P(Conference | words) = 0.7

Email 2 (2:03 pm): "Should we go on Friday?" P(Conference | words) = 0.5 (could refer to both topics!)

[HMM diagram: hidden topic variables C_1, C_2, …, C_t, C_{t+1} emitting documents D_1, D_2, …, D_t, D_{t+1}]

Assumes equal time steps and "smooth" topic changes. Are these valid assumptions?

Page 4: Data Association for Topic Intensity Tracking

Typical email traffic

Email traffic is very bursty and cannot be modeled with uniform time steps. Bursts tell us how intensely a topic is being pursued.

Bursts are potentially very interesting!

[Figure: number of emails per day over roughly 200 days, showing bursty traffic]

Page 5: Data Association for Topic Intensity Tracking

Identifying both topics and bursts. Given:

A stream of documents (emails):

d1, d2, d3, …

and corresponding document inter-arrival times (time between consecutive documents):

Δ1, Δ2, Δ3, ...

Simultaneously: classify (or cluster) the documents into K topics, and predict the topic intensities, i.e., the time between consecutive documents from the same topic.

Page 6: Data Association for Topic Intensity Tracking

Data association problem. If we know the email topics, we can identify bursts. If we don't know the topics, we can't identify bursts!

Naïve solution: first classify the documents, then identify bursts. This can fail badly!

This paper: simultaneously identify topics and bursts!

[Timeline figure: a stream of Conference and Hiking emails over time; at first, high intensity for "Conference" and low intensity for "Hiking", later low intensity for "Conference" and high intensity for "Hiking"; without topic labels, the intensity of each topic is unknown ("???")]

Page 7: Data Association for Topic Intensity Tracking

The Task. We have to solve a data association problem. We observe:

Message Deltas – time between the arrivals of consecutive documents

We want to estimate: Topic Deltas, the time between messages of the same topic. From these we can compute the topic intensity L (under the exponential model, the expected topic delta is 1/L).

Therefore, we need to associate each document with a topic. Chicken-and-egg problem: we need the topics to identify the intensities, and we need the intensities to classify (better).

Page 8: Data Association for Topic Intensity Tracking

How to reason about topic deltas? Associate with each email a timestamp vector of topic arrival times.

Timestamp vector [τ(C), τ(H)]: the next arrival time of an email from Conference and from Hiking.

Email 1 (Conference, at 2:00 pm): τ_1 = [C: 2:00 pm, H: 2:30 pm]
Email 2 (Hiking, at 2:30 pm): τ_2 = [C: 4:15 pm, H: 2:30 pm]
Email 3 (Conference, at 4:15 pm): τ_3 = [C: 4:15 pm, H: 7:30 pm]

Message Δ = 30 min (between consecutive messages, e.g., emails 1 and 2).
Topic Δ = 2 h 15 min (between consecutive messages of the same topic, e.g., emails 1 and 3).
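To make the distinction concrete, here is a minimal Python sketch (illustrative only; the function and variable names are ours, not the paper's) that computes message deltas and per-topic deltas from a stream of labeled arrival times:

```python
from collections import defaultdict

def message_and_topic_deltas(arrivals):
    """arrivals: list of (time_in_minutes, topic), sorted by time."""
    message_deltas = []                 # between consecutive messages (any topic)
    topic_deltas = defaultdict(list)    # between consecutive messages of one topic
    last_time, last_per_topic = None, {}
    for t, topic in arrivals:
        if last_time is not None:
            message_deltas.append(t - last_time)
        if topic in last_per_topic:
            topic_deltas[topic].append(t - last_per_topic[topic])
        last_time = t
        last_per_topic[topic] = t
    return message_deltas, dict(topic_deltas)

# The example above: emails at 2:00 pm (C), 2:30 pm (H), 4:15 pm (C), in minutes after noon.
msg, top = message_and_topic_deltas([(120, "C"), (150, "H"), (255, "C")])
print(msg)   # [30, 105]    -> message delta between emails 1 and 2 is 30 min
print(top)   # {'C': [135]} -> topic delta for Conference is 2 h 15 min
```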

Page 9: Data Association for Topic Intensity Tracking

Generative Model (conceptual)

[Graphical model: intensity chains L_{t-1}(C), L_t(C), L_{t+1}(C) and L_{t-1}(H), L_t(H), L_{t+1}(H); timestamp vectors τ_{t-1}, τ_t, τ_{t+1}; topic indicator C_t; document D_t; inter-arrival time Δ_t]

τ_t = [τ_t(C), τ_t(H)]: time of the next email from each topic (exponentially distributed)
Δ_t: time between subsequent emails
C_t: topic indicator (e.g., C_t = "Conference")
D_t: document (e.g., bag of words)
L_t(C): intensity for "Conference" (parameter of the exponential distribution)
L_t(H): intensity for "Hiking" (parameter of the exponential distribution)

Problem: we need to reason about the entire history of timestamps τ_t! This makes inference intractable, even for a few topics.
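To make the generative story concrete, here is a rough Python sketch of this conceptual model under simplifying assumptions of ours (two topics, intensities that jump between one "low" and one "high" level with a small switching probability); it is not the paper's exact parameterization. Note how the sampler has to carry the full timestamp vector τ forward, which is exactly what makes exact inference hard.

```python
import random

TOPICS = ["C", "H"]
LEVELS = {"C": (0.1, 2.0), "H": (0.1, 2.0)}   # assumed low / high intensity (emails per hour)
SWITCH = 0.05                                  # assumed probability of switching level per step

def sample_stream(n_msgs):
    """Sample (Delta_t, C_t) pairs from the conceptual two-topic model."""
    level = {"C": 0, "H": 1}                   # current intensity level of each topic
    now = 0.0
    # timestamp vector tau: pending next-arrival time of each topic
    tau = {k: random.expovariate(LEVELS[k][level[k]]) for k in TOPICS}
    stream = []
    for _ in range(n_msgs):
        topic = min(tau, key=tau.get)          # the earliest pending arrival is the next email
        delta = tau[topic] - now               # observed inter-arrival time Delta_t
        now = tau[topic]
        stream.append((delta, topic))
        # intensity chains L_t may switch levels over time
        for k in TOPICS:
            if random.random() < SWITCH:
                level[k] = 1 - level[k]
        # the topic that just fired draws a fresh next-arrival time from Exp(L)
        tau[topic] = now + random.expovariate(LEVELS[topic][level[topic]])
    return stream

for delta, topic in sample_stream(5):
    print(f"after {delta:5.2f} h: email from topic {topic}")
```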

Page 10: Data Association for Topic Intensity Tracking

Key observation: if the topic deltas follow an exponential distribution, then

P(τ_{t+1}(C) > 4 pm | τ_t(C) = 2 pm, it is now 3 pm) = P(τ_{t+1}(C) > 4 pm | τ_t(C) = 3 pm, it is now 3 pm)

Exploit memorylessness to discard the timestamp vectors τ_t.

The exponential distribution is appropriate: it was used in previous work on document streams (e.g., Kleinberg '03), it is frequently used to model transition times, and by adding hidden variables one can model arbitrary transition distributions (cf. Nodelman et al.).

Last arrival time irrelevant!
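For reference, this is the standard memorylessness property of the exponential distribution that the slide relies on (a textbook fact, stated in generic notation):

```latex
% X ~ Exp(\lambda): waiting time until the next arrival of a given topic.
% Having already waited s does not change the distribution of the remaining wait:
\Pr(X > s + t \mid X > s)
  = \frac{\Pr(X > s + t)}{\Pr(X > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = \Pr(X > t).
```

Applied to the example: whether the last Conference email arrived at 2 pm or at 3 pm, once we know that no new one has arrived by 3 pm, the remaining wait has the same distribution, so the last arrival times need not be kept in the state.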

Page 11: Data Association for Topic Intensity Tracking

Generative Model (conceptual)

[Graphical model repeated: intensity chains L_{t-1}(C), L_t(C), L_{t+1}(C) and L_{t-1}(H), L_t(H), L_{t+1}(H); timestamp vectors τ_{t-1}, τ_t, τ_{t+1}; topic indicator C_t; document D_t; inter-arrival time Δ_t]

τ_t = [τ_t(C), τ_t(H)]: time of the next email from each topic (exponential distribution)
Δ_t: time between subsequent emails
C_t: topic indicator (e.g., C_t = "Conference")
D_t: document representation (words)
L_t(C), L_t(H): intensities for "Conference" and "Hiking"

Implicit Data Association (IDA) Model

Page 12: Data Association for Topic Intensity Tracking

Key modeling trick: implicit data association (IDA) via exponential order statistics.

Δ_t | L_t ∼ min { Exp(L_t(C)), Exp(L_t(H)) }
C_t | L_t ∼ argmin { Exp(L_t(C)), Exp(L_t(H)) }

There is a simple closed form for these order statistics! This is quite a general modeling idea; it (essentially) turns the model into a factorial HMM, for which many efficient inference techniques are available.
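The closed form referred to here is the standard order-statistics result for independent exponentials (a textbook fact, written in the slide's notation):

```latex
% If the pending waits of the two topics are independent and exponential,
% Delta^{(C)} ~ Exp(L_t(C)) and Delta^{(H)} ~ Exp(L_t(H)), then
\Pr(\Delta_t > \delta \mid L_t) = e^{-\left(L_t(C) + L_t(H)\right)\,\delta}
  \quad\Longrightarrow\quad \Delta_t \mid L_t \sim \mathrm{Exp}\!\left(L_t(C) + L_t(H)\right),
\qquad
\Pr(C_t = \text{Conference} \mid L_t) = \frac{L_t(C)}{L_t(C) + L_t(H)} .
```

Moreover, the minimum and the argmin of independent exponentials are independent of each other, so Δ_t and C_t decouple given the intensities; this is what makes the per-step likelihood tractable once the timestamp vectors are marginalized out.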

[The three-email timestamp example from Page 8 is repeated, next to the simplified single-slice model: L_t(C) and L_t(H) generate Δ_t, C_t, and D_t]

Page 13: Data Association for Topic Intensity Tracking

Inference Procedures. We consider:

Full (conceptual) model: particle filter.
Simplified model: particle filter, fully factorized mean field, and exact inference.

We compare against a Weighted Automaton Model (WAM) for single topics, proposed by Kleinberg (first classify, then identify bursts).
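As a rough illustration of one of these procedures, the sketch below runs a bootstrap particle filter over discretized intensity levels for the simplified model, using the order-statistics likelihood from the previous page. The two intensity levels, the switching probability, and the omission of the document likelihood P(D_t | C_t) (topics are treated as observed here) are simplifying assumptions of ours, not the paper's algorithm.

```python
import math
import random

TOPICS = ["C", "H"]
LEVELS = [0.1, 2.0]      # assumed low / high intensity values
SWITCH = 0.05            # assumed probability that a topic switches level per step

def loglik(delta, topic, rates):
    """log P(delta, topic | intensities) from the exponential order statistics:
    delta ~ Exp(sum of rates) and P(topic) = rate_topic / sum of rates."""
    total = sum(rates.values())
    return math.log(total) - total * delta + math.log(rates[topic] / total)

def particle_filter(stream, n_particles=500):
    """stream: list of (delta, topic). Returns a per-step posterior-mean
    intensity estimate for each topic."""
    particles = [{k: random.randrange(len(LEVELS)) for k in TOPICS}
                 for _ in range(n_particles)]
    estimates = []
    for delta, topic in stream:
        # propagate: each topic's intensity level may switch
        for p in particles:
            for k in TOPICS:
                if random.random() < SWITCH:
                    p[k] = 1 - p[k]
        # weight by the likelihood of the observed (delta, topic)
        logw = [loglik(delta, topic, {k: LEVELS[p[k]] for k in TOPICS})
                for p in particles]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        z = sum(w)
        w = [x / z for x in w]
        estimates.append({k: sum(w[i] * LEVELS[particles[i][k]]
                                 for i in range(n_particles)) for k in TOPICS})
        # resample particles in proportion to their weights
        particles = [dict(p) for p in random.choices(particles, weights=w, k=n_particles)]
    return estimates

# e.g. run it on the output of the sampler sketched earlier:
# estimates = particle_filter(sample_stream(200))
```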

Page 14: Data Association for Topic Intensity Tracking

Results (Synthetic data). Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…

[Figure: topic delta vs. message number; true topic deltas, with misclassification noise marked]

Page 15: Data Association for Topic Intensity Tracking

Results (Synthetic data). Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…

[Figure: topic delta vs. message number; true topic deltas and the particle filter estimate (full model)]

Page 16: Data Association for Topic Intensity Tracking

Results (Synthetic data). Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…

[Figure: topic delta vs. message number; true topic deltas, the particle filter estimate (full model), and exact inference]

Page 17: Data Association for Topic Intensity Tracking

Results (Synthetic data). Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…

[Figure: topic delta vs. message number; true topic deltas, the particle filter estimate (full model), exact inference, and the weighted automaton (first classify, then find bursts)]

Implicit Data Association gets both topics and frequencies right, despite severe (30%) label noise.

Memorylessness trick doesn’t hurt

Separate topic and burst identification fails badly.

Page 18: Data Association for Topic Intensity Tracking

Inference comparison (Synthetic data). Two topics with different frequency patterns.

[Figure: topic delta vs. message number; true topic deltas]

Page 19: Data Association for Topic Intensity Tracking

Inference comparison (Synthetic data). Two topics with different frequency patterns.

[Figure: topic delta vs. message number; true topic deltas and the observed message deltas]

Page 20: Data Association for Topic Intensity Tracking

Inference comparison (Synthetic data). Two topics with different frequency patterns.

[Figure: topic delta vs. message number; true topic deltas, observed message deltas, and exact inference]

Page 21: Data Association for Topic Intensity Tracking

Inference comparison (Synthetic data). Two topics with different frequency patterns.

[Figure: topic delta vs. message number; true topic deltas, observed message deltas, exact inference, and the particle filter]

Page 22: Data Association for Topic Intensity Tracking

Inference comparison (Synthetic data). Two topics with different frequency patterns.

[Figure: topic delta vs. message number; true topic deltas, observed message deltas, exact inference, the particle filter, and mean-field]

Implicit Data Association identifies the true frequency parameters (it does not get distracted by the observed message deltas Δ).

In addition to exact inference (feasible for few topics), several approximate inference techniques perform well.

Page 23: Data Association for Topic Intensity Tracking

Experiments on real document streams.

Enron email corpus: 517,431 emails from 151 employees. We selected 554 messages from the tech-memos and universities folders of Kaminski, a stream spanning December 1999 to May 2001.

Reuters news archive: contains 810,000 news articles. We selected 2,303 documents from four topics: wholesale prices, environment issues, fashion, and obituaries.

Page 24: Data Association for Topic Intensity Tracking

Intensity identification for Enron data

[Figure: topic delta vs. message number for the Enron stream; true topic deltas]

Page 25: Data Association for Topic Intensity Tracking

Enron data

[Figure: topic delta vs. message number; true topic deltas and the WAM estimate]

Page 26: Data Association for Topic Intensity Tracking

Enron data

[Figure: topic delta vs. message number; true topic deltas, WAM, and IDA-IT]

Page 27: Data Association for Topic Intensity Tracking

Enron data

[Figure: topic delta vs. message number; true topic deltas, WAM, and IDA-IT]

Implicit Data Association identifies bursts that are missed by the Weighted Automaton Model (the separate approach).

Page 28: Data Association for Topic Intensity Tracking

Reuters news archive

Again, simultaneous topic and burst identification outperforms the separate approach.

[Two figures: topic delta vs. message number for two Reuters topic streams; true topic deltas, WAM, and IDA-IT]

Page 29: Data Association for Topic Intensity Tracking

What about classification? Temporal modeling effectively changes the class prior over time. What is the impact on classification accuracy?

Page 30: Data Association for Topic Intensity Tracking

Classification performance

Modeling intensity leads to improved classification accuracy

[Bar chart: classification accuracy of Naïve Bayes vs. the IDA model]

Page 31: Data Association for Topic Intensity Tracking

Generalizations.

Learning paradigms: not just the supervised setting, but also unsupervised / semi-supervised learning and active learning (selecting the most informative labels); see the paper for details.

Other document representations.

Other applications: fault detection, activity recognition, …

Page 32: Data Association for Topic Intensity Tracking

Topic tracking

[Graphical model: intensity chains L_{t-1}(C), L_t(C), L_{t+1}(C) and L_{t-1}(H), L_t(H), L_{t+1}(H); topic indicator C_t; document D_t; inter-arrival time Δ_t; and a chain of topic parameters over time]

Δ_t: time between subsequent emails
C_t: topic indicator (e.g., C_t = "Conference")
D_t: document representation (LSI)
L_t(C), L_t(H): intensities for "Conference" and "Hiking"
Topic parameters: the mean of the LSI representation for each topic, tracked over time with a Kalman filter.
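As a rough sketch of that last idea (ours, not the paper's implementation), a per-topic Kalman filter with a random-walk state can track a drifting topic mean; here is a scalar version with assumed process and observation noise, applied to one LSI coordinate of the documents assigned to a topic:

```python
def track_topic_mean(observations, process_var=0.01, obs_var=0.25):
    """Scalar Kalman filter: the topic mean follows a random walk and each
    document provides a noisy observation of it."""
    mean, var = 0.0, 1.0          # prior over the topic mean
    history = []
    for y in observations:
        var += process_var        # predict: the mean may have drifted
        gain = var / (var + obs_var)
        mean += gain * (y - mean) # update with the new document's coordinate
        var *= 1.0 - gain
        history.append(mean)
    return history
```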

Page 33: Data Association for Topic Intensity Tracking

Conclusion.

A general model for data association in data streams.

A principled model for "changing class priors" over time.

Can be used in supervised, unsupervised, and (semi-supervised) active learning settings.

Page 34: Data Association for Topic Intensity Tracking

Conclusion. Surprising performance of the simplified IDA model.

Exponential order statistics enable implicit data association and tractable exact inference

Synergetic effect between intensity estimation and classification on several real-world data sets