A probabilistic model for retrospective news event detection

Zhiwei Li, Bin Wang, Mingjing Li and Wei-Ying Ma. A probabilistic model for retrospective news event detection. In the 28th Annual International ACM SIGIR Conference (SIGIR'2005), 2005. Presenter: Suhan Yu

Introduction

• RED
  – Retrospective news event detection (RED) is defined as the discovery of previously unidentified events in a historical news corpus.
• News event definition
  – A specific thing that happens at a specific place and time.
  – Reported consecutively by many news articles over a period.

Introduction

• Observation:
  – A news article contains two kinds of information:
    • Contents (the focus of most previous research)
    • Timestamps (often ignored)
• This paper's contributions include:
  – Proposing a multi-modal RED algorithm (uses both content and time information)
  – Proposing an approach to determine the approximate number of events from the article count-time distribution.

Characteristics of news articles and events

• The Halloween topic contains many events
  – Each year's Halloween is an event.
• The figure indicates the two most important characteristics:
  – Events are peaks, but in some situations several events can overlap in time.
  – The start and end times of reports on an event are very similar across different websites.

[Figure: article count over time for the Halloween topic, with a peak labeled "event"]

Multi-modal retrospective news event detection method

• Representation of news articles and news events
  – News articles are represented by four kinds of information:
    • Who (persons)
    • Where (locations)
    • What (keywords)
    • When (time): defined as the period between the first article and the last article (time consists of two values)
  – Define a news article and an event as:

$$ article = \{persons, locations, keywords, time\} $$

$$ event = \{persons, locations, keywords, time\} $$

• The four kinds of information of a news article are independent:

$$ p(article) = p(persons)\, p(locations)\, p(keywords)\, p(time) $$

The generative model of news articles

• Contents
  – Unigram models are used to model contents.
  – Persons, locations and keywords are modeled by three separate models.
• Timestamps
  – A Gaussian Mixture Model (GMM) is chosen to model timestamps.
    • A peak is usually modeled by a Gaussian function, where the mean is the position of the peak and the variance reflects the duration of the event.
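A minimal sketch of how the content and timestamp models above could combine into log p(x|e) for one article and one event. The dictionary layout, the function names and the toy values are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def unigram_log_prob(counts, word_probs):
    """Log-probability of a bag of terms under a unigram model."""
    return sum(c * math.log(word_probs[w]) for w, c in counts.items())

def gaussian_log_pdf(t, mean, var):
    """Log-density of a 1-D Gaussian at timestamp t."""
    return -0.5 * (math.log(2 * math.pi * var) + (t - mean) ** 2 / var)

def article_log_prob(article, event):
    """log p(x|e): one Gaussian time term plus three unigram terms."""
    total = gaussian_log_pdf(article["time"], event["mean"], event["var"])
    for field in ("persons", "locations", "keywords"):
        total += unigram_log_prob(article[field], event[field])
    return total

# Hypothetical toy event and article (all values invented for illustration).
event = {
    "mean": 10.0, "var": 4.0,
    "persons": {"mary": 0.5, "john": 0.5},
    "locations": {"paris": 1.0},
    "keywords": {"parade": 0.25, "costume": 0.75},
}
article = {
    "time": 10.0,
    "persons": Counter({"mary": 2}),
    "locations": Counter({"paris": 1}),
    "keywords": Counter({"costume": 1}),
}
log_p = article_log_prob(article, event)
```

Working in log space keeps the product of many small term probabilities from underflowing.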

The generative model of news articles

[Figure: generative model of news articles; N = term space size]

Learning model parameters

• The model parameters can be estimated by the Maximum Likelihood method:

$$ l(X;\theta) = \log(p(X \mid \theta)) = \log\Big(\prod_{i=1}^{M} p(x_i \mid \theta)\Big) = \sum_{i=1}^{M} \log\Big(\sum_{j=1}^{k} p(e_j)\, p(x_i \mid e_j, \theta)\Big) $$

  – X represents the corpus of news articles.
  – M and k are the number of news articles and the number of events.
• Given an event e_j, the four kinds of information of the i-th article are conditionally independent:

$$ p(x_i \mid e_j) = p(time_i \mid e_j)\, p(persons_i \mid e_j)\, p(locations_i \mid e_j)\, p(keywords_i \mid e_j) $$

• The EM algorithm is generally applied to maximize the log-likelihood.
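The mixture log-likelihood above can be evaluated as in the following sketch. It assumes the per-article, per-event values log p(x_i|e_j) have already been computed; the names `log_priors` and `log_cond` and the toy numbers are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(log_priors, log_cond):
    """l(X) = sum_i log sum_j p(e_j) p(x_i|e_j).

    log_priors[j] = log p(e_j); log_cond[i][j] = log p(x_i|e_j).
    """
    return sum(
        log_sum_exp([lp + lc for lp, lc in zip(log_priors, row)])
        for row in log_cond
    )

# Toy example: two articles, two events (values invented for illustration).
log_priors = [math.log(0.5), math.log(0.5)]
log_cond = [
    [math.log(0.2), math.log(0.4)],
    [math.log(0.6), math.log(0.2)],
]
ll = log_likelihood(log_priors, log_cond)
```

For this toy input the two per-article mixture probabilities are 0.3 and 0.4, so the result equals log(0.12).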

Maximize log-likelihood

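• E-step

$$ p(e_j \mid x_i)^{(t+1)} = \frac{p(e_j)^{(t)}\, p(x_i \mid e_j)^{(t)}}{p(x_i)^{(t)}} = \frac{p(e_j)^{(t)}\, p(x_i \mid e_j)^{(t)}}{\sum_{r=1}^{k} p(e_r)^{(t)}\, p(x_i \mid e_r)^{(t)}} $$

• M-step (update parameters)

$$ p(w_n \mid e_j)^{(t+1)} = \frac{1 + \sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}\, tf(i,n)}{N + \sum_{i=1}^{M} \Big( p(e_j \mid x_i)^{(t+1)} \sum_{s=1}^{N} tf(i,s) \Big)} $$

  – x_i: news article i
  – w_n: word n (for example, person = Mary)
  – N: vocabulary size
  – M: number of all articles
  – tf(i,n): count of entity w_n in x_i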
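The E-step and the smoothed unigram M-step above can be sketched in a few lines. This is an illustration under simplifying assumptions: likelihoods are passed in as plain probabilities rather than logs, only one term model is updated, and the names `cond`, `tf` and the toy values are hypothetical.

```python
def e_step(priors, cond):
    """Posteriors p(e_j|x_i) from priors p(e_j) and cond[i][j] = p(x_i|e_j)."""
    post = []
    for row in cond:
        joint = [p * c for p, c in zip(priors, row)]
        total = sum(joint)  # this is p(x_i)
        post.append([v / total for v in joint])
    return post

def m_step_unigram(post, tf, j):
    """Smoothed update of p(w_n|e_j); tf[i][n] = count of term n in article i."""
    n_terms = len(tf[0])
    n_articles = len(tf)
    num = [
        1.0 + sum(post[i][j] * tf[i][n] for i in range(n_articles))
        for n in range(n_terms)
    ]
    den = n_terms + sum(post[i][j] * sum(tf[i]) for i in range(n_articles))
    return [v / den for v in num]

# Toy example: two articles, two events, three terms (values invented).
priors = [0.5, 0.5]
cond = [[0.2, 0.6], [0.3, 0.1]]
post = e_step(priors, cond)
tf = [[2, 0, 1], [0, 3, 1]]
word_probs = m_step_unigram(post, tf, j=0)
```

The "1 +" in the numerator and "N +" in the denominator act as add-one smoothing, so every term keeps a non-zero probability and the updated distribution still sums to one.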

Maximize log-likelihood

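• M-step
  – Parameters of the GMM

Mean:

$$ \mu_j^{(t+1)} = \frac{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}\, time_i}{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}} $$

Variance:

$$ \sigma_j^{2\,(t+1)} = \frac{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}\, (time_i - \mu_j^{(t+1)})^2}{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}} $$

Event prior:

$$ p(e_j)^{(t+1)} = \frac{\sum_{i=1}^{M} p(e_j \mid x_i)^{(t+1)}}{M} $$

  – M: number of all articles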
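The GMM updates above are posterior-weighted averages, which the following sketch makes explicit. The function name `m_step_time` and the toy data are hypothetical; a practical implementation would probably also need a lower bound on the variance, which is an assumption on our part and not something the slides mention.

```python
def m_step_time(post, times, j):
    """Update mean, variance and prior of event j from posteriors post[i][j]."""
    weights = [row[j] for row in post]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, times)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, times)) / total
    prior = total / len(times)
    return mean, var, prior

# Toy example: four articles, two events (values invented for illustration).
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
times = [1.0, 3.0, 10.0, 2.0]
mean, var, prior = m_step_time(post, times, j=0)
```

Here event 0 receives a total weight of 2.5 out of 4 articles, giving a mean of 2.0, a variance of 0.8 and a prior of 0.625.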

How many events?

• We assume that only the salient peaks correspond to events.
  – The initial estimate of the number of events can be set as the number of peaks
    • Use a hill-climbing approach to detect all peaks
    • Compute a salient score for each of them
    • The top 20% of peaks are defined as salient peaks.
    • Splitting/merging initial peaks
• To detect salient peaks, we define salient scores for peaks as:

$$ score(peak) = left(peak) + right(peak) $$
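The peak-selection steps above can be sketched as follows. This is an illustration and not the authors' implementation: `find_peaks` uses a plain local-maximum test as a stand-in for hill climbing, the score is assumed to be the sum of `left` and `right`, and each of those is assumed to be the distance to the nearest higher peak on that side (falling back to the distance to the series boundary). The slide does not spell out these definitions, and the `counts` series is invented.

```python
import math

def find_peaks(counts):
    """Indices that are strict local maxima (stand-in for hill climbing)."""
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if c > left and c > right:
            peaks.append(i)
    return peaks

def salient_score(counts, peaks, p):
    """Assumed score: distance to the nearest higher peak on each side."""
    left = next(
        (p - q for q in reversed(peaks) if q < p and counts[q] > counts[p]),
        p,  # no higher peak on the left: distance to series start
    )
    right = next(
        (q - p for q in peaks if q > p and counts[q] > counts[p]),
        len(counts) - 1 - p,  # no higher peak on the right: distance to end
    )
    return left + right

def salient_peaks(counts, fraction=0.2):
    """Keep the top `fraction` of peaks by salient score (at least one)."""
    peaks = find_peaks(counts)
    ranked = sorted(
        peaks, key=lambda p: salient_score(counts, peaks, p), reverse=True
    )
    keep = max(1, math.ceil(fraction * len(peaks)))
    return sorted(ranked[:keep])

# Hypothetical daily article counts.
counts = [0, 2, 1, 9, 1, 3, 1, 0, 4, 0, 1, 0]
peaks = find_peaks(counts)
top = salient_peaks(counts)
```

With five detected peaks and a 20% cut-off, a single peak is kept. The number of salient peaks would then serve as the initial estimate of k.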

Splitting/merging initial salient peaks

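• MDL (Minimum Description Length)

$$ k = \arg\max_{k} \Big( \log(p(X;\theta_k)) - \frac{m_k}{2} \log(M) \Big) $$

$$ m_k = 3k - 1 + k(N_p - 1) + k(N_l - 1) + k(N_n - 1) $$

  – The second term, (m_k / 2) log(M), is the penalty.
  – M: number of all articles
  – N_p: person vocabulary size (N_l and N_n are the corresponding vocabulary sizes for locations and keywords)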
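A small sketch of the MDL selection rule above. It assumes the model has already been fitted for each candidate k, so that a log-likelihood is available per candidate; the function names and the numbers in `log_liks` are hypothetical.

```python
import math

def num_free_params(k, n_p, n_l, n_n):
    """m_k = 3k - 1 + k(N_p - 1) + k(N_l - 1) + k(N_n - 1)."""
    return 3 * k - 1 + k * (n_p - 1) + k * (n_l - 1) + k * (n_n - 1)

def mdl_score(log_lik, k, n_articles, n_p, n_l, n_n):
    """Penalized log-likelihood: log p(X; theta_k) - (m_k / 2) log M."""
    penalty = 0.5 * num_free_params(k, n_p, n_l, n_n) * math.log(n_articles)
    return log_lik - penalty

def choose_k(log_liks, n_articles, n_p, n_l, n_n):
    """log_liks maps each candidate k to its fitted log-likelihood."""
    return max(
        log_liks,
        key=lambda k: mdl_score(log_liks[k], k, n_articles, n_p, n_l, n_n),
    )

# Hypothetical fitted log-likelihoods for three candidate values of k.
log_liks = {2: -5200.0, 3: -4900.0, 4: -4880.0}
best_k = choose_k(log_liks, n_articles=1000, n_p=20, n_l=10, n_n=50)
```

In this toy case k = 4 has the highest raw log-likelihood, but the penalty grows with the number of free parameters, so k = 3 is selected.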

Event summarization

• Maximum a Posteriori (MAP)

$$ y_i = \arg\max_{j} (p(e_j \mid x_i)) $$

  – y_i is the label of news article x_i
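The MAP assignment above amounts to taking, for each article, the event with the largest posterior. A minimal sketch, with a hypothetical posterior matrix `post` where `post[i][j]` stands for p(e_j|x_i):

```python
def map_labels(post):
    """y_i = argmax_j p(e_j|x_i) for each article i."""
    return [max(range(len(row)), key=row.__getitem__) for row in post]

# Toy posteriors for three articles and three events (values invented).
post = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.25, 0.5, 0.25],
]
labels = map_labels(post)
```

Ties are resolved in favor of the lowest event index, which is a property of Python's `max` and not something the slides specify.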

Algorithm summary

Multi-modal RED algorithm application

• HISCOVERY system
  – HISCOVERY (HIStory disCOVERY)
  – Two useful functions
    • Photo Story
    • Chronicle
  – News articles come from 12 news sites (such as CNN, MSNBC, BBC…)

HISCOVERY system

Experimental methods

• Data
  – TDT
    • Benchmarks for event detection.
  – TDT4
    • Used to run the experiments
    • Contains 80 events annotated from 28,500 news articles.
    • These articles were collected from the period 2000/10 to 2001/1.
• Each year's reports can be regarded as an event.
• Extracting named entities
  – Extracted by the BBN NLP tool, which can extract seven types of named entities.

Experimental design

• To compare the approach with other algorithms:
  – Group Average Clustering (GAC)
    • It is the best algorithm in TDT evaluations.
    • A hierarchical clustering method
• Baseline
  – kNN algorithm

Results

• The probabilistic model achieves the best results, but the improvements are not significant.

Results

• Named entities

result

result

result

39 events

result

46 events

Conclusion

• Studied two characteristics of news articles and events.
• Proposed a multi-modal RED algorithm.
• Future work:
  – Use suitable dynamic models to model news events.
    • HMM
    • ICA (independent component analysis)