Topic Detection and Tracking : Event Clustering as a Basis for First Story Detection AI-Lab Jung Sung Won

Page 1:

Topic Detection and Tracking : Event Clustering as a Basis for First Story Detection

AI-Lab Jung Sung Won

Page 2:

Abstract

Topic Detection and Tracking (TDT)
- The organization of information by event rather than by subject

In this paper
- Overview of the TDT research program
- Discussion of our approach to two of the TDT problems
  - Event clustering (detection)
  - First story detection

Page 3:

Introduction

Information retrieval research
- Texts are usually indexed, retrieved, and organized on the basis of their subjects.

This research
- Focuses on the events described by the text rather than the broader subject it covers.
  - What is the major event discussed within this story?
  - Do these texts discuss the same event?
- Not all texts can be reduced to a set of events.
- This work necessarily applies only to texts that have an event focus: announcements, news.

Page 4:

Topic Detection and Tracking

Purpose of the research
- To organize broadcast news stories by event
- Story sources: television, radio, cable broadcasts
- Automatic speech recognition (ASR) is required

TDT research was carried out in three phases
- TDT-1: mid-1996 to 1997
- TDT-2: 1998
- TDT-3: 1999

Page 5:

TDT-1, The Pilot Study (1/2)

A proof-of-concept effort

First project: definition of the problem
- Event: some unique thing that happens at some point in time (Allan et al., 1998a)
- Ex) “computer virus detected at British Telecom, March 3, 1993” (a specific event) ↔ “computer virus outbreaks” (a general class of events)

Definition of three research problems
- Segmentation
- Detection
  - Event clustering
  - First story detection
- Tracking

Page 6:

TDT-1, The Pilot Study (2/2)

- Created a small evaluation corpus
  - 15,683 news stories (CNN, Reuters)
- Generated a set of 25 news topics
- Employed a two-pronged method for assigning relevance judgments between topics and stories
  - Group 1: read every story in the corpus
  - Group 2: used a search engine to look for stories on each of the 25 topics

Page 7:

TDT-2, A Full Evaluation

The primary goal
- Create a full-scale evaluation of the TDT tasks begun in the pilot study.

Changes
- The two detection tasks were “merged” to create an on-line version of the event clustering task.

Differences from TDT-1
- Stories are processed in small groups
- A revised evaluation method is used (Detection Error Tradeoff graphs)
- A larger corpus is used

Page 8:

TDT-3, Multi-Lingual TDT

Differences from TDT-2
- The scope of the events is defined differently, taking the task into account
- Introduction of multi-lingual sources
- New evaluation corpus

Page 9:

On-Line Clustering Algorithms

Previous clustering work
- Retrospective environment

In this study
- On-line solution to first story detection

Steps of clustering
- Converting stories to vectors
- Comparing stories and clusters
- Applying a threshold to determine whether clusters are sufficiently similar

Classifier: the combination of a vector and a threshold
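The three steps above can be sketched as a single-pass loop. This is only an illustration: the bag-of-words vector, the cosine comparison, and the fixed threshold of 0.3 are stand-in assumptions, not the INQUERY weighting or the thresholds used in the study.

```python
from collections import Counter
from math import sqrt


def to_vector(text):
    """Step 1: convert a story to a bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Step 2: compare two vectors; 0.0 means no shared terms."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def single_pass(stories, threshold=0.3):
    """Assign each story, in arrival order, to a cluster or open a new one."""
    clusters = []  # each cluster is kept as a summed term vector
    labels = []    # index of the cluster chosen for each story
    for text in stories:
        vec = to_vector(text)
        scores = [cosine(vec, c) for c in clusters]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        # Step 3: the threshold decides between merging and starting a new cluster.
        if best is not None and scores[best] >= threshold:
            clusters[best] += vec
            labels.append(best)
        else:
            clusters.append(vec)
            labels.append(len(clusters) - 1)
    return labels
```

Each story is seen once and never revisited, which is what separates this on-line setting from retrospective clustering.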

Page 10:

Related Clustering Work

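Existing clustering techniques
- Agglomerative hierarchical clustering
- Probabilistic approaches
- These techniques assume that the items to be clustered are available in advance

Limitations in an on-line environment
- The items to be clustered are not available in advance
- Some algorithms require the number of clusters to be fixed, but in an on-line setting the number of clusters is unknown
- An algorithm capable of single-pass clustering is needed → several such algorithms already exist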

Page 11:

Creating Story Vectors

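Weight vectors are built using INQUERY:

$$tf_k = \frac{t}{t + 0.5 + 1.5 \cdot \frac{dl}{avg\_dl}}$$

$$idf_k = \frac{\log\left(\frac{C + 0.5}{df}\right)}{\log(C + 1)}$$

$$d_{j,k} = 0.4 + 0.6 \cdot tf_k \cdot idf_k$$

where
- t: number of times the lexical feature occurs in the story
- dl: the story's length in words
- avg_dl: average number of terms in a story
- C: number of stories in the auxiliary corpus
- df: number of stories in which the term appears (if df = 0, df is set to 1)
- k: index of a word that appears in both the classifier and the story
- d_{j,k}: weight of feature k for the story arriving at time j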
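As a rough sketch, the weighting can be written as three small functions. The reading used here is tf = t / (t + 0.5 + 1.5·dl/avg_dl), idf = log((C + 0.5)/df) / log(C + 1), and d = 0.4 + 0.6·tf·idf; the function and argument names are illustrative and are not INQUERY's API.

```python
from math import log


def tf(t, dl, avg_dl):
    """Term-frequency component, damped by story length relative to the average."""
    return t / (t + 0.5 + 1.5 * dl / avg_dl)


def idf(C, df):
    """Inverse document frequency, scaled by the size C of the auxiliary corpus."""
    df = max(df, 1)  # df = 0 is treated as df = 1, as the slide specifies
    return log((C + 0.5) / df) / log(C + 1)


def weight(t, dl, avg_dl, C, df):
    """Feature weight d_{j,k} for one term of one story."""
    return 0.4 + 0.6 * tf(t, dl, avg_dl) * idf(C, df)
```

For a term that occurs once in a story of average length, tf is 1/3, so even a very rare term receives a weight of at most about 0.6; the weight approaches 1.0 only as the term's count grows.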

Page 12:

Comparing Clusters

Comparing a story to a cluster, or the contents of two clusters
- Single-link, complete-link, group-average

$$sim(q_i, d_j) = \frac{\sum_{k=1}^{N} q_{i,k} \cdot d_{j,k}}{\sum_{k=1}^{N} q_{i,k}}$$
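The comparison can be sketched as follows, reading the similarity as the sum of q_{i,k}·d_{j,k} divided by the sum of q_{i,k}. Two assumptions are made here and are not stated on the slide: the sum runs over the classifier's features, and a feature absent from the story takes the weight 0.4 (the value of 0.4 + 0.6·tf·idf when the term count is zero). The `linkage` argument and the list-of-classifiers cluster representation are likewise hypothetical.

```python
def sim(q, d, missing=0.4):
    """Weighted-sum similarity between classifier q and story d (dicts: term -> weight)."""
    total = sum(q.values())
    if total == 0:
        return 0.0
    # Assumption: a classifier feature absent from the story counts with weight `missing`.
    return sum(w * d.get(k, missing) for k, w in q.items()) / total


def cluster_similarity(story, cluster, linkage="single"):
    """Compare a story to a cluster given as a non-empty list of classifier vectors."""
    scores = [sim(q, story) for q in cluster]
    if linkage == "single":    # best-matching member decides
        return max(scores)
    if linkage == "complete":  # worst-matching member decides
        return min(scores)
    if linkage == "average":   # group average over all members
        return sum(scores) / len(scores)
    raise ValueError(f"unknown linkage: {linkage}")
```

Single-link needs only one close member to accept a story, so it tends to produce the loosest clusters of the three; complete-link is the strictest.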

Page 13:

Thresholds for Merging

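Why a threshold is used
- The decision to generate a new cluster
- Clustering for first story detection
- Ex) with a threshold of 0.5

Time-based thresholds
- Take the temporal properties of real news into account
- For a classifier computed at time i, the threshold for a story arriving at time j is:

$$threshold(q_i, d_j) = 0.4 + \theta \cdot (sim(q_i, d_j) - 0.4) + \beta \cdot (date_j - date_i)$$

Here θ and β stand for the two weighting coefficients, whose symbols did not survive in this transcript.

Decision scores

$$decision(q_i, d_j) = sim(q_i, d_j) - threshold(q_i, d_j)$$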
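A minimal sketch of the time-based threshold and decision score, reading the slide's formulas as threshold = 0.4 + θ·(sim − 0.4) + β·(date_j − date_i) and decision = sim − threshold. The names `theta` and `beta` and their default values are assumptions chosen for illustration; the slide does not give them.

```python
def threshold(sim_value, date_i, date_j, theta=0.2, beta=0.001):
    """Threshold for a story arriving at date_j against a classifier built at date_i.

    theta and beta are hypothetical coefficients; dates are numbers (e.g. day indices).
    """
    return 0.4 + theta * (sim_value - 0.4) + beta * (date_j - date_i)


def decision(sim_value, date_i, date_j, theta=0.2, beta=0.001):
    """Positive score: the story matches the classifier. Otherwise it does not."""
    return sim_value - threshold(sim_value, date_i, date_j, theta, beta)
```

With `theta=0.5` and `beta=0.01`, a similarity of 0.6 gives a positive score when both stories share a date but a negative one when they are 20 days apart, so the same textual overlap is trusted less as the time gap grows.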

Page 14:

Experimental Setting – Data

Page 15:

Evaluation Measures

Measures of text classification effectiveness
- Recall and precision
- Misses: the system does not detect a new event
- False alarms: the system indicates that a story contains a new event when in truth it does not
- F1-measure (Lewis and Gale, 1994): 2PR/(P+R)

TDT cost function
- P(fa): the system false alarm rate
- P(m): the miss probability
- P(topic): the prior probability that a story is relevant to a topic
- cost_fa = cost_m = 1.0

$$Cost = cost_{fa} \cdot P(fa) \cdot (1 - P(topic)) + cost_{m} \cdot P(m) \cdot P(topic)$$
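The measures above can be computed from per-story system decisions and ground-truth labels, as in the sketch below. The function name and the dictionary it returns are illustrative, and `p_topic` must be supplied by the caller because the slide gives no value for P(topic).

```python
def evaluate(predicted, actual, p_topic, cost_fa=1.0, cost_m=1.0):
    """Compute recall, precision, F1, miss and false alarm rates, and the TDT cost.

    predicted, actual: equal-length sequences of booleans (True = "new event").
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    p_miss = fn / (tp + fn) if tp + fn else 0.0
    p_fa = fp / (fp + tn) if fp + tn else 0.0
    cost = cost_fa * p_fa * (1 - p_topic) + cost_m * p_miss * p_topic
    return {"recall": recall, "precision": precision, "f1": f1,
            "p_miss": p_miss, "p_fa": p_fa, "cost": cost}
```

Because P(topic) is typically small, the false alarm term carries most of the weight in the cost, even with both costs set to 1.0.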

Page 16:

Event Clustering

Page 17:

First Story Detection

New topic: a topic whose event has not been previously reported

Motivation
- The property of time as a distinguishing feature of this domain
- The names of the people, places, dates, and things: who, what, when, where

Method
- Use the event clustering method
- If no classifier comparison results in a positive classification decision for the current story, then the current story has content not previously encountered, and thus it contains discussion of a new topic
- Difference: finding the start of each topic
- On-line single-link + time strategy
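The method can be illustrated with a small on-line loop in which every earlier story acts as its own classifier (single-link) and the bar for a match rises with the time gap. This is a sketch under loud assumptions: word-set Jaccard overlap stands in for the similarity function, and `base` and `beta` are hypothetical parameters, not values from the study.

```python
def first_stories(stories, base=0.2, beta=0.01):
    """Flag first stories in a stream of (day, text) pairs given in arrival order."""
    seen = []   # (day, word set) of every earlier story
    flags = []  # True where the story is judged to start a new topic
    for day, text in stories:
        words = set(text.lower().split())
        is_new = True
        for old_day, old_words in seen:
            union = words | old_words
            overlap = len(words & old_words) / len(union) if union else 0.0
            # Time strategy: the threshold grows with the gap between the stories.
            if overlap - (base + beta * (day - old_day)) > 0:
                is_new = False  # one positive decision is enough (single-link)
                break
        flags.append(is_new)
        seen.append((day, words))
    return flags
```

A story is flagged only when no earlier story yields a positive decision, so an identical text arriving long after the original can still be flagged as new once the time penalty exceeds the overlap.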

Page 18:

First Story Detection Experiment

Page 19:

Discussion of First Story Detection

The good news
- Low false alarm rates

The bad news
- Empirically, only incremental improvements can be expected.
- The limitation of the word co-occurrence model
- Association with topics that are heavily covered in the news