Page 1
UMass at TDT 2000

James Allan and Victor Lavrenko (with David Frey and Vikas Khandelwal)

Center for Intelligent Information Retrieval, Department of Computer Science

University of Massachusetts, Amherst

Page 2

Work on Story Link Detection

• Active work on SLD
  – Not ready in time for official submission

• Story “smoothing” using query expansion

• Score normalization based on language pair

Page 3

What is LCA?

• Local Context Analysis
  – Query expansion technique from IR
  – More stable than other "pseudo RF" approaches
  – Applications beyond document retrieval
• Basic idea (see the sketch below)
  – Retrieve a set of passages similar to the query
  – Mine those passages for words near the query terms
    • Ad-hoc weighting designed to do that
  – Add the words to the query and re-run
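
As a concrete illustration of the basic idea, here is a minimal Python sketch of an LCA-style expansion loop. The `retrieve` function, the co-occurrence scoring, and the parameter values are illustrative assumptions; actual LCA (Xu and Croft) uses a more elaborate ad-hoc term-weighting formula.

```python
from collections import Counter

def lca_expand(query_terms, retrieve, n_passages=20, n_terms=10):
    """Pseudo-relevance-feedback loop in the LCA style.

    `retrieve(terms, k)` is a hypothetical function returning the k
    passages (token lists) most similar to the query."""
    passages = retrieve(query_terms, n_passages)
    qset = set(query_terms)

    # Stand-in for LCA's term weighting: reward terms that co-occur
    # with many query terms in the retrieved passages.
    scores = Counter()
    for passage in passages:
        hits = sum(1 for t in passage if t in qset)
        if hits == 0:
            continue
        for t in passage:
            if t not in qset:
                scores[t] += hits

    expansion = [t for t, _ in scores.most_common(n_terms)]
    # Add the expansion words to the query and re-run retrieval.
    return retrieve(query_terms + expansion, n_passages)
```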

Page 4

LCA for story smoothing

• Convert story to a weighted vector
  – INQUERY weights (incl. Okapi tf component)
• Select the top 100 most highly weighted terms
• Find the top 20 most similar stories (cosine)
• Weight all terms in the top 20 stories (LCA)
• Select the top 100 LCA expansion terms
• Add them to the story (weights decaying from 1.0)
• Story now represented by 100-200 terms
• Compare smoothed story vectors (see the sketch below)
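
A rough Python sketch of this pipeline, for illustration only: the INQUERY belief-weight formula is the standard published one, but the simplified LCA term scoring, the linear weight decay, and all function names are assumptions rather than the exact system used here.

```python
import math
from collections import Counter

def inquery_weight(tf, dl, avg_dl, df, n_docs):
    """INQUERY-style belief weight: Okapi tf component times normalized idf."""
    tf_part = tf / (tf + 0.5 + 1.5 * dl / avg_dl)
    idf_part = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1.0)
    return 0.4 + 0.6 * tf_part * idf_part

def story_vector(tokens, df, n_docs, avg_dl, top_k=100):
    """Convert a story (token list) to a weighted vector, keeping the
    top_k most highly weighted terms."""
    counts = Counter(tokens)
    dl = len(tokens)
    weights = {t: inquery_weight(c, dl, avg_dl, df.get(t, 1), n_docs)
               for t, c in counts.items()}
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def smooth_story(vec, other_vecs, n_similar=20, n_expansion=100):
    """Smooth one story vector with LCA-style expansion terms mined from
    the n_similar most similar stories (the story itself should not be
    in other_vecs)."""
    ranked = sorted(other_vecs, key=lambda v: cosine(vec, v), reverse=True)

    # Simplified stand-in for LCA weighting: score each candidate term
    # by its weight in a similar story, scaled by that story's
    # similarity to the story being smoothed.
    cand = Counter()
    for v in ranked[:n_similar]:
        sim = cosine(vec, v)
        for t, w in v.items():
            if t not in vec:
                cand[t] += sim * w

    smoothed = dict(vec)
    for rank, (t, _) in enumerate(cand.most_common(n_expansion)):
        # Expansion terms get weights decaying (here: linearly) from 1.0.
        smoothed[t] = 1.0 - rank / float(n_expansion)
    return smoothed

# Link score between two stories = cosine of their smoothed vectors:
# score = cosine(smooth_story(v1, corpus), smooth_story(v2, corpus))
```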

Page 5

Smoothing SLD with LCA

• Run on training data (English)

• Green line: no smoothing

• Blue line: smoothing with past stories

• Pink line: smoothing with the whole corpus ("cheating")

Page 6

Work on Story Link Detection

• Story “smoothing” using query expansion

• Score normalization based on language pair

Page 7

Score normalization

• Noticed that SYSTRAN documents were throwing scores off substantially
  – Multilingual SLD was much worse than English-only

• Look at distribution of scores in same-topic and different-topic pairs

Page 8

Score distributions, same topic

[Histograms of link scores for same-topic pairs, by language pair: EE, MM, ME]

Page 9

Score distributions, diff topic

[Histograms of link scores for different-topic pairs, by language pair: EE, MM, ME]

Page 10

Clearly need to normalize

• SYSTRAN stories use different vocabulary
  – Stories are much more likely to be alike
  – And much less likely to be like true English

• Develop normalization based on whether the pair is within- or cross-language

• Convert scores into probabilities (see the sketch below)
  – Use distribution plots for each case
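
A minimal sketch of that conversion, assuming raw link scores are cosine similarities in [0, 1] and that held-out same-topic and different-topic scores are available separately for each language-pair condition (EE, MM, ME). Bin counts, priors, and function names are illustrative, not the values used in the evaluation.

```python
import numpy as np

def build_normalizer(same_scores, diff_scores, bins=20, prior=0.5):
    """Map a raw link score to P(same topic | score) for one language-pair
    condition, using histograms of held-out same-topic and different-topic
    scores for that condition."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p_same, _ = np.histogram(same_scores, bins=edges, density=True)
    p_diff, _ = np.histogram(diff_scores, bins=edges, density=True)

    def normalize(score):
        i = int(np.clip(np.searchsorted(edges, score, side="right") - 1,
                        0, bins - 1))
        num = prior * p_same[i]
        den = num + (1.0 - prior) * p_diff[i]
        return prior if den == 0 else num / den  # Bayes rule per bin

    return normalize

# One normalizer per language-pair condition; the condition of the story
# pair decides which mapping is applied to its raw score.
# normalizers = {c: build_normalizer(same[c], diff[c]) for c in ("EE", "MM", "ME")}
# prob = normalizers[condition(story1, story2)](raw_score)
```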

Page 11

Combined distribution (before normalization)

[Plot: combined score distributions for same-topic and different-topic pairs]

Page 12

After normalization (on same data: "cheating")

[Plot: score distributions for same-topic and different-topic pairs after normalization; scores are now probabilities]

Page 13

DET plots from normalization

• Huge change in distributions

• Less pronounced change in DET plot

Page 14

Conclusions

• Story smoothing with LCA works
  – Need to "smooth" with all stories, not just earlier ones
  – Need to use different matching for smoothing and for the story-story comparison

• Score normalization has potential
  – Other sites have found similar effects
  – Experiments on source type (audio, newswire) within language pairs have been inconclusive
    • Not much training data for doing the conversion