
Empirical Development of an Exponential Probabilistic Model

Using Textual Analysis to Build a Better Model

Jaime Teevan & David R. Karger
CSAIL (LCS+AI), MIT

Goal: Better Generative Model

• Generative vs. discriminative models
• Applies to many applications
  Information retrieval (IR)
  Relevance feedback
  Using unlabeled data
  Classification
• Assumptions explicit

Using a Model for IR

1. Define model
2. Learn parameters from query
3. Rank documents

Hyper-learn ← the step this talk adds

• A better model improves applications
  Trickles down to improve retrieval
  Classification, relevance feedback, …
• Corpus-specific models

Overview

Related work
Probabilistic models
  Example: Poisson Model
  Compare model to text
Hyper-learning the model
  Exponential framework
  Investigate retrieval performance

Conclusion and future work

Related Work

Using text to derive the retrieval algorithm [Jones, 1972], [Greiff, 1998]

Using text to model text [Church & Gale, 1995], [Katz, 1996]

Learning model parameters [Zhai & Lafferty, 2002]

Hyper-learn the model from text!

Probabilistic Models

Rank documents by RV = Pr(rel|d)

Naïve Bayesian models:

RV = Pr(rel|d) ∝ Pr(d|rel) = Π over features t of Pr(dt|rel)

Open assumptions define the model:
  Feature definition (e.g., words)
  Feature distribution family (e.g., # occurrences in doc)

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Step 1: Define the model. The Poisson Model:

Pr(dt|rel) = θ^dt · e^(−θ) / dt!

θ: specifies the term's distribution

Example Poisson Distribution

[Figure: Poisson distribution with θ = 0.0006. X-axis: term occurs exactly dt times (0–5); y-axis: Pr(dt|rel), log scale from 1E-19 to 0.1. The marked point has Pr(dt|rel) ≈ 1E-15.]
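Since the deck walks through the Poisson Model concretely, here is a minimal sketch of it in Python (our choice of language; poisson_pr is a hypothetical helper name, and θ = 0.0006 is the slide's example parameter):

```python
import math

def poisson_pr(d_t, theta):
    """Pr(dt|rel) under the Poisson Model: theta^dt * e^(-theta) / dt!"""
    return theta ** d_t * math.exp(-theta) / math.factorial(d_t)

# The slide's example parameter:
theta = 0.0006
for d_t in range(6):
    print(d_t, poisson_pr(d_t, theta))   # probabilities fall off very fast
```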

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Step 2: Learn a θ for each term
  Maximum likelihood θ: the term's average number of occurrences
  Incorporate prior expectations
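A sketch of step 2. Maximum-likelihood θ is the average count, as the slide says; the pseudo-count prior below is our illustrative assumption for "incorporate prior expectations":

```python
def learn_theta(term_counts, prior_mean=0.01, prior_weight=1.0):
    """Maximum-likelihood theta for the Poisson Model is the term's
    average number of occurrences over the labeled documents; the
    pseudo-count prior (our illustrative assumption) pulls the estimate
    toward prior_mean when there is little data."""
    n = len(term_counts)
    return (sum(term_counts) + prior_weight * prior_mean) / (n + prior_weight)

# e.g., a term seen 0, 1, and 2 times in three relevant documents:
# learn_theta([0, 1, 2]) -> roughly 0.75
```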

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Step 3: Rank documents

For each document, find RV = Π over words t of Pr(dt|rel)
Sort documents by RV
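A sketch of step 3. Multiplying many tiny probabilities underflows in practice, so this sketch sums log-probabilities instead; rank_documents is hypothetical and reuses poisson_pr from the sketch above:

```python
import math

def rank_documents(docs, thetas):
    """docs: list of {term: count} maps; thetas: {term: theta} learned
    in step 2. Ranks by log RV = sum over modeled terms t of
    log Pr(dt|rel), with dt = 0 for terms absent from the document."""
    def log_rv(doc):
        return sum(math.log(poisson_pr(doc.get(t, 0), theta))
                   for t, theta in thetas.items())
    return sorted(docs, key=log_rv, reverse=True)
```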

Using a Naïve Bayesian Model

1. Define model
2. Learn parameters from query
3. Rank documents

Which step goes wrong?

How Good is the Model?

The problem is in step 1: does the defined model match real text?

Pr(dt|rel) = θ^dt · e^(−θ) / dt!

[Figure: empirical data vs. the Poisson Model (θ = 0.0006), with Pr(dt|rel) on a log scale from 1E-19 to 0.1 against the number of times a term occurs. For a term occurring 15 times, the Poisson prediction falls far below the data. Misfit!]

Hyper-learning a Better Fit Through Textual Analysis

Using an Exponential Framework

Hyper-Learning Framework

Need a framework for hyper-learning

Goal: same benefits as the Poisson Model
  One parameter
  Easy to work with (e.g., prior)

Candidate families: Bernoulli, Poisson, Normal, Mixtures

Chosen framework: one-parameter exponential families (which include Bernoulli, Poisson, and Normal)

Hyper-Learning Framework

Well understood, learning easy [Bernardo & Smith, 1994], [Gous, 1998]

Pr( dt | rel ) = f(dt) g(θ) e

Functions f(dt) and h(dt) specify family E.g., Poisson: f(dt) = (dt!)-1, h(dt) = dt

Parameter θ term’s specific distribution

Exponential Framework

θ h(dt)
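The exponential form is easy to state in code. Note that for the Poisson instance, the θ in this formula must be read as the natural parameter (the log of the Poisson rate) for the algebra to work out, which makes the normalizer g(θ) = exp(−e^θ); that reading is our inference from the formula, not stated on the slides:

```python
import math

def expfam_pr(d_t, theta, f, h, g):
    """Pr(dt|rel) = f(dt) * g(theta) * e^(theta * h(dt))."""
    return f(d_t) * g(theta) * math.exp(theta * h(d_t))

# Poisson as an instance: f(dt) = 1/dt!, h(dt) = dt, and with theta the
# natural parameter log(rate), the normalizer is g(theta) = exp(-exp(theta)).
poisson = dict(f=lambda d: 1 / math.factorial(d),
               h=lambda d: d,
               g=lambda th: math.exp(-math.exp(th)))

rate = 0.0006
print(expfam_pr(2, math.log(rate), **poisson))  # equals rate^2 * e^(-rate) / 2!
```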

Using a Hyper-learned Model

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Step 1: Hyper-learn the model

Want the "best" f(dt) and h(dt)
  Iterative hill climbing
  Local maximum
  Poisson starting point

Data: TREC query result sets
  Past queries to learn about future queries
  Hyper-learn and test with different sets
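A heavily simplified sketch of the hyper-learning step. The slides specify only "iterative hill climbing from a Poisson starting point"; everything else here is our illustrative assumption: h is represented by its values at counts 0..K, perturbed coordinate-wise, and candidates are scored by training log-likelihood with f folded into a uniform base measure:

```python
import math

def log_likelihood(h_vals, data):
    """Score a candidate h (its values at counts 0..K) by training
    log-likelihood. Illustrative simplifications: f is folded into a
    uniform base measure, the count support is truncated at K, and each
    term's theta is fit by a coarse grid search."""
    total = 0.0
    for counts in data:                       # counts: one term's dt values
        best = -math.inf
        for theta in (-9.0, -7.0, -5.0, -3.0, -1.0):
            log_z = math.log(sum(math.exp(theta * h) for h in h_vals))
            ll = sum(theta * h_vals[min(c, len(h_vals) - 1)] - log_z
                     for c in counts)
            best = max(best, ll)
        total += best
    return total

def hyper_learn(data, K=5, steps=200, eps=0.05):
    """Coordinate-wise hill climbing over h, from the Poisson start h(dt) = dt."""
    h = [float(d) for d in range(K + 1)]      # Poisson starting point
    score = log_likelihood(h, data)
    for _ in range(steps):
        improved = False
        for i in range(len(h)):
            for delta in (eps, -eps):
                cand = h[:i] + [h[i] + delta] + h[i + 1:]
                s = log_likelihood(cand, data)
                if s > score:
                    h, score, improved = cand, s, True
        if not improved:                      # stop at a local maximum
            break
    return h
```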

Recall the Poisson Distribution

[Figure: Data vs. Poisson vs. New Model; x-axis: term occurs exactly dt times (0–5); y-axis: Pr(dt|rel), log scale from 1E-19 to 0.1; marked point at a term occurring 15 times.]

Poisson Starting Point vs. Hyper-learned Model - h(dt)

Pr(dt|rel) = f(dt) g(θ) e^(θ h(dt))

[Figure: h(dt) plotted for dt = 0–5 (y-axis −2 to 6): the Poisson starting point h(dt) = dt and the hyper-learned h.]


Hyper-learned Distribution

[Figures: Data vs. Poisson vs. New Model, with Pr(dt|rel) on a log scale from 1E-19 to 0.1 against the number of times a term occurs; shown for terms occurring 15, 5, 30, and 300 times.]

Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Step 2: Learn a θ for each term from the labeled docs

Pr(dt|rel) = f(dt) g(θ) e^(θ h(dt))

Learning θ

Sufficient statistics summarize all the observed data:
  τ1: # of observations
  τ2: Σ over observations d of h(dt)

Incorporating a prior is easy
Map τ1 and τ2 to θ

(Experiments use 20 labeled documents.)
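A sketch of learning θ from the sufficient statistics. The slides name τ1 and τ2 and say a prior is easy to incorporate; the pseudo-observation prior and the grid-search mapping from (τ1, τ2) to θ below are our illustrative assumptions:

```python
import math

def learn_theta_from_stats(counts, h, prior_tau1=1.0, prior_tau2=0.0,
                           support=range(6)):
    """tau1 = # of observations; tau2 = sum of h(dt) over observations.
    The conjugate prior just adds pseudo-observations to (tau1, tau2).
    theta is chosen to maximize the exponential-family log-likelihood
    tau2*theta - tau1*log Z(theta); f(dt) drops out as a constant in theta.
    The coarse grid search and truncated support are illustrative."""
    tau1 = len(counts) + prior_tau1
    tau2 = sum(h(c) for c in counts) + prior_tau2
    def objective(theta):
        log_z = math.log(sum(math.exp(theta * h(d)) for d in support))
        return tau2 * theta - tau1 * log_z
    return max((t / 10.0 for t in range(-100, 1)), key=objective)
```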

Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Results: Labeled Documents

[Figure: precision-recall curves (precision axis 0–0.8) comparing the Poisson Model and the New Model on the labeled-documents task.]

Performing Retrieval

1. Hyper-learn model
2. Learn parameters from query
3. Rank documents

Retrieval: Query

Short query: the query is treated as a single labeled document
This gives a vector space-like equation:

RV = Σ over terms t in the doc of a(t, d) + Σ over terms q in the query of b(q, d)

Problem: the document portion dominates the short query
Solution: use only the query portion
Another solution: normalize
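A sketch of the query-time scoring and the "use only the query portion" fix. The weight functions a and b are whatever the hyper-learned model induces; passing them in as callables is our assumption for illustration:

```python
def retrieval_value(doc_terms, query_terms, a, b, query_only=True):
    """RV = sum over t in doc of a(t, d) + sum over q in query of b(q, d).
    A long document contributes many a-terms, so the first sum dominates
    the short query; per the slides, one fix is to keep only the query
    portion (another would be to normalize the document portion)."""
    doc_part = 0.0 if query_only else sum(a(t, doc_terms) for t in doc_terms)
    query_part = sum(b(q, doc_terms) for q in query_terms)
    return doc_part + query_part
```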

Retrieval: Query

[Figure: precision-recall curves (precision axis 0–0.6) comparing the Poisson Model, the New Model, and TF.IDF on the query task.]

Conclusion

Probabilistic models
  Example: Poisson Model - easy to work with, but a bad text model (real text is heavy tailed!)

Hyper-learning the model
  Exponential framework
  Learned a better model
  Investigated retrieval performance - better …

Future Work

Use the model better
Use it for other applications
  Other IR applications
  Classification
Correct for document length
Hyper-learn on different corpora
  Test if the learned model generalizes
  Different for genre? Language? People?
Hyper-learn the model better

Questions?

Contact us with questions:

Jaime Teevan - teevan@ai.mit.edu
David Karger - karger@theory.lcs.mit.edu