Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

Jing Jiang, Xin He, ChengXiang Zhai, University of Illinois at Urbana-Champaign


Page 1: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

Robust Pseudo Feedback & HMM Passage Extraction

UIUC at TREC 2006 Genomics Track

Jing Jiang, Xin He, ChengXiang Zhai
University of Illinois at Urbana-Champaign

Page 2: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 2

Goal of Participation

• To test the effectiveness of some recent language modeling methods for genomics retrieval
– Robust pseudo feedback [Tao & Zhai 06]
– HMM passage extraction [Jiang & Zhai 06]
• Tasks at the 2006 Genomics Track
– Document-level retrieval
– Passage-level retrieval
– Aspect-level retrieval

Page 3: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 3

Overall Approach

[Diagram: a query Q goes to the Document Retrieval Module, which ranks paragraphs of Medline articles (1, 2, …, k). The Passage Extraction Module turns the ranked paragraphs into ranked passages. Feedback is either user relevance feedback or pseudo relevance feedback.]


Page 5: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 5

KL-Divergence Retrieval Model [Lafferty & Zhai 01]

[Diagram: the topic is represented by a query language model and each document D1, D2, …, Dk by a document language model; the two models are compared by KL divergence.]

Topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2

Document model: the 0.020, for 0.015, prp 0.102, mad 0.034, cow 0.034, diseas 0.068, …

Documents D1, D2, …, Dk, with text snippets such as:
The…for… spongiform…PrP protein…
Prion diseases… that…(PrP C)…This…
…which…(PrP C)…to the…prion protein…
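As a rough illustration of the retrieval model, ranking by negative KL divergence between the topic model and a smoothed document model is rank-equivalent to the cross entropy sum of p(w|Q) log p(w|D). The sketch below assumes Dirichlet-prior smoothing (the slides do not say which smoothing was used); the function names, the toy documents, and the collection probabilities are hypothetical. Only the topic model is taken from the slide.

```python
import math
from collections import Counter

def smoothed_doc_model(doc_tokens, collection_model, mu=1000.0):
    """Return p(w|D) with Dirichlet-prior smoothing (an assumed choice)."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)

    def p(w):
        # Words missing from the collection model get a tiny floor
        # so that the logarithm below stays finite.
        return (counts[w] + mu * collection_model.get(w, 1e-9)) / (length + mu)

    return p

def kl_score(query_model, doc_tokens, collection_model, mu=1000.0):
    """Negative cross entropy, rank-equivalent to -KL(query || document)."""
    p_doc = smoothed_doc_model(doc_tokens, collection_model, mu)
    return sum(q * math.log(p_doc(w)) for w, q in query_model.items() if q > 0)

# Topic model from the slide; everything below it is toy data.
query_model = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
collection_model = {"the": 0.02, "for": 0.01, "role": 0.001, "prnp": 0.0001,
                    "mad": 0.0005, "cow": 0.0005, "diseas": 0.002, "prp": 0.003}
docs = {
    "D1": "the role for prnp mad cow diseas prp prp".split(),
    "D2": "the for the for the prp".split(),
}
ranking = sorted(
    docs,
    key=lambda d: kl_score(query_model, docs[d], collection_model, mu=10.0),
    reverse=True,
)
print(ranking)
```

The toy document D1 contains every topic word and D2 contains none, so D1 ranks first; the small `mu` is chosen only because the toy documents are short.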

Page 7: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 7

Model-Based Feedback [Zhai & Lafferty 01]

Topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2

Feedback documents D1, D2, …, Dk, with text snippets such as:
The…for… spongiform…PrP protein…
Prion diseases… that…(PrP C)…This…
…which…(PrP C)…to the…prion protein…

Background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004

Feedback model (unknown before estimation): the ?, for ?, …, prp ?, prion ?

Feedback model (estimated with the EM algorithm): the 0.003, for 0.002, …, prp 0.02, prion 0.05

2 parameters: α and λ
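A minimal sketch of the estimation step, assuming the usual two-component mixture reading of model-based feedback: each word in the feedback documents is generated either by the unknown feedback model or by the background model, with λ as the background weight, and the estimated feedback model is then interpolated with the topic model using α. The function names and the toy feedback documents are hypothetical; the background probabilities are the ones shown on the slide.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, background, lam=0.5, iters=50):
    """EM for p(w|F) in the mixture (1 - lam) * p(w|F) + lam * p(w|C)."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform starting point
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the feedback model.
        z = {
            w: (1 - lam) * theta[w]
            / ((1 - lam) * theta[w] + lam * background.get(w, 1e-9))
            for w in vocab
        }
        # M-step: renormalize the expected feedback-model counts.
        expected = {w: counts[w] * z[w] for w in vocab}
        total = sum(expected.values())
        theta = {w: expected[w] / total for w in vocab}
    return theta

def interpolate(query_model, feedback_model, alpha=0.5):
    """Updated query model: (1 - alpha) * p(w|Q) + alpha * p(w|F)."""
    words = set(query_model) | set(feedback_model)
    return {
        w: (1 - alpha) * query_model.get(w, 0.0) + alpha * feedback_model.get(w, 0.0)
        for w in words
    }

# Background probabilities from the slide; the feedback documents are toy data.
background = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}
feedback_docs = [
    "the prion prp protein the for prion".split(),
    "prion prp the prion diseas".split(),
]
query_model = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}

feedback_model = estimate_feedback_model(feedback_docs, background, lam=0.5)
updated_query = interpolate(query_model, feedback_model, alpha=0.5)
```

Both λ and α are fixed by hand here, which matches the slide's point that this approach has two parameters to set.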

Page 10: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 10

Regularized Estimation [Tao & Zhai 06]

Topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2

Feedback documents D1, D2, …, Dk, with text snippets such as:
The…for… spongiform…PrP protein…
Prion diseases… that…(PrP C)…This…
…which…(PrP C)…to the…prion protein…

Background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004

Feedback model (unknown before estimation): the ?, for ?, …, prp ?, prion ?

Feedback model (estimated with the regularized EM algorithm, using a prior): the 0.003, for 0.002, …, prp 0.02, prion 0.05

1 parameter: η

Page 13: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 13

Original vs. Regularized EM

[Diagram: two panels over the feedback documents D1, D2, …, Dk. Original: α manually set. Regularized: α dynamically set.]


Page 15: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 15

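HMM Passage Extraction [Jiang & Zhai 06]

[Diagram: an HMM with states B1, R, and B2 generates the words w of a paragraph. The state sequence is a run of background states, then a run of R states, then background states again; the words generated by R form the relevant passage.]

Output probabilities:
p(w|B1)   the: 0.02    for: 0.01    prp: 0.001   …
p(w|R)    the: 0.003   for: 0.002   prp: 0.02    …
p(w|B2)   the: 0.02    for: 0.01    prp: 0.001   …

Transition probabilities:
p(B1|B1) = 0.9    p(R|B1) = 0.1
p(R|R) = 0.95     p(B2|R) = 0.05
p(B2|B2) = 1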
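To make the three-state picture concrete, here is a small Viterbi decoding sketch that uses the output and transition probabilities shown on this slide. It is only an illustration of the idea: the starting state (B1), the floor probability for words the slide does not list, and the function names are assumptions, and the actual system in [Jiang & Zhai 06] uses a richer architecture.

```python
import math

STATES = ["B1", "R", "B2"]
TRANS = {  # transition probabilities from the slide; missing pairs are 0
    ("B1", "B1"): 0.9, ("B1", "R"): 0.1,
    ("R", "R"): 0.95, ("R", "B2"): 0.05,
    ("B2", "B2"): 1.0,
}
EMIT = {  # output probabilities from the slide
    "B1": {"the": 0.02, "for": 0.01, "prp": 0.001},
    "R": {"the": 0.003, "for": 0.002, "prp": 0.02},
    "B2": {"the": 0.02, "for": 0.01, "prp": 0.001},
}
FLOOR = 1e-4  # assumed probability for words the slide does not list

def safe_log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else -math.inf

def viterbi(words, start="B1"):
    """Most likely state sequence; the paragraph is assumed to begin in `start`."""
    score = {
        s: safe_log(EMIT[s].get(words[0], FLOOR)) if s == start else -math.inf
        for s in STATES
    }
    back = []
    for w in words[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda r: score[r] + safe_log(TRANS.get((r, s), 0.0)))
            new_score[s] = (score[prev] + safe_log(TRANS.get((prev, s), 0.0))
                            + safe_log(EMIT[s].get(w, FLOOR)))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def extract_passage(words):
    """Words labeled R by the Viterbi path, i.e. the extracted passage."""
    return [w for w, s in zip(words, viterbi(words)) if s == "R"]

words = "the for the prp prp prp the for the".split()
print(viterbi(words))
print(extract_passage(words))
```

Because the states can only be visited in the order B1, R, B2, the decoded R words always form one contiguous span, which is the coherence property the conclusions slide refers to.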

Page 16: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 16

HMM Passage Extraction [Jiang & Zhai 06]

[Diagram: an extended HMM with states B1, B2, R, B3, and E. Labels in the diagram:]
• A background state for smoothing
• End-of-paragraph state
• Transition probabilities estimated from observations

Page 17: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 17

Experiment Design

• Pre-processing
– HTML parsing
– Paragraph boundaries
– Tokenization
• User relevance feedback

Page 18: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 18

Official Runs

[Diagram: query Q goes to KL-Div Retrieval over paragraphs of Medline articles, which produces ranked paragraphs 1, 2, …, k; HMM Passage Extraction turns these into ranked passages. An updated query Q' is also shown.]

Page 19: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 19

UIUCauto

[Diagram: query Q goes to KL-Div Retrieval over paragraphs of Medline articles, which produces ranked paragraphs 1, 2, …, k; HMM Passage Extraction turns these into ranked passages. The updated query Q' is labeled "regularized estimation".]

Page 20: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 20

UIUCinter

[Diagram: query Q goes to KL-Div Retrieval over paragraphs of Medline articles, which produces ranked paragraphs 1, 2, …, k; HMM Passage Extraction turns these into ranked passages. The updated query Q' is labeled "regularized estimation".]

Page 21: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 21

UIUCinter2

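[Diagram: query Q goes to KL-Div Retrieval over paragraphs of Medline articles, which produces ranked paragraphs 1, 2, …, k; HMM Passage Extraction turns these into ranked passages. The updated query is labeled "original estimation"; other labels in the diagram: Q', F.]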

Page 22: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 22

Pseudo Relevance Feedback (k = 10)

Method                         Doc MAP   Rel. Impr.
Baseline (no feedback)         0.3484    N/A
Original Estimation, Def       0.3606    +3.50%
Original Estimation, Opt       0.3943    +13.2%
Regularized Estimation, Def    0.3842    +10.3%    (UIUCauto)
Regularized Estimation, Opt    0.3952    +13.4%

η is similar to λ / (1 − λ)
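The relative improvement column can be rechecked from the Doc MAP values with a few lines of arithmetic. This is just a sanity-check sketch; the function name and the dictionary keys are made up for illustration, and the MAP values are the ones reported above.

```python
def relative_improvement(baseline, value):
    """Relative change of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline

baseline = 0.3484  # Doc MAP without feedback
runs = {
    "original, default": 0.3606,
    "original, optimal": 0.3943,
    "regularized, default (UIUCauto)": 0.3842,
    "regularized, optimal": 0.3952,
}
for name, doc_map in runs.items():
    print(f"{name}: {relative_improvement(baseline, doc_map):+.1f}%")
```

Rounded to one decimal place, the computed values agree with the Rel. Impr. figures in the table.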

Page 25: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 25

Parameter Sensitivity (pseudo feedback, k = 10)

Page 26: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 26

User Relevance Feedback

                               Doc MAP
Method                         Pseudo Feedback      User Feedback         Rel. Impr.
Original Estimation, Def       0.3606               0.3986                +10.5%
Original Estimation, Opt       0.3943               0.4511                +14.4%
Regularized Estimation, Def    0.3842 (UIUCauto)    0.4261 (UIUCinter)    +10.9%
Regularized Estimation, Opt    0.3952               0.4515                +14.2%

Page 29: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 29

HMM Passage Extraction

Psg MAP
Method        Paragraph   HMM Passage   Rel. Impr.
UIUCauto      0.03753     0.04864       +29.6%
UIUCinter     0.04481     0.05906       +31.8%
UIUCinter2    0.04580     0.06038       +31.8%

Page 30: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 30

Passage Length (In Bytes)

                 Max    Min   Avg      Std
True Passages    6928   27    399.8    489.4
HMM Passages     6955   34    1525.8   949.7
Paragraph        8670   60    2105.4   1136.8

HMM passages are generally too long!

Page 31: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 31

Example Passage

Prion diseases, which include Creutzfeldt-Jacob disease in humans, mad cow disease in cattle, and scrapie in sheep, involve the misfolding of the benign cellular prion protein (PrP C) 1 to the infectious disease-causing scrapie isoform PrP Sc. The prion protein (PrP C) is a copper-binding cell surface glycoprotein. The role of copper in the normal function of PrP, as well as in prion diseases, has been the subject of a number of excellent reviews. The mature cellular form of PrP consists of residues 23 to 231 and is tethered to the cell surface via a glycosylphosphatidylinositol anchor at the C terminus. There are now a number of NMR solution structures of copper-free mammalian PrPs. A crystal structure of PrP C has also been published; this structure is dimeric involving domain swapping of the monomeric form.

Page 33: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 33

Conclusions and Future Work

• The two language modeling methods in general work well in the genomics domain
– Regularized feedback estimation can effectively eliminate parameter α
– HMM passages improve over paragraphs
• User relevance feedback is effective
• Limitations and future work
– Regularized feedback estimation still has parameter η to tune
  • How to eliminate η?
– The inherent coherence property of HMM passages may not suit the task well
  • Different/better HMM architecture?

Page 34: Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track

11/16/06 34

The End

• Questions?