Robust Pseudo Feedback & HMM Passage Extraction UIUC at TREC 2006 Genomics Track
Robust Pseudo Feedback & HMM Passage Extraction
UIUC at TREC 2006 Genomics Track
Jing Jiang, Xin He, ChengXiang Zhai
University of Illinois at Urbana-Champaign
11/16/06 2
Goal of Participation
• To test the effectiveness of some recent language modeling methods for genomics retrieval
  – Robust pseudo feedback [Tao & Zhai 06]
  – HMM passage extraction [Jiang & Zhai 06]
• Tasks at the 2006 Genomics Track
  – Document-level retrieval
  – Passage-level retrieval
  – Aspect-level retrieval
11/16/06 3
Overall Approach
[Diagram: a query Q enters the Document Retrieval Module, which ranks paragraphs 1, 2, …, k of Medline articles; the Passage Extraction Module turns the ranked paragraphs into ranked passages. Feedback labels: user relevance feedback, pseudo relevance feedback.]
11/16/06 5
KL-Divergence Retrieval Model [Lafferty & Zhai 01]
Topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
Document model: the 0.020, for 0.015, prp 0.102, mad 0.034, cow 0.034, diseas 0.068, …
Documents D1, D2, …, Dk (excerpts):
  The…for… spongiform…PrP protein…
  Prion diseases… that…(PrP C)…This…
  …which…(PrP C)…to the…prion protein…
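A minimal sketch of how the topic and document models above could be compared. Ranking by negative KL divergence from the topic model to the document model is rank-equivalent to ranking by the cross entropy term computed here. The Dirichlet smoothing, the weight `mu`, the floor for unseen words, the function names, and the toy documents are all assumptions for illustration; the slide does not specify them.

```python
import math
from collections import Counter

def doc_model(tokens, collection_prob, mu=1000.0):
    """Dirichlet-smoothed unigram model p(w|D); mu is an assumed smoothing weight."""
    counts = Counter(tokens)
    length = len(tokens)

    def p(word):
        # Words missing from the collection get a tiny floor so log() stays defined.
        return (counts[word] + mu * collection_prob.get(word, 1e-9)) / (length + mu)

    return p

def kl_score(topic_model, p_doc):
    """Sum of p(w|topic) * log p(w|D); rank-equivalent to -KL(topic || document)."""
    return sum(pq * math.log(p_doc(w)) for w, pq in topic_model.items() if pq > 0)

# Topic model from the slide: five stemmed terms with uniform weights.
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}

# Toy documents and collection statistics, invented for this example only.
docs = {
    "D1": "prion diseas mad cow diseas prp protein role".split(),
    "D2": "the protein structur for the cell surfac".split(),
}
all_tokens = [t for tokens in docs.values() for t in tokens]
collection = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

scores = {name: kl_score(topic, doc_model(tokens, collection, mu=10.0))
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setting the document that shares terms with the topic model ranks first; with real data the smoothing weight would need tuning.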
11/16/06 7
Model-Based Feedback [Zhai & Lafferty 01]
Topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
Feedback documents D1, D2, …, Dk (excerpts):
  The…for… spongiform…PrP protein…
  Prion diseases… that…(PrP C)…This…
  …which…(PrP C)…to the…prion protein…
Background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004
Feedback model (to be estimated): the ?, for ?, …, prp ?, prion ?
Feedback model (estimated with the EM algorithm): the 0.003, for 0.002, …, prp 0.02, prion 0.05
2 parameters: α and λ
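The estimation step above can be sketched as a two-component mixture fitted with EM. This is a simplified illustration, not the authors' implementation. It assumes the usual reading of [Zhai & Lafferty 01]: λ (`lam`) is the weight of the background model when generating feedback-document words, and α (`alpha`) is the weight of the feedback model when interpolating it with the topic model. The function names, the toy feedback documents, and the floor for unseen background words are hypothetical.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, background, lam=0.5, iters=50):
    """EM for p(w) = (1 - lam) * p(w|feedback) + lam * p(w|background)."""
    counts = Counter(t for doc in feedback_docs for t in doc)
    vocab = list(counts)
    p_f = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the feedback model.
        resp = {}
        for w in vocab:
            topic_part = (1 - lam) * p_f[w]
            resp[w] = topic_part / (topic_part + lam * background.get(w, 1e-9))
        # M-step: re-estimate the feedback model from the expected counts.
        norm = sum(counts[w] * resp[w] for w in vocab)
        p_f = {w: counts[w] * resp[w] / norm for w in vocab}
    return p_f

def interpolate(topic_model, feedback_model, alpha=0.5):
    """Updated query model: (1 - alpha) * topic + alpha * feedback."""
    vocab = set(topic_model) | set(feedback_model)
    return {w: (1 - alpha) * topic_model.get(w, 0.0) + alpha * feedback_model.get(w, 0.0)
            for w in vocab}

# Background probabilities from the slide; feedback documents are toy data.
background = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}
feedback_docs = [["the", "prion", "prp", "the", "prion"],
                 ["for", "prp", "prion", "the"]]
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}

feedback_model = estimate_feedback_model(feedback_docs, background, lam=0.5)
updated_query = interpolate(topic, feedback_model, alpha=0.5)
```

Words that the background model already explains well ("the") end up with less feedback-model mass than equally frequent content words ("prion"), which is the effect the slide's numbers illustrate.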
11/16/06 10
Regularized Estimation [Tao & Zhai 06]
Topic model: role 0.2, prnp 0.2, mad 0.2, cow 0.2, diseas 0.2
Feedback documents D1, D2, …, Dk (excerpts):
  The…for… spongiform…PrP protein…
  Prion diseases… that…(PrP C)…This…
  …which…(PrP C)…to the…prion protein…
Background model: the 0.02, for 0.01, …, prp 0.003, prion 0.004
Feedback model (to be estimated): the ?, for ?, …, prp ?, prion ?
Feedback model (estimated with the regularized EM algorithm, using a prior): the 0.003, for 0.002, …, prp 0.02, prion 0.05
1 parameter: η
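One generic way to bring a prior into EM is a MAP M-step that adds pseudo-counts from the prior distribution. The sketch below shows only that generic idea, using the topic model as the prior. It is not the regularized EM algorithm of [Tao & Zhai 06], whose details are not given on the slide, and the hypothetical `prior_weight` should not be read as η. The toy feedback documents and all names are assumptions.

```python
from collections import Counter

def map_feedback_model(feedback_docs, background, prior,
                       prior_weight=5.0, lam=0.5, iters=50):
    """EM with a MAP M-step: expected counts plus prior_weight * prior(w)."""
    counts = Counter(t for doc in feedback_docs for t in doc)
    vocab = sorted(set(counts) | set(prior))
    p_f = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iters):
        expected = {}
        for w in vocab:
            # E-step: share of each occurrence attributed to the feedback model.
            topic_part = (1 - lam) * p_f[w]
            resp = topic_part / (topic_part + lam * background.get(w, 1e-9))
            expected[w] = counts[w] * resp
        # MAP M-step: the prior acts like prior_weight extra observations.
        norm = sum(expected.values()) + prior_weight
        p_f = {w: (expected[w] + prior_weight * prior.get(w, 0.0)) / norm
               for w in vocab}
    return p_f

background = {"the": 0.02, "for": 0.01, "prp": 0.003, "prion": 0.004}
topic = {"role": 0.2, "prnp": 0.2, "mad": 0.2, "cow": 0.2, "diseas": 0.2}
feedback_docs = [["the", "prion", "prp", "the", "prion"],
                 ["for", "prp", "prion", "the"]]

weak = map_feedback_model(feedback_docs, background, topic, prior_weight=1.0)
strong = map_feedback_model(feedback_docs, background, topic, prior_weight=100.0)
```

A larger `prior_weight` keeps the estimate closer to the topic model; a smaller one lets the feedback documents dominate.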
11/16/06 13
Original vs. Regularized EM
[Diagram: two panels, "original" and "regularized", each over the feedback documents D1, D2, …, Dk.]
• Original: α manually set
• Regularized: α dynamically set
11/16/06 15
HMM Passage Extraction [Jiang & Zhai 06]
HMM states: B1 → R → B2
Output probabilities:
  p(w|B1): the 0.02, for 0.01, prp 0.001, …
  p(w|R): the 0.003, for 0.002, prp 0.02, …
  p(w|B2): the 0.02, for 0.01, prp 0.001, …
Transition probabilities:
  p(B1|B1) = 0.9, p(R|B1) = 0.1
  p(R|R) = 0.95, p(B2|R) = 0.05
  p(B2|B2) = 1
Paragraph: w w … w w w … w w w … w
States:    B B … B R R … R B B … B (the span of R states is the relevant passage)
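A minimal Viterbi sketch of how a passage could be read off this three-state HMM. The transition and output probabilities are the ones on the slide; the start state (B1), the `UNSEEN` floor for words the slide elides with "…", the function names, and the toy word sequence are assumptions.

```python
import math

STATES = ["B1", "R", "B2"]
TRANS = {  # transition probabilities from the slide
    "B1": {"B1": 0.9, "R": 0.1},
    "R": {"R": 0.95, "B2": 0.05},
    "B2": {"B2": 1.0},
}
EMIT = {  # output probabilities from the slide (only three words are given)
    "B1": {"the": 0.02, "for": 0.01, "prp": 0.001},
    "R": {"the": 0.003, "for": 0.002, "prp": 0.02},
    "B2": {"the": 0.02, "for": 0.01, "prp": 0.001},
}
UNSEEN = 1e-4  # assumed floor for words not listed on the slide
NEG_INF = float("-inf")

def log_emit(state, word):
    return math.log(EMIT[state].get(word, UNSEEN))

def viterbi(words, start="B1"):
    """Most likely state sequence, assuming the paragraph starts in `start`."""
    best = {s: (log_emit(s, words[0]) if s == start else NEG_INF) for s in STATES}
    back = []
    for word in words[1:]:
        step, pointers = {}, {}
        for s in STATES:
            candidates = [(best[p] + math.log(TRANS[p][s]), p)
                          for p in STATES if s in TRANS[p] and best[p] > NEG_INF]
            if candidates:
                score, prev = max(candidates)
                step[s], pointers[s] = score + log_emit(s, word), prev
            else:
                step[s], pointers[s] = NEG_INF, None
        best = step
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

words = ["the", "for", "the", "prp", "prp", "prp", "the", "for", "the"]
path = viterbi(words)
passage = [w for w, s in zip(words, path) if s == "R"]
print(path, passage)
```

The words decoded as R form the extracted passage. Because R can only be entered once and B2 is absorbing, the passage is always one contiguous span.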
11/16/06 16
HMM Passage Extraction[Jiang & Zhai 06]
[Diagram: extended HMM with states B1, R, B2, B3, and E.]
• A background state for smoothing
• An end-of-paragraph state
• Transition probabilities estimated from observations
11/16/06 17
Experiment Design
• Pre-processing
  – HTML parsing
  – Paragraph boundaries
  – Tokenization
• User relevance feedback
11/16/06 18
Official Runs
[Diagram: Q → KL-Div Retrieval over paragraphs of Medline articles → ranked paragraphs 1, 2, …, k → HMM Passage Extraction → ranked passages. The diagram also shows an updated query Q'.]
11/16/06 19
UIUCauto
[Diagram: Q → KL-Div Retrieval over paragraphs of Medline articles → ranked paragraphs 1, 2, …, k → HMM Passage Extraction → ranked passages. Updated query Q' obtained by regularized estimation.]
11/16/06 20
UIUCinter
[Diagram: Q → KL-Div Retrieval over paragraphs of Medline articles → ranked paragraphs 1, 2, …, k → HMM Passage Extraction → ranked passages. Updated query Q' obtained by regularized estimation.]
11/16/06 21
UIUCinter2
[Diagram: Q → KL-Div Retrieval over paragraphs of Medline articles → ranked paragraphs 1, 2, …, k → HMM Passage Extraction → ranked passages. Diagram labels: "original estimation" and "Q'F".]
11/16/06 22
Pseudo Relevance Feedback (k = 10)
Method                                   Doc MAP   Rel. Impr.
Baseline (no feedback)                   0.3484    N/A
Original Estimation, Def                 0.3606    +3.50%
Original Estimation, Opt                 0.3943    +13.2%
Regularized Estimation, Def (UIUCauto)   0.3842    +10.3%
Regularized Estimation, Opt              0.3952    +13.4%
η is similar to λ / (1 − λ)
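The "Rel. Impr." column can be reproduced from the Doc MAP values as (score − baseline) / baseline. A small check, with the MAP values copied from the table and a hypothetical helper name:

```python
def relative_improvement(baseline, score):
    """Relative improvement of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0

baseline_map = 0.3484  # Baseline (no feedback)
doc_map = {
    "Original, Def": 0.3606,
    "Original, Opt": 0.3943,
    "Regularized, Def": 0.3842,
    "Regularized, Opt": 0.3952,
}
improvements = {name: relative_improvement(baseline_map, score)
                for name, score in doc_map.items()}
for name, value in improvements.items():
    print(f"{name}: {value:+.1f}%")
```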
11/16/06 25
Parameter Sensitivity (pseudo feedback, k = 10)
11/16/06 26
User Relevance Feedback
Method                        Doc MAP (Pseudo Feedback)   Doc MAP (User Feedback)   Rel. Impr.
Original Estimation, Def      0.3606                      0.3986                    +10.5%
Original Estimation, Opt      0.3943                      0.4511                    +14.4%
Regularized Estimation, Def   0.3842 (UIUCauto)           0.4261 (UIUCinter)        +10.9%
Regularized Estimation, Opt   0.3952                      0.4515                    +14.2%
11/16/06 29
HMM Passage Extraction
Run          Psg MAP (Paragraph)   Psg MAP (HMM Passage)   Rel. Impr.
UIUCauto     0.03753               0.04864                 +29.6%
UIUCinter    0.04481               0.05906                 +31.8%
UIUCinter2   0.04580               0.06038                 +31.8%
11/16/06 30
Passage Length (In Bytes)
Type            Max    Min   Avg      Std
True Passages   6928   27    399.8    489.4
HMM Passages    6955   34    1525.8   949.7
Paragraph       8670   60    2105.4   1136.8
HMM passages are generally too long!
11/16/06 31
Example Passage
Prion diseases, which include Creutzfeldt-Jakob disease in humans, mad cow disease in cattle, and scrapie in sheep, involve the misfolding of the benign cellular prion protein (PrP C) to the infectious disease-causing scrapie isoform PrP Sc. The prion protein (PrP C) is a copper-binding cell surface glycoprotein. The role of copper in the normal function of PrP, as well as in prion diseases, has been the subject of a number of excellent reviews. The mature cellular form of PrP consists of residues 23 to 231 and is tethered to the cell surface via a glycosylphosphatidylinositol anchor at the C terminus. There are now a number of NMR solution structures of copper-free mammalian PrPs. A crystal structure of PrP C has also been published; this structure is dimeric involving domain swapping of the monomeric form.
11/16/06 33
Conclusions and Future Work
• The two language modeling methods in general work well in the genomics domain
  – Regularized feedback estimation can effectively eliminate parameter α
  – HMM passages improve over paragraphs
• User relevance feedback is effective
• Limitations and future work
  – Regularized feedback estimation still has parameter η to tune
    • How to eliminate η?
  – The inherent coherence property of HMM passages may not suit the task well
    • Different/better HMM architecture?
11/16/06 34
The End
• Questions?