Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical...
![Page 1: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/1.jpg)
Learning Ensembles of First-Order Clauses for Recall-Precision Curves
A Case Study in Biomedical Information Extraction
Mark Goadrich, Louis Oliphant and Jude Shavlik
Department of Computer Sciences
University of Wisconsin – Madison, USA
6 Sept 2004
![Page 2: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/2.jpg)
Talk Outline
- Link Learning and ILP
- Our Gleaner Approach
- Aleph Ensembles
- Biomedical Information Extraction
- Evaluation and Results
- Future Work
![Page 3: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/3.jpg)
ILP Domains
- Object Learning
  - Trains, Carcinogenesis
- Link Learning
  - Binary predicates
![Page 4: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/4.jpg)
Link Learning
- Large skew toward negatives
  - 500 relational objects with 5,000 positive links means 245,000 negative links
- Difficult to measure success
  - An always-negative classifier is 98% accurate
  - ROC curves look overly optimistic
- Enormous quantity of data
  - 4,285,199,774 web pages indexed by Google
  - PubMed includes over 15 million citations
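
The skew figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every ordered pair of the 500 objects counts as a candidate link (an assumption; it is the reading that reproduces the slide's 245,000). The variable names are illustrative only.

```python
# Worked arithmetic for the link-learning skew example.
# Assumption: all 500 x 500 ordered pairs of objects are candidate links.
objects = 500
positive_links = 5_000

candidate_links = objects * objects                  # every ordered pair
negative_links = candidate_links - positive_links    # pairs that are not links

# A classifier that always answers "negative" is right on every negative.
always_negative_accuracy = negative_links / candidate_links

print(negative_links)                       # number of negative links
print(f"{always_negative_accuracy:.0%}")    # accuracy of the trivial classifier
```

Accuracy rewards the trivial classifier here, which is why the rest of the talk evaluates with recall and precision instead.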
![Page 5: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/5.jpg)
Our Approach
- Develop fast ensemble algorithms focused on recall and precision evaluation
- Key Ideas of Gleaner
  - Keep wide range of clauses
  - Create separate theories for different recall ranges
- Evaluation
  - Area Under Recall-Precision Curve (AURPC)
  - Time = Number of clauses considered
![Page 6: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/6.jpg)
Gleaner - Background
- Focus evaluation on positive examples
  - $\text{Recall} = \frac{TP}{TP + FN}$
  - $\text{Precision} = \frac{TP}{TP + FP}$
- Rapid Random Restart (Zelezny et al., ILP 2002)
  - Stochastic selection of starting clause
  - Time-limited local heuristic search
  - We store a variety of clauses (based on recall)
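
To illustrate why evaluation focuses on positive examples, the sketch below computes recall and precision from confusion-matrix counts and contrasts them with accuracy. The counts in `small` and `large` are made up for illustration and are not from the talk.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of the actual positives that were retrieved: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of the predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: identical TP/FP/FN, very different numbers of true negatives.
small = dict(tp=50, fp=50, fn=50, tn=100)
large = dict(tp=50, fp=50, fn=50, tn=100_000)

for counts in (small, large):
    print(
        recall(counts["tp"], counts["fn"]),
        precision(counts["tp"], counts["fp"]),
        round(accuracy(**counts), 3),
    )
```

Recall and precision are the same for both sets of counts because neither formula uses true negatives, while accuracy climbs toward 1 as negatives are added.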
![Page 7: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/7.jpg)
Gleaner - Learning
[Figure: clauses in recall-precision space; axes: Precision, Recall]
- Create B Bins
- Generate Clauses
- Record Best
- Repeat for K seeds
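
The learning loop above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: clauses are plain strings, `generate_clauses` and `evaluate` are hypothetical callables standing in for the clause search and for scoring a clause on training data, and "best" is taken to mean highest precision × recall within a bin.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Clause = str  # hypothetical stand-in for a first-order clause


def bin_index(recall: float, num_bins: int) -> int:
    """Map a recall in [0, 1] to one of num_bins equal-width bins."""
    return min(int(recall * num_bins), num_bins - 1)


def glean(
    seeds: Iterable[str],
    generate_clauses: Callable[[str], Iterable[Clause]],
    evaluate: Callable[[Clause], Tuple[float, float]],
    num_bins: int,
) -> List[Dict[int, Clause]]:
    """For each seed, keep the best clause seen in each recall bin."""
    per_seed: List[Dict[int, Clause]] = []
    for seed in seeds:
        best: Dict[int, Tuple[float, Clause]] = {}
        for clause in generate_clauses(seed):
            precision, recall = evaluate(clause)
            score = precision * recall
            b = bin_index(recall, num_bins)
            if b not in best or score > best[b][0]:
                best[b] = (score, clause)
        per_seed.append({b: clause for b, (_, clause) in best.items()})
    return per_seed


# Toy data: clause name -> (precision, recall), and seed -> clauses "searched".
toy_stats = {"a": (0.90, 0.05), "b": (0.50, 0.08), "c": (0.40, 0.55), "d": (0.30, 0.52)}
toy_search = {"seed1": ["a", "b", "c"], "seed2": ["b", "d"]}

bins = glean(["seed1", "seed2"], toy_search.__getitem__, toy_stats.__getitem__, num_bins=10)
print(bins)
```

Keeping one clause per bin per seed is what gives each bin K clauses to combine later.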
![Page 8: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/8.jpg)
Gleaner - Combining
- Combine K clauses per bin
  - If at least L of K clauses match, call example positive
- How to choose L?
  - L=1 gives high recall, low precision
  - L=K gives low recall, high precision
- Our method
  - Choose L such that ensemble recall matches bin b
  - Bin b’s precision should be higher than any clause in it
- We should now have a set of high-precision rule sets spanning the space of recall levels
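
A minimal sketch of the "at least L of K" combination and of choosing L for one bin. It assumes each clause is represented by the set of example ids it covers and that "matches bin b" means "recall closest to a target recall for the bin"; both are assumptions made for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def vote_counts(covers: List[Set[int]]) -> Counter:
    """Count, for each example id, how many of the K clauses cover it."""
    counts: Counter = Counter()
    for covered in covers:
        counts.update(covered)
    return counts


def recall_precision_at(L: int, counts: Counter, positives: Set[int]) -> Tuple[float, float]:
    """Recall and precision of the rule 'at least L of K clauses match'."""
    predicted = {example for example, votes in counts.items() if votes >= L}
    tp = len(predicted & positives)
    recall = tp / len(positives)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / len(predicted) if predicted else 1.0
    return recall, precision


def choose_threshold(covers: List[Set[int]], positives: Set[int], target_recall: float) -> int:
    """Pick the L in 1..K whose ensemble recall is closest to the bin's target recall."""
    counts = vote_counts(covers)
    return min(
        range(1, len(covers) + 1),
        key=lambda L: abs(recall_precision_at(L, counts, positives)[0] - target_recall),
    )


# Toy bin with K = 3 clauses; examples 1..4 are positive, 10 and 11 are negative.
positives = {1, 2, 3, 4}
covers = [{1, 2, 3, 10}, {1, 2, 11}, {1, 4, 10}]

L = choose_threshold(covers, positives, target_recall=0.5)
print(L, recall_precision_at(L, vote_counts(covers), positives))
```

In the toy data, raising L from 1 to 3 trades recall for precision, which is the behaviour the slide describes for L=1 versus L=K.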
![Page 9: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/9.jpg)
How to use Gleaner
[Figure: recall-precision curve; axes: Precision, Recall; annotated with Recall = 0.50, Precision = 0.70]
- Generate Curve
- User Selects Recall Bin
- Return Classifications With Precision Confidence
![Page 10: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/10.jpg)
Aleph Ensembles
- We compare to ensembles of theories
- Algorithm (Dutra et al., ILP 2002)
  - Use K different initial seeds
  - Learn K theories containing C clauses
  - Rank examples by the number of theories that cover them
- Need to balance C for high performance
  - Small C leads to low recall
  - Large C leads to converging theories
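
The ranking step can be sketched as a simple vote count. This is an illustration under assumptions: a clause is modeled as a Python predicate on an example, a theory as a list of clauses that covers an example when any clause does, and the toy theories below are invented.

```python
from typing import Callable, List, Sequence

Clause = Callable[[int], bool]   # hypothetical: a clause is a test on an example
Theory = Sequence[Clause]        # a theory covers an example if any of its clauses does


def theory_votes(theories: Sequence[Theory], example: int) -> int:
    """Number of theories that cover the example."""
    return sum(any(clause(example) for clause in theory) for theory in theories)


def rank_examples(theories: Sequence[Theory], examples: Sequence[int]) -> List[int]:
    """Order examples from most to fewest covering theories (ties keep input order)."""
    return sorted(examples, key=lambda e: theory_votes(theories, e), reverse=True)


# Toy ensemble of K = 3 theories over integer "examples".
theories: List[Theory] = [
    [lambda x: x % 2 == 0],
    [lambda x: x > 4, lambda x: x == 2],
    [lambda x: x == 6],
]

ranking = rank_examples(theories, [1, 2, 3, 4, 5, 6])
print(ranking)
```

Thresholding this ranking at each distinct vote count yields one recall-precision point per threshold, which is how an ensemble of theories produces a curve rather than a single point.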
![Page 11: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/11.jpg)
Aleph Ensembles (100 theories)
[Figure: Testset AURPC (0.00 to 0.60) versus Number of Clauses Used Per Theory (0 to 300)]
![Page 12: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/12.jpg)
Biomedical Information Extraction
- Given: Medical Journal abstracts tagged with protein localization relations
- Do: Construct system to extract protein localization phrases from unseen text

NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
![Page 13: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/13.jpg)
Biomedical Information Extraction
- Hand-labeled dataset (Ray & Craven ’01)
  - 7,245 sentences from 871 abstracts
  - Examples are phrase-phrase combinations
    - 1,810 positive & 279,154 negative
- 1.6 GB of background knowledge
  - Structural, Statistical, Lexical and Ontological
  - In total, 200+ distinct background predicates
![Page 14: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/14.jpg)
Evaluation Metrics
- Two dimensions
  - Area Under Recall-Precision Curve (AURPC)
    - All curves standardized to cover full recall range
    - Averaged AURPC over 5 folds
  - Number of clauses considered
    - Rough estimate of time
- Both are “stop anytime” parallel algorithms
[Figure: recall-precision plot; axes: Recall and Precision, each up to 1.0]
![Page 15: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/15.jpg)
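AURPC Interpolation
- Convex interpolation in RP space?
  - Precision interpolation is counterintuitive
  - Example: 1000 positive & 9000 negative

| TP | FP | TP Rate | FP Rate | Recall | Prec |
|------|------|---------|---------|--------|------|
| 500 | 500 | 0.50 | 0.06 | 0.50 | 0.50 |
| 750 | 4750 | 0.75 | 0.53 | 0.75 | 0.14 |
| 1000 | 9000 | 1.00 | 1.00 | 1.00 | 0.10 |

Column groups: Example Counts (TP, FP), ROC Curves (TP Rate, FP Rate), RP Curves (Recall, Prec).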
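
The 750 / 4750 row is what one gets by interpolating the example counts linearly between the two endpoints and then recomputing precision, rather than drawing a straight line in recall-precision space. The sketch below illustrates that reading; the function name `interpolate_rp` is hypothetical, and it assumes `tp_b > tp_a` and a starting point with at least one predicted positive.

```python
from typing import List, Tuple


def interpolate_rp(
    tp_a: int, fp_a: int, tp_b: int, fp_b: int, total_pos: int
) -> List[Tuple[float, float]]:
    """(recall, precision) for every TP count from A to B, with FP growing linearly in TP."""
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)   # extra false positives per extra true positive
    points = []
    for tp in range(tp_a, tp_b + 1):
        fp = fp_a + fp_per_tp * (tp - tp_a)
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Endpoints from the slide: 1000 positives and 9000 negatives in total.
curve = interpolate_rp(tp_a=500, fp_a=500, tp_b=1000, fp_b=9000, total_pos=1000)

mid_recall, mid_precision = curve[250]          # the point with TP = 750
straight_line_precision = (0.50 + 0.10) / 2     # naive linear interpolation in RP space

print(mid_recall, round(mid_precision, 2), straight_line_precision)
```

At recall 0.75 the count-based precision rounds to 0.14, matching the table, while a straight line between the endpoints would claim 0.30 and so overstate the area under the curve.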
![Page 16: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/16.jpg)
AURPC Interpolation
[Figure: Precision versus Recall (both 0.0 to 1.0), comparing Correct Interpolation with Incorrect Interpolation]
![Page 17: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/17.jpg)
Experimental Methodology
- Performed five-fold cross-validation
- Variation of parameters
  - Gleaner (20 recall bins)
    - # seeds = {25, 50, 75, 100}
    - # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
  - Ensembles (0.75 minacc, 35,000 nodes)
    - # theories = {10, 25, 50, 75, 100}
    - # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
![Page 18: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/18.jpg)
Results: Testfold 5 at 1,000,000 clauses
[Figure: recall-precision curves for Ensembles and Gleaner; axes: Recall and Precision, each 0.0 to 1.0]
![Page 19: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/19.jpg)
Results: Gleaner vs Aleph Ensembles
[Figure: Testset AURPC (0.00 to 0.50) versus Number of Clauses Generated (logarithmic scale, 10,000 to 100,000,000) for Gleaner and Aleph Ensembles]
![Page 20: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/20.jpg)
Conclusions
- Gleaner
  - Focuses on recall and precision
  - Keeps wide spectrum of clauses
  - Good results in few CPU cycles
- Aleph ensembles
  - ‘Early stopping’ helpful
  - Require more CPU cycles
- AURPC
  - Useful metric for comparison
  - Interpolation unintuitive
![Page 21: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/21.jpg)
Future Work
- Improve Gleaner performance over time
- Explore alternate clause combinations
- Better understanding of AURPC
- Search for clauses that optimize AURPC
- Examine more ILP link-learning datasets
- Use Gleaner with other ML algorithms
![Page 22: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/22.jpg)
Take-Home Message
- Definition of Gleaner
  - One who gathers grain left behind by reapers
- Gleaner and ILP
  - Many clauses constructed and evaluated in ILP hypothesis search
  - We need to make better use of those that aren’t the highest scoring ones

Thanks, Questions?
![Page 23: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/23.jpg)
Acknowledgements
- USA NLM Grant 5T15LM007359-02
- USA NLM Grant 1R01LM07050-01
- USA DARPA Grant F30602-01-2-0571
- USA Air Force Grant F30602-01-2-0571
- Condor Group
- David Page
- Vitor Santos Costa, Ines Dutra
- Soumya Ray, Marios Skounakis, Mark Craven

Dataset available at (URL in proceedings) ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
![Page 24: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/24.jpg)
Deleted Scenes
- Aleph Learning
- Clause Weighting
- Sample Gleaner Recall-Precision Curve
- Sample Extraction Clause
- Gleaner Algorithm

Director Commentary: on / off
![Page 25: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/25.jpg)
Aleph - Learning
- Aleph learns theories of clauses (Srinivasan, v4, 2003)
  - Pick a positive seed example and saturate
  - Use heuristic search to find best clause
  - Pick new seed from uncovered positives and repeat until threshold of positives covered
- Theory produces one recall-precision point
- Learning complete theories is time-consuming
- Can produce ranking with theory ensembles
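
The seed-and-cover loop described above can be sketched generically. This is not Aleph's code or API: saturation and the heuristic search are hidden behind a hypothetical `find_best_clause` callable that returns a clause name and the set of examples it covers, and the stopping threshold is expressed as a count, `min_covered`.

```python
from typing import Callable, List, Set, Tuple


def learn_theory(
    positives: Set[int],
    find_best_clause: Callable[[int], Tuple[str, Set[int]]],
    min_covered: int,
) -> List[str]:
    """Add clauses until at least min_covered positive examples are covered."""
    uncovered = set(positives)
    theory: List[str] = []
    while uncovered and len(positives) - len(uncovered) < min_covered:
        seed = min(uncovered)                    # deterministic seed choice for the sketch
        clause, covered = find_best_clause(seed)
        newly_covered = uncovered & covered
        if not newly_covered:
            uncovered.discard(seed)              # give up on this seed to guarantee progress
            continue
        theory.append(clause)
        uncovered -= newly_covered
    return theory


# Toy search results: seed example -> (clause name, examples the clause covers).
toy_clauses = {1: ("c1", {1, 2, 3}), 4: ("c2", {4}), 5: ("c3", {4, 5})}

full_theory = learn_theory({1, 2, 3, 4, 5}, toy_clauses.__getitem__, min_covered=5)
partial_theory = learn_theory({1, 2, 3, 4, 5}, toy_clauses.__getitem__, min_covered=4)
print(full_theory, partial_theory)
```

Each call produces one theory and therefore a single recall-precision point, which is why ensembles of such theories are needed to obtain a ranking.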
![Page 26: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/26.jpg)
Clause Weighting
- Single Theory Ensemble
  - Rank by how many clauses cover examples
- Weight clauses using tuneset statistics
  - CN2 (average precision of matching clauses)
  - Lowest False Positive Rate Score
  - Cumulative
    - F1 score
    - Recall
    - Precision
    - Diversity
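
As one concrete case, the CN2-style scheme named above (average precision of matching clauses) might look like the sketch below. Clauses are modeled as pairs of a covered-example set and a tune-set precision; that representation, the helper name `cn2_score`, and the toy numbers are all assumptions for illustration.

```python
from typing import List, Set, Tuple

WeightedClause = Tuple[Set[int], float]   # (examples the clause covers, tune-set precision)


def cn2_score(example: int, clauses: List[WeightedClause]) -> float:
    """Average tune-set precision of the clauses that match the example (0.0 if none match)."""
    matching = [prec for covered, prec in clauses if example in covered]
    return sum(matching) / len(matching) if matching else 0.0


# Toy clauses with made-up tune-set precisions.
clauses: List[WeightedClause] = [({1, 2}, 0.8), ({2, 3}, 0.4)]

scores = {example: cn2_score(example, clauses) for example in (1, 2, 3, 4)}
print(scores)
```

Sorting examples by this score gives a ranking that can be thresholded into a recall-precision curve, in the same way as ranking by the number of covering clauses.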
![Page 27: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/27.jpg)
Clause Weighting
[Figure: AURPC (0.00 to 0.45) by weighting scheme: Precision, Equal, Ranked List, CN2]
![Page 28: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/28.jpg)
Further Results
[Figure: Testset AURPC (0.00 to 0.50) versus Number of Clauses Generated (logarithmic scale, 10,000 to 100,000,000) for Gleaner, Aleph Ensembles and Ensembles 1K]
![Page 29: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/29.jpg)
Biomedical Information Extraction
[Figure: parse of the sentence "NPL3 encodes a nuclear protein with …" into noun phrase, verb phrase and prep phrase, with word tags noun, verb, article, adj, noun, prep and the word features alphanumeric and marked location]
![Page 30: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/30.jpg)
Sample Extraction Clause
- P = Protein, L = Location, S = Sentence
- 29% Recall, 34% Precision on testset 1
[Figure: clause diagram over sentence S with phrases A, B and C, labeled: article, contains alphanumeric, contains alphanumeric, P noun, L noun, contains marked location, contains no between halfX verb]
![Page 31: Learning Ensembles of First-Order Clauses for Recall-Precision Curves A Case Study in Biomedical Information Extraction Mark Goadrich, Louis Oliphant and.](https://reader036.fdocuments.in/reader036/viewer/2022081503/56649eb65503460f94bbeef9/html5/thumbnails/31.jpg)
Gleaner Algorithm
- Create B equal-sized recall bins
- For K different seeds
  - Generate rules using Rapid Random Restart
  - Record best rule (precision × recall) found for each bin
- For each recall bin B
  - Find threshold L of K clauses such that recall of “at least L of K clauses match examples” = recall for this bin
- Find recall and precision on testset using each bin’s “at least L of K” decision process