1/24 Learning to Extract Genic Interactions Using Gleaner LLL05 Workshop, 7 August 2005 ICML 2005, Bonn, Germany Mark Goadrich, Louis Oliphant and Jude Shavlik Department of Computer Sciences University of Wisconsin – Madison USA


1/24

Learning to Extract Genic Interactions Using Gleaner

LLL05 Workshop, 7 August 2005
ICML 2005, Bonn, Germany

Mark Goadrich, Louis Oliphant and Jude Shavlik
Department of Computer Sciences
University of Wisconsin – Madison USA


2/24

Learning Language in Logic

Biomedical Information Extraction Challenge
    Two tasks: with and without co-reference
    80 sentences for training
    40 sentences for testing

Our approach: Gleaner (ILP ’04)
    Fast ensemble ILP algorithm
    Focused on recall and precision evaluation


3/24

A Sample Positive Example

Given: Medical journal abstracts tagged with genic interaction relations
Do: Construct system to extract genic interaction phrases from unseen text

ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.


4/24

What is a Negative Example?

All unlabeled word pairings?
    Wastes time with irrelevant words
    We know the testset will include a dictionary

Use only unlabeled pairings of words in dictionary
    106 positive, 414 negative without co-reference
    59 positive, 261 negative with co-reference
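The candidate generation on this slide can be pictured with the short sketch below. It is only an illustration under stated assumptions: candidates are taken to be ordered (agent, target) pairs of distinct dictionary words within one sentence, the function name `candidate_pairs` and the two-word toy dictionary are hypothetical, and reading the sample sentence with SigK as agent and ykuD as target is an assumption rather than something stated on the slides.

```python
from itertools import permutations

def candidate_pairs(sentence_words, dictionary, labeled_positive):
    """Split ordered pairs of dictionary words into positives and negatives."""
    # Keep only words found in the dictionary (assumption: exact string match).
    named = [w for w in sentence_words if w in dictionary]
    positives, negatives = [], []
    # Assumption: every ordered pair of distinct dictionary words is a candidate.
    for agent, target in permutations(named, 2):
        if (agent, target) in labeled_positive:
            positives.append((agent, target))
        else:
            negatives.append((agent, target))  # unlabeled pairing, treated as negative
    return positives, negatives

words = "ykuD was transcribed by SigK RNA polymerase from T4 of sporulation".split()
pos, neg = candidate_pairs(words, {"ykuD", "SigK"}, {("SigK", "ykuD")})
```

Restricting the pairs to dictionary words is what keeps the negative set small compared with pairing every word in the sentence.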


5/24

Tagging and Parsing

[Figure: parse tree for "ykuD was transcribed by SigK RNA …". The sentence node splits into a noun phrase, a verb phrase, a prep phrase and a noun phrase; the words are tagged noun, verb, verb, prep, noun, noun.]


6/24

Some Additional Predicates

High-scoring words in agent phrases
    depend, bind, protein, …
High-scoring words in target phrases
    gene, promote, product
High-scoring BETWEEN agent & target
    negative, regulate, transcribe, …
Medical Subject Headings (MeSH)
    Canonized method for indexing biomedical articles
    in_mesh(RNA), in_mesh(gene)


7/24

Even More Predicates

Lexical Predicates
    internal_caps(Word)
    alphanumeric(Word)
Look-ahead Phrase Predicates
    few_POS_in_phrase(Phrase, POS)
    phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
    phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
    agent_before_target(ExampleID)
    word_pair_in_between_target_phrases(ExampleID, W1, W2)
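The slides name these predicates without defining them, so the sketch below is only a guess at what a few of them might test. The definitions are assumptions: `internal_caps` is read as "has an uppercase letter after the first character", `alphanumeric` as "mixes letters and digits", and `agent_before_target` is given a hypothetical word-list signature instead of the `ExampleID` argument used on the slide.

```python
def internal_caps(word):
    """Assumed meaning: some character after the first is uppercase (e.g. 'ykuD')."""
    return any(ch.isupper() for ch in word[1:])

def alphanumeric(word):
    """Assumed meaning: the word contains both letters and digits (e.g. 'T4')."""
    has_letter = any(ch.isalpha() for ch in word)
    has_digit = any(ch.isdigit() for ch in word)
    return has_letter and has_digit

def agent_before_target(words, agent, target):
    """Assumed meaning: the agent's first occurrence precedes the target's."""
    return words.index(agent) < words.index(target)

words = "ykuD was transcribed by SigK RNA polymerase from T4 of sporulation".split()
features = {w: (internal_caps(w), alphanumeric(w)) for w in words}
```

Cheap surface features of this kind are useful in this domain because gene and protein names often carry unusual capitalization or embedded digits.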


8/24

Enriched Data From Committee

Link Parser (CMU) creates parse tree
Root lemma of each word (not used)
27 Syntactic Information Predicates
    complement_of_N_N(Word, Word)
    modifier_ADV_V(Word, Word)
    object_V_Passive_N(Word, Word)


9/24

Gleaner

Definition of Gleaner
    One who gathers grain left behind by reapers
Key Ideas of Gleaner
    Use Aleph as underlying ILP clause engine
    Keep wide range of clauses usually discarded
    Create separate theories for different recall ranges


10/24

Aleph - Background

Seed Example
    A positive example that our clause must cover
Bottom Clause
    All predicates which are true about seed example

[Figure labels: seed, agent_target(A,T,S)]


11/24

Aleph - Learning

Aleph learns theories of clauses (Srinivasan, v4, 2003)
    Pick positive seed example, find bottom clause
    Use heuristic search to find best clause
    Pick new seed from uncovered positives and repeat until threshold of positives covered
Theory produces one recall-precision point
Learning complete theories is time-consuming
Can produce ranking with ensembles
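The covering loop described on this slide can be sketched in a few lines. This is a toy illustration, not Aleph itself: clauses are plain Python predicates, a fixed `candidates` list stands in for bottom-clause construction plus heuristic search, and the score (positives covered minus negatives covered) and the names `learn_theory` and `theory_point` are assumptions made for the example.

```python
def learn_theory(positives, negatives, candidates, min_covered=0.9):
    """Greedy covering: add clauses until enough positives are covered."""
    theory, covered = [], set()
    seeds = list(positives)
    while seeds and len(covered) < min_covered * len(positives):
        seed = seeds.pop(0)          # assumption: seeds are tried in order
        if seed in covered:
            continue                 # only uncovered positives act as seeds
        # Stand-in for the search: keep candidates that cover the seed.
        usable = [c for c in candidates if c(seed)]
        if not usable:
            continue                 # no clause covers this seed; try the next
        best = max(usable,
                   key=lambda c: sum(map(c, positives)) - sum(map(c, negatives)))
        theory.append(best)
        covered.update(p for p in positives if best(p))
    return theory, covered

def theory_point(theory, positives, negatives):
    """A theory (disjunction of clauses) yields one recall-precision point."""
    tp = sum(any(c(x) for c in theory) for x in positives)
    fp = sum(any(c(x) for c in theory) for x in negatives)
    precision = tp / (tp + fp) if tp + fp else 1.0
    return tp / len(positives), precision

def is_even(x): return x % 2 == 0
def is_big(x): return x > 7
def anything(x): return True

theory, covered = learn_theory([2, 4, 6, 8, 9], [1, 3, 5, 7],
                               [is_even, is_big, anything])
```

Because the finished theory is a single classifier, evaluating it gives one point in recall-precision space, which is the limitation the slide points out.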


12/24

Gleaner - Background

Rapid Random Restart (Zelezny et al ILP 2002)
    Stochastic selection of initial clause
    Time-limited local heuristic search
    Randomly choose new initial clause and repeat

[Figure labels: seed, initial 1, initial 2]
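The three steps of Rapid Random Restart can be illustrated with a small toy search. Everything concrete here is an assumption for the sake of a runnable example: a clause is modeled as a set of literals drawn from a bottom clause, a step budget stands in for the time limit, the local move toggles one literal, and `toy_score` and the literal names are hypothetical. The actual search in the cited work operates over clause refinements and is not reproduced here.

```python
import random

def rapid_random_restart(bottom, score, restarts=5, steps=20, seed=0):
    """Return the best-scoring subset of bottom-clause literals found."""
    rng = random.Random(seed)
    best = frozenset()
    best_score = score(best)
    for _ in range(restarts):
        # Stochastic selection of an initial clause: a random subset of literals.
        current = frozenset(lit for lit in bottom if rng.random() < 0.5)
        for _ in range(steps):       # step budget stands in for a time limit
            # Local move: toggle one literal, keep it if the score does not drop.
            neighbor = current ^ {rng.choice(bottom)}
            if score(neighbor) >= score(current):
                current = neighbor
        if score(current) > best_score:
            best, best_score = current, score(current)
    return best, best_score

bottom = ["lit_a", "lit_b", "lit_c", "lit_d"]   # hypothetical literals
useful = {"lit_a", "lit_b"}

def toy_score(clause):
    """Toy heuristic: reward useful literals, penalize the rest."""
    return len(clause & useful) - len(clause - useful)

best, best_score = rapid_random_restart(bottom, toy_score)
```

Restarting from fresh random clauses lets a short, cheap local search sample many regions of the clause space instead of committing to one long search.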


13/24

Gleaner - Learning

[Figure: precision (y-axis) versus recall (x-axis)]

Create B Bins
Generate Clauses
Record Best per Bin
Repeat for K seeds
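The four steps above can be sketched as follows. The slide does not say how "best" is scored within a bin, so this sketch assumes it means the highest precision among clauses whose recall falls in that bin; equal-width bins are also an assumption, and `generate_clauses` and `evaluate` are hypothetical callbacks standing in for the clause search and the training-set evaluation.

```python
def bin_index(recall, num_bins):
    """Map a recall in [0, 1] to one of num_bins equal-width bins."""
    return min(int(recall * num_bins), num_bins - 1)

def record_best(bins, clause, recall, precision):
    """Store the clause if it beats the current occupant of its recall bin."""
    i = bin_index(recall, len(bins))
    # Assumption: "best" means highest precision within the bin.
    if bins[i] is None or precision > bins[i][1]:
        bins[i] = (clause, precision, recall)

def gleaner_learn(seeds, generate_clauses, evaluate, num_bins):
    """For each of the K seeds, keep the best clause seen in every recall bin."""
    all_bins = []
    for seed in seeds:
        bins = [None] * num_bins             # Create B bins
        for clause in generate_clauses(seed):  # Generate clauses
            recall, precision = evaluate(clause)
            record_best(bins, clause, recall, precision)  # Record best per bin
        all_bins.append(bins)
    return all_bins

# With 20 bins, a clause at 0.14 recall lands in bin 2 and one at 0.76 in bin 15.
bins = [None] * 20
record_best(bins, "clause_a", 0.14, 0.93)
record_best(bins, "clause_b", 0.76, 0.49)
record_best(bins, "clause_c", 0.78, 0.40)   # same bin as clause_b, lower precision
```

Keeping one clause per bin per seed is what preserves the lower-recall and higher-recall clauses that a single best-clause search would discard.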


14/24

Gleaner - Combining

Combine K clauses per bin
    If at least L of K clauses match, call example positive
How to choose L?
    L=1 then high recall, low precision
    L=K then low recall, high precision
We want a collection of high precision theories spanning space of recall levels
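The "at least L of K" rule and its effect on recall and precision can be seen in a small worked example. The clauses and data below are toy stand-ins (integer examples and arithmetic predicates), and the function names are hypothetical; only the voting rule itself comes from the slide.

```python
def theory_predicts(clauses, example, L):
    """Call the example positive if at least L of the K clauses match it."""
    return sum(1 for c in clauses if c(example)) >= L

def point_for_threshold(clauses, L, positives, negatives):
    """Recall and precision of the L-of-K theory on labeled data."""
    tp = sum(theory_predicts(clauses, x, L) for x in positives)
    fp = sum(theory_predicts(clauses, x, L) for x in negatives)
    precision = tp / (tp + fp) if tp + fp else 1.0
    return tp / len(positives), precision

# Toy data: K = 3 clauses over integers.
clauses = [lambda x: x % 2 == 0, lambda x: x > 3, lambda x: x < 9]
positives, negatives = [4, 6, 8, 10], [1, 2, 3, 5]

# Sweeping L from 1 to K traces out the trade-off for this bin.
curve = [point_for_threshold(clauses, L, positives, negatives)
         for L in range(1, len(clauses) + 1)]
```

On this toy data, L=1 gives recall 1.0 with precision 0.5, while L=3 gives recall 0.75 with precision 1.0, matching the direction of the trade-off stated on the slide.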


15/24

Gleaner - Overlap

Take topmost curve of overlapping theories

[Figure: precision (y-axis) versus recall (x-axis)]
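One possible reading of "topmost curve" is sketched below: pool the (recall, precision) points from all theories and drop any point for which another point has at least as much recall and at least as much precision. This interpretation, and the function name `topmost_curve`, are assumptions; the slide does not spell out the exact rule.

```python
def topmost_curve(points):
    """Keep the (recall, precision) points not dominated by any other point."""
    kept = []
    top = -1.0
    # Scan from high recall to low (ties: higher precision first). A point
    # survives only if its precision exceeds every precision already seen
    # at equal or higher recall.
    for recall, precision in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if precision > top:
            kept.append((recall, precision))
            top = precision
    return sorted(kept)

# Toy points pooled from several overlapping theories.
points = [(0.2, 0.9), (0.2, 0.7), (0.5, 0.6), (0.4, 0.5), (0.8, 0.3), (0.7, 0.3)]
curve = topmost_curve(points)
```

For the toy points above, the surviving curve is (0.2, 0.9), (0.5, 0.6), (0.8, 0.3).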


17/24

Sample Extraction Clause

agent_target(Agent, Target, Sentence) :-
    several_phrases_in_sentence(Sentence),
    some_wordPOS_in_sentence(Sentence, novelword),
    n(Agent),
    alphabetic(Agent),
    word_parent(Agent, F),
    phrase_contains_internal_cap_word(F, noun, _),
    few_POS_in_phrase(F, novelword),
    in_between_target_phrases(Agent, Target, _),
    n(Target).

0.14 Recall, 0.93 Precision on without co-reference training set


18/24

Sample Extraction Clause

agent_target(Agent, Target, Sentence) :-
    avg_length_sentence(Sentence),
    n(Agent),
    word_previous(Target, _),
    in_between_target_phrases(Agent, Target, _).

0.76 Recall, 0.49 Precision on without co-reference training set


19/24

Experimental Methodology

Used other trainset for tuneset in both cases
Testset unlabeled, but dictionary provided
    Included sentences with no positives
    936 total testset examples generated

Parameter Settings
    Gleaner (20 recall bins)
        seeds = 100
        clauses = 25,000
    Aleph (0.75 minimum accuracy)
        nodes = {1K, 25K}


20/24

LLL Without Co-reference Results

[Figure: precision (y-axis, 0.0 to 1.0) versus recall (x-axis, 0.0 to 1.0) for Gleaner Basic, Gleaner Enriched, and Aleph Basic 1K]


21/24

LLL With Co-reference Results

[Figure: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1) for Gleaner Basic, Gleaner Enriched, and Aleph Basic 1K]


22/24

We Need More Datasets

LLL Challenge task is small
    Would prefer to do cross-validation
    Need labels for testset
Our ILP’04 dataset open to community
    ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
Biomedical information-extraction tasks
    Genetic Disorder (Ray and Craven 2001)
    Genia
    BioCreAtiVe


23/24

Conclusions

Contributions
    Develop large amount of background knowledge
    Exploit normally discarded clauses
    Visually present precision and recall trade-off
Proposed Work
    Achieve gains in High-Recall areas
    Reduce overfitting when using enriched data
    Increase diversity of learned clauses


24/24

Acknowledgements

USA DARPA Grant F30602-01-2-0571
USA Air Force Grant F30602-01-2-0571
USA NLM Grant 5T15LM007359-02
USA NLM Grant 1R01LM07050-01
UW Condor Group
David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jessie Davis