Relation Extraction for Academic Collaboration
10-709 Project Presentation
Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang
February 16, 2006
Academic Collaboration
When two academic researchers work together...
    on a proposal
    by co-authoring a paper
    by co-chairing a committee
    in the same project or research group
This is evidence of Academic Collaboration
Binary, symmetric relation
Arguments are of type <person>
Motivation
Why might we be interested in extracting Academic Collaboration relations?
    Social Networking
    Explore the transitivity of the relation
    Proof-of-concept for extending relation extraction machinery to other types of relations
Architectural Overview
[Architecture diagram. Components: IR, RelationExtractor, QueryFormulator (shown twice), PatternExtractor, PatternBank, RelationBank]
Co-training Algorithm
Do until termination condition is reached:
    For each pattern in the pattern bank:
        Generate an IR query and send it to the IR engine, getting back a set of documents
        For each document in the set, extract relations
    Score all relations (new and old)
    Remove relations below threshold
Co-training Algorithm II
    For each relation in the relation bank:
        Generate an IR query and send it to the IR engine, getting back a set of documents
        For each document in the set, extract context strings for patterns
    Score all patterns (new and old)
    Remove patterns below threshold
Loop
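The two halves of the loop can be sketched in Python as follows. This is a minimal illustration, not the project's code: the function `co_train`, the toy corpus, the quoted-phrase query format, and the `toy_*` helpers are all hypothetical. Because the slides leave the termination condition open, the sketch assumes the loop stops when neither bank changes or after `max_iterations`.

```python
import re
from typing import Callable, List, Set, Tuple

Relation = Tuple[str, str]   # (x, y) arguments of CollaboratesWith(x, y)
Pattern = str                # context string, e.g. "in collaboration with"


def co_train(seed_patterns: Set[Pattern], seed_relations: Set[Relation],
             search: Callable[[str], List[str]],
             extract_relations: Callable[[Pattern, str], Set[Relation]],
             extract_patterns: Callable[[Relation, str], Set[Pattern]],
             score_relation: Callable[[Relation, Set[Pattern]], float],
             score_pattern: Callable[[Pattern, Set[Relation]], float],
             threshold: float = 0.5, max_iterations: int = 5):
    # Copy the seeds so the caller's sets are not modified.
    patterns, relations = set(seed_patterns), set(seed_relations)
    for _ in range(max_iterations):
        before = (frozenset(patterns), frozenset(relations))

        # Half 1: patterns -> relations
        for pattern in sorted(patterns):
            for doc in search(f'"{pattern}"'):
                relations |= extract_relations(pattern, doc)
        relations = {r for r in relations
                     if score_relation(r, patterns) >= threshold}

        # Half 2: relations -> patterns
        for relation in sorted(relations):
            x, y = relation
            for doc in search(f'"{x}" "{y}"'):
                patterns |= extract_patterns(relation, doc)
        patterns = {p for p in patterns
                    if score_pattern(p, relations) >= threshold}

        # Assumed termination condition: nothing changed this iteration.
        if (frozenset(patterns), frozenset(relations)) == before:
            break
    return patterns, relations


# Toy stand-ins for the IR engine and the extractors (hypothetical).
CORPUS = [
    "Tom in collaboration with Roni",
    "Tom works with Roni",
    "William works with Ken",
]


def toy_search(query: str) -> List[str]:
    # A document matches when it contains every quoted phrase in the query.
    phrases = re.findall(r'"([^"]+)"', query)
    return [doc for doc in CORPUS if all(p in doc for p in phrases)]


def toy_extract_relations(pattern: Pattern, doc: str) -> Set[Relation]:
    m = re.search(r"(\w+) " + re.escape(pattern) + r" (\w+)", doc)
    return {(m.group(1), m.group(2))} if m else set()


def toy_extract_patterns(relation: Relation, doc: str) -> Set[Pattern]:
    x, y = relation
    m = re.search(re.escape(x) + r" (.+) " + re.escape(y), doc)
    return {m.group(1)} if m else set()
```

Calling `co_train` with one seed pattern, one seed relation, and scorers that always return 1.0 lets the toy corpus bootstrap a second pattern and a second relation. Real scorers would replace the constant ones.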
Extraction Pattern Formalism
From the proposal: “left” <x> “between” <y> “right”
    Arguments extracted with respect to context
Current status quo: “context string” <y>
    <x> argument extracted from page title
Extracts the relation CollaboratesWith( <x>, <y> )
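As a rough sketch of the current formalism, the function below takes <x> from the page title and <y> from the text that follows the context string. The name `apply_pattern` and the `NAME` regular expression (a run of capitalized words) are assumptions made for illustration; the slides do not specify how <y> is delimited.

```python
import re
from typing import Optional, Tuple

# Assumed shape of <y>: one or more capitalized words after the context string.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def apply_pattern(context: str, page_title: str,
                  page_text: str) -> Optional[Tuple[str, str]]:
    """Return (x, y) for CollaboratesWith(x, y), or None if no match."""
    match = re.search(re.escape(context) + r"\s+" + NAME, page_text)
    if match is None:
        return None
    # <x> comes from the page title, <y> from the text after the context.
    return (page_title, match.group(1))
```

For a page titled "Yan Liu" whose text contains "my advisor is Jaime Carbonell and ...", this returns the pair ("Yan Liu", "Jaime Carbonell"). A page titled "Personal" would yield "Personal" as <x>, which is why argument type checking is needed.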
Detecting Argument Types
CollaboratesWith( <x>, <y> )
    <x> and <y> must be of type <person>
Essential to weed out low-quality relations produced by noisy patterns such as “in collaboration with”
Heuristics currently encoded as regular expressions
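The project's actual regular expressions are not shown, so the following is only a guess at what such a heuristic could look like. The names `looks_like_person`, `TITLE`, `FULL_NAME`, and `NON_NAME_TOKENS` are hypothetical. The sketch assumes a person is a given name, optional middle initials, and at least one family name, so it rejects single-word candidates and names with hyphens or non-ASCII letters.

```python
import re

# Hypothetical heuristics: strip a leading academic title, reject known
# non-name words, then require "Given [M.] Family" capitalization.
TITLE = re.compile(r"^(?:Prof\.|Professor|Professors|Dr\.)\s+")
FULL_NAME = re.compile(r"^[A-Z][a-z]+(?: [A-Z]\.)*(?: [A-Z][a-z]+)+$")
NON_NAME_TOKENS = {"Personal", "Research"}


def looks_like_person(candidate: str) -> bool:
    stripped = TITLE.sub("", candidate.strip())
    if any(token in NON_NAME_TOKENS for token in stripped.split()):
        return False
    return FULL_NAME.match(stripped) is not None
```

Under these assumptions, "Jeannette M. Wing" and "Tom Reps" pass, while "Personal", "Professors Jonathan", and "Prof. Michael" are rejected.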
Measuring Confidence with Coverage
Confidence for an Extraction Pattern
    Intuitively, relations “vote” for patterns
    Query each relation, try to extract the pattern
    Score = proportion of successful relations
Confidence for a Relation
    Query each pattern, try to extract the relation
    Score = proportion of successful patterns
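Both scores are simple proportions, which the sketch below makes concrete. The set `cooccur` is a hypothetical stand-in for "the query was issued and the extraction succeeded"; the function names and the toy data are made up for illustration.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]
Pattern = str
# (pattern, relation) pairs for which querying one recovered the other.
Evidence = Set[Tuple[Pattern, Relation]]


def pattern_score(pattern: Pattern, relations: Set[Relation],
                  cooccur: Evidence) -> float:
    """Proportion of relations in the bank that 'vote' for this pattern."""
    if not relations:
        return 0.0
    return sum((pattern, r) in cooccur for r in relations) / len(relations)


def relation_score(relation: Relation, patterns: Set[Pattern],
                   cooccur: Evidence) -> float:
    """Proportion of patterns in the bank that recover this relation."""
    if not patterns:
        return 0.0
    return sum((p, relation) in cooccur for p in patterns) / len(patterns)


# Toy evidence (hypothetical): three patterns, two relations.
patterns = {"my advisor is", "I work with", "in collaboration with"}
tom_roni, william_ken = ("Tom", "Roni"), ("William", "Ken")
relations = {tom_roni, william_ken}
cooccur = {
    ("my advisor is", tom_roni),
    ("I work with", tom_roni),
    ("my advisor is", william_ken),
}
```

With three patterns in the bank, a relation recovered by two of them scores 2/3 and one recovered by a single pattern scores 1/3.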
Issues with Coverage as Confidence
Seed relations and patterns must co-occur
Very little tolerance for “new” information
    It is difficult for a new pattern that broadens the scope of the relations extracted to gain enough confidence to surpass the threshold
Scores tend to zero as pools grow
However, ad-hoc methods of combining confidence from one iteration to the next introduce a new problem: there is no way to oust bad relations or patterns once extracted
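The dilution effect follows directly from the proportion: if the number of supporting votes stays fixed while the pool grows, the score shrinks toward zero. The numbers below are invented purely to show the arithmetic, and `coverage_score` is a hypothetical helper.

```python
def coverage_score(successes: int, pool_size: int) -> float:
    """Coverage as a plain proportion; 0.0 for an empty pool."""
    return successes / pool_size if pool_size else 0.0


# A pattern supported by 3 relations, scored against growing relation pools.
pool_sizes = (5, 10, 50, 100)
scores = [coverage_score(3, n) for n in pool_sizes]
# scores == [0.6, 0.3, 0.06, 0.03]
```

Any fixed threshold is eventually missed by such a pattern, even though its absolute support never changed.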
Example Seed Data for Co-Training
Extraction Patterns
    <x> “in collaboration with” <y>
    <x> “my advisor is” <y>
Relations
    CollaboratesWith( Tom, Roni )
    CollaboratesWith( William, Ken )
Extraction Pattern Examples
Query: “my advisor is” site:cs.cmu.edu
Extracted Relations
First argument           Second argument          Score
"Miroslav Dudik"         "Rob Schapire"           0.3333333333333333
"Personal"               "Prof. Sanjeev"          0.3333333333333333
"Research"               "Professors Jonathan"    0.3333333333333333
"Sharon Whiteman"        "Mary Vernon"            0.3333333333333333
"Sudhakar"               "Prof. Edward"           0.3333333333333333
"Ting"                   "Professor Andrew"       0.3333333333333333
"Adriana Karagiozova"    "Moses Charikar"         0.6666666666666666
"Akash Lal"              "Tom Reps"               0.6666666666666666
"Amy Karlson"            "Benjamin B. Bederson"   0.6666666666666666
"Aravind Kalaiah"        "Dr. Amitabh"            0.6666666666666666
"Chi Zhang"              "Randolph Y. Wang"       0.6666666666666666
"Gaurav Shah"            "Matt Blaze"             0.6666666666666666
"Jennifer Beckmann"      "Jeff Naughton"          0.6666666666666666
"Lucja Kot"              "Dexter Kozen"           0.6666666666666666
"Mark Sandler"           "Jon Kleinberg"          0.6666666666666666
"Nina"                   "Prof. Avrim"            0.6666666666666666
"Patrick Ng"             "Uri Keich"              0.6666666666666666
"Pavlos Papageorgiou"    "Prof. Michael"          0.6666666666666666
"Pratyusa Manadhata"     "Jeannette M. Wing"      0.6666666666666666
"Sudipta"                "Marc Pollefeys"         0.6666666666666666
"Sven Koenig"            "Reid Simmons"           0.6666666666666666
"Yan Liu"                "Jaime Carbonell"        0.6666666666666666
Learned Patterns
“My advisor is” <y>    0.6
Near misses (hard to assess confidence):
    “I work with”                 0.4
    “Together with”               0.0667
    “Languages Research under”    0.0333
    “Computer Science advisor”    0.0333
    “Languages under Prof”        0.0
    “Study under Prof”            0.0
    “currently working with”      0.0
    “user studies with”           0.0
Bad Patterns
From citations:
    “Amit Agarwal and”, etc. (other authors)
    “L1 Norm with” (part of a title)
From professional titles: “Professor”, “Professor of Mathematics”, etc.
From course web pages: “courses cs686 2003sp”
Other: “be addressed to”
Software and Datasets Used
Indri retrieval engine
Locally crawled collection of pages from CS departments of universities
    Using a local collection greatly improved the development experience by shortening the debugging cycle, and relieved us from the Google API query quota
    No features of Indri that Google does not support were used, so that Google could be substituted for Indri in the future
Future Work
Different methods of combining confidence scores, including weighting of votes during scoring
Different confidence metrics, e.g., PMI
Additional useful sources of information:
    bibliographies, anchor text and link structure: advisor-advisee cross-refs, department or lab organization
Better argument type checking
Tuning of the threshold
Termination condition
Integration with citations group
Integrate with Google
Make code run faster