Relation Extraction for Academic Collaboration

10-709 Project Presentation

Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang

February 16, 2006

Academic Collaboration

When two academic researchers work together...
  on a proposal
  by co-authoring a paper
  by co-chairing a committee
  in the same project or research group

This is evidence of Academic Collaboration

Binary, symmetric relation

Arguments are of type <person>

Motivation

Why might we be interested in extracting Academic Collaboration relations?
  Social networking
  Explore the transitivity of the relation
  Proof of concept for extending relation extraction machinery to other types of relations

Architectural Overview

[Figure: architecture diagram with the components IR, RelationExtractor, QueryFormulator (shown twice), PatternExtractor, PatternBank, and RelationBank]

Co-training Algorithm

Do until termination condition is reached:
  For each pattern in the pattern bank:
    Generate an IR query and send it to the IR engine, getting back a set of documents
    For each document in the set, extract relations
  Score all relations (new and old)
  Remove relations below threshold

Co-training Algorithm II

  For each relation in the relation bank:
    Generate an IR query and send it to the IR engine, getting back a set of documents
    For each document in the set, extract context strings for patterns
  Score all patterns (new and old)
  Remove patterns below threshold
Loop
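
The two halves of the loop can be sketched in Python roughly as follows. This is a minimal sketch, not the project's code: the callables `search`, `extract_relations`, `extract_patterns`, `score_relations`, and `score_patterns` are hypothetical stand-ins for the IR engine, the extractors, and the scorers, the quoted query strings are an assumed query format, and the fixed `max_iterations` is an assumed substitute for the termination condition, which the slides leave open.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Relation = Tuple[str, str]  # (x, y) in CollaboratesWith(x, y)


def cotrain(
    patterns: Iterable[str],
    relations: Iterable[Relation],
    search: Callable[[str], List[str]],
    extract_relations: Callable[[str, str], Iterable[Relation]],
    extract_patterns: Callable[[str, Relation], Iterable[str]],
    score_relations: Callable[[Set[Relation], Set[str]], Dict[Relation, float]],
    score_patterns: Callable[[Set[str], Set[Relation]], Dict[str, float]],
    threshold: float,
    max_iterations: int = 3,  # assumption: stands in for the termination condition
) -> Tuple[Set[str], Set[Relation]]:
    patterns, relations = set(patterns), set(relations)
    for _ in range(max_iterations):
        # First half: patterns -> queries -> documents -> relations.
        for pattern in sorted(patterns):
            for doc in search(f'"{pattern}"'):
                relations.update(extract_relations(doc, pattern))
        rel_scores = score_relations(relations, patterns)
        relations = {r for r in relations if rel_scores[r] >= threshold}

        # Second half: relations -> queries -> documents -> patterns.
        for x, y in sorted(relations):
            for doc in search(f'"{x}" "{y}"'):
                patterns.update(extract_patterns(doc, (x, y)))
        pat_scores = score_patterns(patterns, relations)
        patterns = {p for p in patterns if pat_scores[p] >= threshold}
    return patterns, relations
```

Passing the components in as callables keeps the loop independent of the retrieval engine, so a local index and a web search API could be swapped without touching it.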

Extraction Pattern Formalism

From the proposal: “left” <x> “between” <y> “right”
  Arguments extracted with respect to context

Current status: “context string” <y>
  <x> argument extracted from the page title

Extracts the relation CollaboratesWith( <x>, <y> )
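
As an illustration of the current formalism, the sketch below applies one context string to a page: <y> is taken from the text right after the context string, and <x> from the page title. The function name `apply_pattern`, the name-shaped regular expression, and the assumption that the page title is exactly the owner's name are all hypothetical simplifications, not the project's actual extractor.

```python
import re
from typing import Optional, Tuple

# Hypothetical name shape: up to three tokens, each a capitalized word or an initial.
_TOKEN = r"(?:[A-Z][a-z]+|[A-Z]\.)"
_NAME = rf"{_TOKEN}(?:\s+{_TOKEN}){{0,2}}"


def apply_pattern(context: str, page_title: str, page_text: str) -> Optional[Tuple[str, str]]:
    """Return (x, y) for CollaboratesWith(x, y), or None if the pattern does not fire."""
    # The context string is matched case-insensitively; the name is case-sensitive.
    regex = rf"(?i:{re.escape(context)})\s+({_NAME})"
    match = re.search(regex, page_text)
    if match is None:
        return None
    x = page_title.strip()  # assumption: the title is just the page owner's name
    y = match.group(1)
    return (x, y) if x else None


example = apply_pattern(
    "my advisor is",
    "Miroslav Dudik",
    "My advisor is Rob Schapire and I study learning theory.",
)
```

A real page title usually carries extra words (for example a "home page" suffix), so the <x> side would need its own cleanup step in practice.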

Detecting Argument Types

CollaboratesWith( <x>, <y> )
  <x> and <y> must be of type <person>

Essential to weed out low-quality relations produced by noisy patterns such as “in collaboration with”

Heuristics currently encoded as regular expressions
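
The slides do not list the actual expressions, so the following is only a guess at what such a regular-expression type check might look like. The title list, the word and initial shapes, the `NOT_A_NAME` stop list, and the rule that a person needs at least a first and a last name are all assumptions made for illustration.

```python
import re

# All of these heuristics are hypothetical; the project's real expressions are not shown.
_TITLE = re.compile(r"(?:Prof\.|Professor|Dr\.)")
_WORD = re.compile(r"[A-Z][a-z]+(?:-[A-Z][a-z]+)?")
_INITIAL = re.compile(r"[A-Z]\.")
NOT_A_NAME = {"Personal", "Research", "Professor", "Professors", "Department"}


def looks_like_person(argument: str) -> bool:
    """Heuristically decide whether an extracted argument is of type <person>."""
    tokens = argument.split()
    # Drop one leading academic title, if present.
    if tokens and _TITLE.fullmatch(tokens[0]):
        tokens = tokens[1:]
    # Require at least a first and a last name, none of them a known non-name word.
    if len(tokens) < 2 or any(tok in NOT_A_NAME for tok in tokens):
        return False
    first, *middle, last = tokens
    return (
        bool(_WORD.fullmatch(first))
        and bool(_WORD.fullmatch(last))
        and all(_WORD.fullmatch(t) or _INITIAL.fullmatch(t) for t in middle)
    )
```

Under these assumptions a full name such as "Jeannette M. Wing" passes, while single-word or truncated arguments such as "Personal" or "Prof. Sanjeev" are rejected.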

Measuring Confidence with Coverage

Confidence for an Extraction Pattern
  Intuitively, relations “vote” for patterns
  Query each relation, try to extract the pattern
  Score = proportion of successful relations

Confidence for a Relation
  Query each pattern, try to extract the relation
  Score = proportion of successful patterns
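
Both scores are simple proportions, which a short sketch makes concrete. The predicate `fires(pattern, relation)` is a hypothetical stand-in for "querying one side and successfully extracting the other", and the toy evidence set below is invented purely to exercise the functions.

```python
from typing import Callable, Sequence, Tuple

Relation = Tuple[str, str]
Fires = Callable[[str, Relation], bool]  # hypothetical: did pattern and relation co-occur?


def pattern_confidence(pattern: str, relations: Sequence[Relation], fires: Fires) -> float:
    """Proportion of relations in the bank that 'vote' for this pattern."""
    if not relations:
        return 0.0
    return sum(1 for r in relations if fires(pattern, r)) / len(relations)


def relation_confidence(relation: Relation, patterns: Sequence[str], fires: Fires) -> float:
    """Proportion of patterns in the bank that extract this relation."""
    if not patterns:
        return 0.0
    return sum(1 for p in patterns if fires(p, relation)) / len(patterns)


# Toy evidence, invented for illustration only.
evidence = {
    ("my advisor is", ("Tom", "Roni")),
    ("in collaboration with", ("Tom", "Roni")),
    ("in collaboration with", ("William", "Ken")),
}
fires = lambda p, r: (p, r) in evidence
patterns = ["my advisor is", "in collaboration with", "I work with"]
relations = [("Tom", "Roni"), ("William", "Ken")]

rel_score = relation_confidence(("Tom", "Roni"), patterns, fires)  # 2 of 3 patterns
pat_score = pattern_confidence("my advisor is", relations, fires)  # 1 of 2 relations
```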

Issues with Coverage as Confidence

Seed relations and patterns must co-occur

Very little tolerance for “new” information

It is difficult for a new pattern that broadens the scope of the relations extracted to gain enough confidence to surpass the threshold

Scores tend to zero as pools grow

However, ad hoc methods of combining confidence from one iteration to the next introduce a new problem: there is no way to oust bad relations or patterns once extracted
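
A small arithmetic sketch shows why coverage scores shrink. It assumes a relation that is supported by the same two patterns throughout while the pattern pool grows; the pool sizes are made up for illustration.

```python
def coverage(successes: int, pool_size: int) -> float:
    """Coverage score: successful voters divided by the size of the voting pool."""
    return successes / pool_size if pool_size else 0.0


# Same two supporting patterns, growing pattern pool (sizes are hypothetical).
scores = [round(coverage(2, n), 4) for n in (3, 10, 30)]
```

With the support held fixed, the score is inversely proportional to the pool size, so any fixed threshold is eventually missed even though the evidence for the relation has not changed.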

Example Seed Data for Co-Training

Extraction Patterns
  <x> “in collaboration with” <y>
  <x> “my advisor is” <y>

Relations
  CollaboratesWith( Tom, Roni )
  CollaboratesWith( William, Ken )

Extraction Pattern Examples

Query: “my advisor is” site:cs.cmu.edu

Extracted Relations

<x>                     <y>                      Score
"Miroslav Dudik"        "Rob Schapire"           0.3333333333333333
"Personal"              "Prof. Sanjeev"          0.3333333333333333
"Research"              "Professors Jonathan"    0.3333333333333333
"Sharon Whiteman"       "Mary Vernon"            0.3333333333333333
"Sudhakar"              "Prof. Edward"           0.3333333333333333
"Ting"                  "Professor Andrew"       0.3333333333333333
"Adriana Karagiozova"   "Moses Charikar"         0.6666666666666666
"Akash Lal"             "Tom Reps"               0.6666666666666666
"Amy Karlson"           "Benjamin B. Bederson"   0.6666666666666666
"Aravind Kalaiah"       "Dr. Amitabh"            0.6666666666666666
"Chi Zhang"             "Randolph Y. Wang"       0.6666666666666666
"Gaurav Shah"           "Matt Blaze"             0.6666666666666666
"Jennifer Beckmann"     "Jeff Naughton"          0.6666666666666666
"Lucja Kot"             "Dexter Kozen"           0.6666666666666666
"Mark Sandler"          "Jon Kleinberg"          0.6666666666666666
"Nina"                  "Prof. Avrim"            0.6666666666666666
"Patrick Ng"            "Uri Keich"              0.6666666666666666
"Pavlos Papageorgiou"   "Prof. Michael"          0.6666666666666666
"Pratyusa Manadhata"    "Jeannette M. Wing"      0.6666666666666666
"Sudipta"               "Marc Pollefeys"         0.6666666666666666
"Sven Koenig"           "Reid Simmons"           0.6666666666666666
"Yan Liu"               "Jaime Carbonell"        0.6666666666666666

Learned Patterns

“My advisor is” <y>   0.6

Near misses (hard to assess confidence):
  “I work with”                0.4
  “Together with”              0.0667
  “Languages Research under”   0.0333
  “Computer Science advisor”   0.0333
  “Languages under Prof”       0.0
  “Study under Prof”           0.0
  “currently working with”     0.0
  “user studies with”          0.0

Bad Patterns

From citations:
  “Amit Agarwal and”, etc. (other authors)
  “L1 Norm with” (part of a title)

From professional titles: “Professor”, “Professor of Mathematics”, etc.

From course web pages: “courses cs686 2003sp”

Other: “be addressed to”

Software and Datasets Used

Indri retrieval engine

Locally crawled collection of pages from CS departments of universities

Using a local collection greatly improved the development experience by shortening the debugging cycle, and freed us from the Google API query quota

No Indri features that Google does not support were used, so that Google could be substituted for Indri in the future

Future Work

Different methods of combining confidence scores, including weighting of votes during scoring

Different confidence metrics, e.g., PMI

Additional useful sources of information:
  bibliographies
  anchor text and link structure: advisor-advisee cross-refs, department or lab organization

Better argument type checking

Tuning of the threshold

Termination condition

Integration with citations group

Integrate with Google

Make code run faster
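
For the PMI idea, one possible formulation (an assumption, since the slides only name the metric) estimates probabilities from retrieval hit counts: how often a pattern and a relation occur together, compared with how often each occurs on its own. The function and parameter names below are hypothetical.

```python
import math


def pmi(joint_hits: int, pattern_hits: int, relation_hits: int, total_docs: int) -> float:
    """Pointwise mutual information (base 2) of a pattern and a relation.

    PMI = log2( P(pattern, relation) / (P(pattern) * P(relation)) ),
    with each probability estimated as hits / total_docs.
    """
    if min(joint_hits, pattern_hits, relation_hits, total_docs) <= 0:
        return float("-inf")  # no co-occurrence evidence at all
    return math.log2(joint_hits * total_docs / (pattern_hits * relation_hits))


# Hypothetical counts: 5 joint hits, 50 pattern hits, 10 relation hits in 1000 documents.
example_pmi = pmi(5, 50, 10, 1000)
```

Unlike coverage, this score does not fall simply because the pattern or relation bank grows; it depends only on how strongly the two co-occur relative to chance.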