Bo Lin Kevin Dela Rosa Rushin Shah. As part of our research, we are working on a cross- document...

Attribute & Relationship Extraction for Co-reference Chains

Project Status Report, 10-707

Bo Lin, Kevin Dela Rosa, Rushin Shah

Background – Co-reference Resolution

As part of our research, we are working on a cross-document co-reference resolution system.

Co-reference Resolution: Extract all noun phrases (NPs) from a document (names, descriptions, pronouns), and cluster them according to the real-world entity they describe. Each such cluster is called a chain.

Within-doc: Cluster NPs from a single document

Cross-doc: Cluster NPs from different documents

Background – Architecture of our CDC system

Run a within-document co-reference (WDC) system on a document and extract chains corresponding to real-world entities. For each chain, track all sentences from which its mentions are obtained.

Features over pairs of such chains:
◦ SoftTFIDF similarity between names
◦ All-words TFIDF cosine similarity (over sentences)
◦ Named Entity (NE) TFIDF cosine similarity (over sentences)
◦ NE SoftTFIDF cosine similarity (over sentences)
◦ Semantic similarity between the NPs of each chain

Train an SVM that classifies pairs of chains as co-referent or not.

Use the SVM to cluster all chains from all documents in the corpus.

Store this clustering in a database, where each entity is a list of chains.
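As a rough illustration of the all-words TFIDF cosine feature over a pair of chains, here is a minimal sketch. The smoothed IDF formula, the function names and the toy chains are assumptions made for the example, not the system's actual code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn token lists (one per chain) into sparse TF-IDF dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): log(n / df) + 1 keeps every weight positive.
        vectors.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical chains, each represented by the words of its sentences.
chains = [
    "john smith is a senator from ohio".split(),
    "senator smith voted in ohio".split(),
    "the guitarist released an album".split(),
]
vecs = tfidf_vectors(chains)
pair_feature = cosine(vecs[0], vecs[1])  # one feature value for the chain pair (0, 1)
```

The NE-only variant would apply the same functions after filtering each chain's sentences down to named-entity tokens.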

Goals

Augment our CDC system with the following:

Attribute Extraction
◦ For each chain, extract attributes such as gender, occupation, nationality, and birthdate, if they exist
◦ Use attributes to enable the SVM to do better co-reference

Relationship Extraction
◦ For each pair of chains, extract a relationship (e.g. part of, role, etc.), if it exists
◦ Use relationships for better visualization of clusters

Proposed Techniques – Attribute Extraction

Patterns: Take seed examples of (entity, attribute) and learn extraction patterns

Position: Use typical document positions of attributes

Transitive: Use attributes of neighboring entities

Latent: Use document-level topic models to infer attributes

References: Ravichandran & Hovy ’02, Garera & Yarowsky ’09

Attribute Extraction: Context/Pattern-Based Model

Idea: Attribute values tend to appear with linguistic clues around them, i.e. the model can be defined as a probabilistic language model describing the chance of a word being an attribute value given its context. This idea is essentially the same as in KnowItAll or Brin ’98.

Example:
◦ Marked Text: “<name> (born <dob>) is a former American <occupation>, active <occupation>…”
◦ Generated Context: “(born X)” for X=<dob>, “a former American X” for X=<occupation> …
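The step from marked text to generated contexts could look roughly like the sketch below. The tokenizer, the window sizes and the function name `generate_contexts` are assumptions for illustration. The probabilistic scoring of contexts is not shown.

```python
import re

# Tokens are slot tags like <dob>, words, or single punctuation marks.
TOKEN = re.compile(r"<\w+>|\w+|[^\w\s]")

def generate_contexts(marked_text, attribute, left=3, right=0):
    """Return one context string per slot of `attribute`, with the slot replaced by X.

    `left` and `right` are the number of neighbouring tokens kept on each side
    (hypothetical defaults).
    """
    tokens = TOKEN.findall(marked_text)
    slot = f"<{attribute}>"
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == slot:
            before = tokens[max(0, i - left):i]
            after = tokens[i + 1:i + 1 + right]
            contexts.append(" ".join(before + ["X"] + after))
    return contexts

marked = "<name> (born <dob>) is a former American <occupation>, active <occupation>"
dob_contexts = generate_contexts(marked, "dob", left=2, right=1)
occupation_contexts = generate_contexts(marked, "occupation")
```

Here `dob_contexts` is `["( born X )"]` and the first occupation context is `"a former American X"`, matching the slide's example up to tokenization spacing.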

Attribute Extraction: Position-Based Model

Idea: Certain biographical attributes tend to appear in characteristic positions, often near the top of the article. The relative position/rank between attributes can be helpful information as well.

Example: “Birthdate” is often the first date in a biography text, or at least near the beginning of the article, and “Deathdate” almost always occurs somewhere after it.
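A very small sketch of the position heuristic, assuming dates are reduced to four-digit years: the first year in the text is taken as the birth year, and the next later year as a death-year candidate. The regular expression, the function name and the sample sentence are made up for illustration.

```python
import re

# Four-digit years from 1000 to 2099 (a simplifying assumption about date format).
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def guess_birth_and_death(text):
    """Absolute position: first year = birth. Relative position: next later year = death."""
    years = [int(m.group()) for m in YEAR.finditer(text)]
    if not years:
        return None, None
    birth = years[0]
    death = next((y for y in years[1:] if y > birth), None)
    return birth, death

# Hypothetical biography text.
bio = "Jane Doe (born 1901 in Springfield) was a painter. She died in 1980."
birth_year, death_year = guess_birth_and_death(bio)
```

For a living person this sketch would wrongly report the next year mentioned as a death year, so a real model would need to score candidates rather than take the first match.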

Attribute Extraction: Transitivity-Based Models

Idea: A named entity is more likely to be mentioned together with other entities that have similar attribute values (the most applicable attributes seem to be “occupation” and “religion”).

Example: “Michael Jordan” (the player) is mostly mentioned together with “Wilt Chamberlain”, “Dennis Rodman” and fellow players.
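One simple way to use co-mentions is a weighted vote over neighbors whose attribute is already known. The sketch below assumes that voting scheme. The co-occurrence counts and the third neighbor are invented for the example.

```python
from collections import Counter

def infer_attribute(entity, cooccurrences, known):
    """Pick the attribute value most common among co-mentioned entities.

    cooccurrences: {entity: {neighbor: co-mention count}}
    known:         {entity: attribute value}
    """
    votes = Counter()
    for neighbor, count in cooccurrences.get(entity, {}).items():
        if neighbor in known:
            votes[known[neighbor]] += count
    return votes.most_common(1)[0][0] if votes else None

# Illustrative counts only; "Some Professor" is a hypothetical neighbor.
cooccurrences = {
    "Michael Jordan": {"Wilt Chamberlain": 3, "Dennis Rodman": 5, "Some Professor": 1},
}
known_occupations = {
    "Wilt Chamberlain": "basketball player",
    "Dennis Rodman": "basketball player",
    "Some Professor": "professor",
}
guess = infer_attribute("Michael Jordan", cooccurrences, known_occupations)
```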

Attribute Extraction: Latent Models

Idea: Use “latent wide-document-context” models to detect attributes that may not have been mentioned directly in the article

Example: Words such as “songs, album, recorded” can collectively indicate an occupation of singer or musician
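To make the intuition concrete, here is a deliberately simplified stand-in: a cue-word count per occupation. This is not a topic model. In the latent approach the cue lists would be learned rather than hand-written, and the lists below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical hand-written cue lists; a latent model would learn these from data.
CUES = {
    "musician": {"songs", "album", "recorded", "band"},
    "politician": {"elected", "senate", "campaign", "vote"},
}

def infer_occupation(text, cues=CUES):
    """Score each occupation by how many of its cue words occur in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {occ: sum(words[w] for w in vocab) for occ, vocab in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

occupation = infer_occupation("She recorded her first album in 1990; its songs sold well.")
```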

Proposed Techniques – Relation Extraction

Three options:
◦ Extract a variety of features and train YFCL
◦ Define Kernels to measure similarity between instances, plug into SVM
◦ Semi-supervised approach: start with seed examples, learn patterns, iterate, and grow the KB (e.g. KnowItAll, TextRunner)

Ideally we would prefer semi-supervised, especially since it allows open-domain IE, but it is very time- and labor-intensive

Kernel approach is more elegant and works better than YFCL

Therefore, we proposed to use Kernels

Proposed Datasets

Attributes: NNDB seed pairs, Wikipedia pages

Relations: ACE Phase 2 top-level relations

For our CDC system:
◦ John Smith corpus
◦ WePS (person name disambiguation) corpora

Progress Report – Attribute Extraction

Data Preparation
◦ Collected attribute extraction data set: currently our seed pairs have occupation, date-of-birth, and date-of-death values; we plan on collecting birthplace, gender, and nationality
◦ Implemented the program to parse simple Wikipedia pages

Framework
◦ Implemented a pipeline process which consists of: loading the list of names and pages; parsing Wikipedia pages; sentence segmentation; POS tagging / NE tagging (if needed); modeling and extraction
◦ Provides the capability to support different models for different attributes

Implementations
◦ Implemented context/pattern-based model, currently focusing on occupations
◦ Implemented position-based models (absolute & relative) for occupation and date-of-birth attribute extraction
◦ In the process of implementing transitivity-based model for occupation extraction
◦ In the process of implementing latent wide-document-context models for extracting implicit attributes


Issues
◦ Actual implementation raises a lot of problems not addressed in the paper, such as whether POS tagging should be included in the model; the pipeline framework is intended to solve this issue.

Plan
◦ Finish the implementations and compare with two baseline models for each attribute
◦ Extend code to work on chains and consolidate the attributes
◦ Use extracted attributes as features in the CDC system’s SVM classifier to help co-reference resolution


Progress Report – Relation Extraction

Evaluated different kinds of Kernels (subsequence, dependency trees, etc.).

Chose subsequence kernels as these perform well and are robust to ill-formed documents

As a template, decided to follow the Bunescu and Mooney paper discussed in class

Observation: Although the idea in the paper is easy to understand, implementation is quite complex

For SVM, decided to use LibSVM. Bunescu has a modified version that can take custom Kernels, but it is based on an older release

Made updates to LibSVM to reconcile the different versions:
◦ Now accepts custom/pre-computed Kernels
◦ Allows richer representation of instances than numerical values (e.g. sentence fragments, named entities, etc.)
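For reference, stock LIBSVM also reads pre-computed kernels (kernel type `-t 4`): each line carries the label, then `0:<instance id>`, then one `j:K(x, x_j)` entry per training instance. The helper below is a hypothetical sketch of writing a Gram matrix in that format. It does not reproduce the modified LibSVM described here.

```python
def to_libsvm_precomputed(labels, gram):
    """Format a training Gram matrix as LIBSVM pre-computed kernel lines.

    labels: list of class labels, one per instance
    gram:   square matrix, gram[i][j] = K(x_i, x_j)
    """
    lines = []
    for i, (label, row) in enumerate(zip(labels, gram), start=1):
        # Serial numbers and kernel columns are both 1-based.
        cells = " ".join(f"{j}:{value:g}" for j, value in enumerate(row, start=1))
        lines.append(f"{label} 0:{i} {cells}")
    return lines

# Toy kernel values for two instances.
lines = to_libsvm_precomputed([1, -1], [[1.0, 0.5], [0.5, 1.0]])
```

With this layout any kernel over sentence fragments can be computed outside the SVM, and only its values are passed in.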


As we know, the hardest part of coding Machine Learning applications isn’t the classifier (plenty of libraries), but feature extraction

Wrote feature extraction code that extracts sentences from XML and produces instances for SVM, where each instance consists of:
◦ The 2 Named Entities of interest
◦ Sentence fragments before, between and after the NEs

If a sentence contains N > 2 NEs, make C(N, 2) copies, one per pair of NEs
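The instance generation for a tokenized sentence could be sketched as follows, with one instance per NE pair. The span representation (token offsets, end exclusive) and the dictionary keys are assumptions for the example. XML parsing and NE tagging are taken as already done.

```python
from itertools import combinations

def make_instances(tokens, entity_spans):
    """Build one instance per pair of named entities in a sentence.

    tokens:       list of words in the sentence
    entity_spans: non-overlapping (start, end) token offsets, end exclusive
    """
    instances = []
    for (s1, e1), (s2, e2) in combinations(sorted(entity_spans), 2):
        instances.append({
            "e1": tokens[s1:e1],
            "e2": tokens[s2:e2],
            "before": tokens[:s1],      # fragment before the first NE
            "between": tokens[e1:s2],   # fragment between the two NEs
            "after": tokens[e2:],       # fragment after the second NE
        })
    return instances

# Hypothetical sentence with four NEs: Alice, Acme, Bob, Paris.
tokens = "Alice of Acme met Bob in Paris".split()
instances = make_instances(tokens, [(0, 1), (2, 3), (4, 5), (6, 7)])
```

Four NEs give C(4, 2) = 6 instances, each with its own before/between/after fragments.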


Implemented general subsequence Kernel algorithm

Currently working on adapting this for the specific case of relationship Kernels (as explained in Bunescu & Mooney)

Once done, will extend this code (which works on sentences) to chains, so it can be plugged into the CDC system
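For readers unfamiliar with the general subsequence kernel, below is a minimal sketch of the standard gap-weighted recursion from the string-kernel literature, applied to token sequences. It is not the relation-specific kernel of Bunescu & Mooney, and the decay value `lam=0.5` is an arbitrary choice for illustration.

```python
def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel: counts common subsequences of length n,
    each weighted by lam raised to the total span it covers in s and in t."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary kernel K'_i on prefixes s[:a] and t[:b].
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running K''_i(s[:a], t[:b]) as b grows
            for b in range(i, lt + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Close each subsequence with a final matching pair of symbols.
    total = 0.0
    for a in range(1, ls + 1):
        for b in range(1, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total

# "a b" is contiguous in the first sequence (span 2) and gapped in the second (span 3).
k = subsequence_kernel("a b".split(), "a x b".split(), n=2, lam=0.5)
```

In this toy case the only common subsequence of length 2 is “a b”, so the value is lam^(2+3) = 0.5^5 = 0.03125. The cost is O(n·|s|·|t|), which is manageable at sentence level.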


Questions?
