Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2007.06.21


Page 1:

IJCAI 2003 Workshop on Learning Statistical Models from Relational Data

First-Order Probabilistic Models for Information Extraction

Bhaskara Marthi, Brian Milch, Stuart Russell

Computer Science Div., University of California

NIPS 15th, 2003

Identity Uncertainty and Citation Matching

Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser

Computer Science Div., University of California

Page 2:

Outline

Introduction
Related works
Models for the bibliography domain
Experiment on Model A
Desiderata for a FOPL
Conclusions

Page 3:

Introduction – Citation Matching Problem

Citation matching: the problem of deciding which citations correspond to the same publication

Difficulties
  Different citation styles
  An imperfect copy of the book’s title
  Different ways to refer to an object (identity)
  Ambiguity
    “Wauchope, K. Eucalyptus: Integrating Natural language Input with a Graphical User Interface”
    Author: “Wauchope, K. Eucalyptus” or “Wauchope, K.”?

Tasks
  Parsing
  Disambiguation
  Matching
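
The parsing ambiguity in the Wauchope example can be made concrete with a minimal sketch. It assumes, purely for illustration, that the author/title boundary can only fall after a period or a colon; the function name `candidate_splits` is hypothetical and does not come from the papers.

```python
import re

def candidate_splits(citation):
    """Enumerate (author, title) candidates by cutting after each '.' or ':'.

    Assumption for illustration: segment boundaries only occur right after
    these punctuation marks. A real parser would score each candidate.
    """
    splits = []
    for match in re.finditer(r"(?<=[.:])\s+", citation):
        # Keep a trailing period with the author, but drop a trailing colon.
        author = citation[:match.start()].rstrip(":").strip()
        title = citation[match.end():].strip()
        if author and title:
            splits.append((author, title))
    return splits

citation = ("Wauchope, K. Eucalyptus: Integrating Natural language Input "
            "with a Graphical User Interface")
splits = candidate_splits(citation)
for author, title in splits:
    print(repr(author), "|", repr(title))
```

Both readings of the author field come out as candidates, which is why parsing and disambiguation have to be handled together with matching.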

Page 4:

Introduction – Citation Matching Problem: Examples

Journal of Artificial Intelligence Research, or Artificial Intelligence Journal?

Page 5:

Introduction – First-Order Probabilistic Models

Propositional
  Formula
    Logic: A B
    Probabilistic model: P(j m a b e), Bayesian network
  Inference / algorithms
    Logic: resolution, model checking, forward chaining, DPLL, WalkSAT, …
    Probabilistic model: Bayes’ rule, summing out, smoothing, prediction, approximation (likelihood, MCMC, …), …

First-order
  Formula
    Logic: ∀x King(x) ∧ Greedy(x) ⇒ Evil(x)
    Probabilistic model: ∀x King(x) ∧ Greedy(x) ⇒ Evil(x) = 0.8766
  Inference / algorithms
    Logic: unification, resolution, …
    Probabilistic model: learning, approximation, …
  System / languages
    Logic: Prolog, rule engine (JBoss), …
    Probabilistic model: FOPL, RPM
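
As a rough illustration of the propositional probabilistic cell, the sketch below evaluates one joint probability in a small Bayesian network by the chain rule. It assumes the slide's P(j m a b e) refers to the burglary-alarm network from Russell and Norvig's textbook; the numbers are that textbook's example values, not anything given on the slides.

```python
# Conditional probability tables for the textbook burglary-alarm network
# (assumed here; the slide itself gives no numbers).
P_B = 0.001                      # P(burglary)
P_E = 0.002                      # P(earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(alarm | b, e)
P_J = {True: 0.90, False: 0.05}  # P(john calls | alarm)
P_M = {True: 0.70, False: 0.01}  # P(mary calls | alarm)

def bernoulli(p_true, value):
    """Probability that a Boolean variable takes `value`."""
    return p_true if value else 1.0 - p_true

def joint(j, m, a, b, e):
    """Chain-rule factorisation along the network's edges."""
    return (bernoulli(P_B, b) * bernoulli(P_E, e)
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_J[a], j) * bernoulli(P_M[a], m))

# Both neighbours call, the alarm rang, but there was no burglary or earthquake.
p = joint(j=True, m=True, a=True, b=False, e=False)
print(f"{p:.6f}")
```

A first-order probabilistic language aims to state such dependencies once, over classes of objects, instead of once per ground variable.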

Page 6:

Introduction – Result of Model B

Page 7:

Related Works

IE
  The Message Understanding Conferences [DARPA, 1998]
Bayesian modeling
  Finding stochastically repeated patterns (motifs) in DNA sequences [Xing et al., 2003]
  Robot localization [Anguelov et al., 2002]
FOPL/RPM (Relational Probabilistic Model)
  A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.

Page 8:

Models for the Bibliography Domain – Model A

[Pasula et al. 2003]

Page 9:

Models for the Bibliography Domain – Model A (Cont.)

Suggests a declarative approach to identity uncertainty using a formal language

Algorithm steps
  Generate objects/instances
  Parse and fill attributes
  Inference (approximation, MCMC)
  Cluster the identity (publication)
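
The inference and clustering steps can be sketched as a small Metropolis sampler over cluster labels. Everything here is an assumption made for illustration: the scoring function `log_score` is a toy word-overlap score rather than the model's actual probability, and `sample_clustering` is a hypothetical name.

```python
import math
import random

def log_score(assignment, citations):
    """Toy unnormalised log-probability of a clustering (an assumption, not
    the papers' model): reward shared words inside a cluster and charge a
    fixed cost per cluster."""
    clusters = {}
    for text, label in zip(citations, assignment):
        clusters.setdefault(label, []).append(set(text.lower().split()))
    score = -1.0 * len(clusters)
    for members in clusters.values():
        for i in range(len(members)):
            for k in range(i + 1, len(members)):
                # Pairs sharing fewer than two words lower the score.
                score += len(members[i] & members[k]) - 2.0
    return score

def sample_clustering(citations, steps=2000, seed=0):
    rng = random.Random(seed)
    n = len(citations)
    state = list(range(n))          # start: every citation is its own paper
    current = log_score(state, citations)
    best, best_score = state[:], current
    for _ in range(steps):
        proposal = state[:]
        # Symmetric proposal: give one random citation a random label.
        proposal[rng.randrange(n)] = rng.randrange(n)
        new = log_score(proposal, citations)
        if new >= current or rng.random() < math.exp(new - current):
            state, current = proposal, new
            if current > best_score:
                best, best_score = state[:], current
    return best, best_score

citations = [
    "Pasula H. Identity uncertainty and citation matching",
    "H. Pasula et al. Identity Uncertainty and Citation Matching, NIPS",
    "Pfeffer A. Probabilistic reasoning for complex systems",
]
best, best_score = sample_clustering(citations)
print(best, best_score)
```

Citations that end up with the same label are treated as references to the same publication.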

Page 10:

Models for the Bibliography Domain – Model A (Cont.)

Attributes using unconditional probability
  Learn several bigram models
    Letter-based models of first names, surnames, and title words, using the following resources
      the 2000 Census data on US names
      a large A.I. BibTeX bibliography
      a hand-parsed collection of 500 citations

Attributes using conditional probability
  Using noisy channels for some attributes
    The corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames
    The parameters of the corruption models are learnt online, using stochastic EM
  Citation.parse
    It keeps track of the segmentation of Citation.text
    An author segment, a title segment, and three filler segments (one before, one after, and one in between)
  Citation.text
    Is constrained by Citation.parse, Paper.pubType, …
    These models were learned using our pre-segmented file.
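
A letter-based bigram model of the kind mentioned above can be sketched in a few lines. The add-alpha smoothing, the `^` and `$` boundary markers, and the tiny training list are all assumptions for illustration; the slides do not say how the models were smoothed, and the real models were trained on the resources listed.

```python
import math
from collections import Counter

def train_letter_bigrams(words, alpha=1.0):
    """Fit a letter bigram model with add-alpha smoothing.

    '^' and '$' mark word start and end. Returns a function that maps a
    word to its log-probability under the fitted model.
    """
    pair_counts, context_counts = Counter(), Counter()
    alphabet = {"$"}
    for word in words:
        chars = ["^"] + list(word.lower()) + ["$"]
        alphabet.update(chars[1:])
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab = len(alphabet)

    def log_prob(word):
        chars = ["^"] + list(word.lower()) + ["$"]
        return sum(
            math.log((pair_counts[(p, c)] + alpha)
                     / (context_counts[p] + alpha * vocab))
            for p, c in zip(chars, chars[1:]))

    return log_prob

# Toy training data (hypothetical stand-in for the census name list).
surname_lp = train_letter_bigrams(
    ["russell", "milch", "marthi", "pasula", "shpitser"])
print(surname_lp("russel"), surname_lp("xqzjv"))
```

A plausible misspelling such as "russel" scores higher than a random letter string, which is the property the prior over names needs.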

Page 11:

Models for the Bibliography Domain – Model B

Citation: Publication, TitleAsCited, AuthorsAsCited, Text, Parse

Collection: Name, Type, Date, Publisher

Publisher: Name, City

Authors: Name, Area+ (Fields)

Publication: Title, Area, Type (Book/conf. …), AuthorList, Collection

Citation Groups: Type (Area, Author), Style, PublicationList, CitationList

Page 12:

Models for the Bibliography Domain – Model B (Cont.)

Generating objects
  The set of Author objects and the set of Collection objects are generated independently.
  The set of Publication objects is generated conditional on the Authors and Collections.
  CitationGroup objects are generated conditional on the Authors and Collections.
  Citation objects are generated from the CitationGroups.
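
The generation order above can be sketched as a top-down sampling routine. The class fields, the `focus_author` attribute, the style names, and all counts are hypothetical choices made for this sketch; only the ordering of the four steps comes from the slide.

```python
import random
from dataclasses import dataclass

@dataclass
class Publication:
    title: str
    authors: list
    collection: str

@dataclass
class CitationGroup:
    focus_author: str   # hypothetical: the group cites this author's work
    style: str

@dataclass
class Citation:
    group: CitationGroup
    publication: Publication

def generate_world(rng, n_authors=3, n_collections=2, n_pubs=4,
                   n_groups=2, cites_per_group=3):
    # Step 1: Authors and Collections, generated independently.
    authors = [f"Author{i}" for i in range(n_authors)]
    collections = [f"Collection{i}" for i in range(n_collections)]
    # Step 2: Publications, conditional on Authors and Collections.
    publications = [
        Publication(f"Title{i}",
                    rng.sample(authors, rng.randint(1, n_authors)),
                    rng.choice(collections))
        for i in range(n_pubs)]
    # Step 3: CitationGroups; this sketch conditions on Authors only.
    groups = [CitationGroup(rng.choice(authors), rng.choice(["APA", "IEEE"]))
              for _ in range(n_groups)]
    # Step 4: Citations, generated from the CitationGroups.
    citations = []
    for group in groups:
        pool = [p for p in publications if group.focus_author in p.authors]
        for _ in range(cites_per_group):
            citations.append(Citation(group, rng.choice(pool or publications)))
    return authors, collections, publications, groups, citations

authors, collections, publications, groups, citations = generate_world(
    random.Random(0))
print(len(publications), len(groups), len(citations))
```

Inference runs this process in reverse: given only the citation text, it recovers the most probable publications, authors, and collections.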

Page 13:

Models for the Bibliography Domain – Model B (Cont.)

Fill attributes
  Author.Name
    is chosen from a mixture of a letter bigram distribution with a distribution that chooses from a set of commonly occurring names
  Publications.Title
    is generated from an n-gram model, conditioned on Publications.area
  More specific relations and conditions between attributes
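
The mixture used for Author.Name can be written out as a short sketch. The mixing weight, the two-entry table of common names, and the uniform stand-in for the letter bigram model are all assumptions for illustration, not values from the papers.

```python
import math

def mixture_log_prob(name, common_names, bigram_log_prob, weight_common=0.7):
    """log P(name) under a two-component mixture.

    common_names: dict mapping name -> probability (sums to 1).
    bigram_log_prob: callable giving log P(name) under a letter model.
    weight_common: mixing weight; 0.7 is an arbitrary illustrative value.
    """
    p_common = common_names.get(name, 0.0)
    p_bigram = math.exp(bigram_log_prob(name))
    return math.log(weight_common * p_common
                    + (1.0 - weight_common) * p_bigram)

def uniform_letter_log_prob(name):
    """Stand-in for a trained letter bigram model: uniform over 26 letters
    plus an end marker (hypothetical, for illustration only)."""
    return (len(name) + 1) * math.log(1.0 / 27.0)

common = {"smith": 0.6, "johnson": 0.4}
lp_smith = mixture_log_prob("smith", common, uniform_letter_log_prob)
lp_rare = mixture_log_prob("shpitser", common, uniform_letter_log_prob)
print(lp_smith, lp_rare)
```

The letter-model component keeps the probability of an unseen name above zero, while the common-name component gives frequent names most of the mass.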

Page 14:

Experiment on Model A – Experiment Setting

Dataset
  Citeseer’s hand-matched datasets
  Each of these datasets contains several hundred citations of machine learning papers, half of them in clusters ranging in size from two to twenty-one citations

Citeseer’s phrase matching algorithm
  a greedy agglomerative clustering method based on a metric that measures the degree to which the words and phrases of any two citations overlap
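
A greedy agglomerative clustering baseline of this general kind can be sketched as follows. The Jaccard word overlap, the single-linkage rule, and the 0.5 threshold are stand-ins chosen for this sketch; Citeseer's actual metric also uses phrases and is not specified on the slide.

```python
def word_overlap(a, b):
    """Jaccard overlap of the two word sets (a stand-in metric)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def greedy_agglomerative(citations, threshold=0.5):
    """Repeatedly merge the two most similar clusters (single linkage)
    until no pair of clusters reaches the threshold."""
    clusters = [[i] for i in range(len(citations))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = max(word_overlap(citations[i], citations[k])
                          for i in clusters[x] for k in clusters[y])
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x].extend(clusters.pop(y))  # y > x, so index x stays valid
    return clusters

citations = [
    "pasula identity uncertainty and citation matching nips 2003",
    "pasula et al identity uncertainty and citation matching",
    "pfeffer probabilistic reasoning for complex systems",
]
clusters = greedy_agglomerative(citations)
print(clusters)
```

Unlike the probabilistic model, this baseline never parses the citation, so it cannot tell an author word from a title word.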

Page 15:

Experiment on Model A – Experiment Result

Page 16:

Desiderata for a FOPL

Contains
  A probability distribution over possible worlds
  The expressive power to model the relational structure of the world
  An efficient inference algorithm
  A learning procedure which allows priors over the parameters

Has the ability
  to answer queries
  to make inferences about the existence or nonexistence of objects having particular properties
  to represent common types of compound objects
  to represent probabilistic dependencies
  to incorporate domain knowledge into the inference algorithms

Page 17:

Conclusions

First-order probabilistic models
  a useful, probably necessary, component of any system that extracts complex relational information from unstructured text data

Some of the directions we plan to pursue in the future
  defining a representation language that allows such models to be specified declaratively
  scaling up the inference procedure to handle large knowledge bases

Page 18:

Thanks!!