Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2007.06.21

IJCAI 2003 Workshop on Learning Statistical Models from Relational Data

First-Order Probabilistic Models for Information Extraction


Bhaskara Marthi, Brian Milch, Stuart Russell

Computer Science Div., University of California

NIPS 15th, 2003

Identity Uncertainty and Citation Matching

Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser

Computer Science Div., University of California

Outline

  Introduction
  Related works
  Models for the bibliography domain
  Experiment on model A
  Desiderata for a FOPL
  Conclusions


Introduction – Citation Matching Problem

Citation matching: the problem of deciding which citations correspond to the same publication
Difficulties
  Different citation styles
  An imperfect copy of the book’s title
  Different ways to refer to an object (identity)
  Ambiguity
    “Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface”
    Author: “Wauchope, K. Eucalyptus” or “Wauchope, K.”?
Tasks
  Parsing
  Disambiguation
  Matching
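
The parsing ambiguity in the Wauchope example can be made concrete with a small sketch. It assumes, purely for illustration, that an author/title boundary may fall after any period or colon followed by whitespace; the function name `candidate_splits` is hypothetical and does not come from the papers.

```python
import re

def candidate_splits(citation):
    """Return every (author, title) pair obtained by cutting the citation
    after a '.' or ':' that is followed by whitespace.

    Assumption for illustration only: a real citation parser would use
    far richer cues than punctuation.
    """
    splits = []
    for match in re.finditer(r"[.:]\s+", citation):
        # Keep a trailing period (part of an initial) but drop a trailing colon.
        author = citation[:match.start() + 1].rstrip(":").strip()
        title = citation[match.end():].strip()
        if author and title:
            splits.append((author, title))
    return splits

citation = ("Wauchope, K. Eucalyptus: Integrating Natural Language Input "
            "with a Graphical User Interface")
for author, title in candidate_splits(citation):
    print(f"author={author!r}  title={title!r}")
```

Both readings from the slide appear among the candidates, so a parser needs additional knowledge (for example, a model of what surnames look like) to choose between them.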


Introduction – Citation Matching Problem: Examples


Journal of Artificial Intelligence Research, or Artificial Intelligence Journal?

Introduction – First-Order Probabilistic Models

Propositional
  Formula
    Logic: A B
    Probabilistic model: P(j ∧ m ∧ a ∧ ¬b ∧ ¬e), Bayesian network
  Inference/Algorithms
    Logic: resolution, model checking, forward chaining, DPLL, WalkSAT, …
    Probabilistic model: Bayes’ rule, summing out, smoothing, prediction, approximation (likelihood, MCMC, …), …
First-order
  Formula
    Logic: ∀x King(x) ∧ Greedy(x) ⇒ Evil(x)
    Probabilistic model: ∀x King(x) ∧ Greedy(x) ⇒ Evil(x) = 0.8766
  Inference/Algorithms
    Logic: unification, resolution, …
    Probabilistic model: learning, approximation, …
  System/Languages
    Logic: Prolog, rule engine (JBoss), …
    Probabilistic model: FOPL, RPM
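
The propositional entry pairs a joint-probability query with a Bayesian network. A minimal sketch of how such a query is answered, assuming it refers to the textbook burglary-alarm network (John calls, Mary calls, the alarm rings, no burglary, no earthquake); the conditional probability values are the textbook ones, and the helper names `bernoulli` and `joint` are made up for this example.

```python
# CPT values of the textbook burglary network (assumed to be the
# network the slide refers to).
P_B = 0.001                      # P(burglary)
P_E = 0.002                      # P(earthquake)
P_A = {                          # P(alarm | burglary, earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}  # P(john_calls | alarm)
P_M = {True: 0.70, False: 0.01}  # P(mary_calls | alarm)

def bernoulli(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(j, m, a, b, e):
    """Chain-rule factorisation: P(b) P(e) P(a | b, e) P(j | a) P(m | a)."""
    return (bernoulli(P_B, b) * bernoulli(P_E, e)
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_J[a], j) * bernoulli(P_M[a], m))

p = joint(j=True, m=True, a=True, b=False, e=False)
print(f"{p:.6f}")  # 0.000628
```

Each variable contributes one factor conditioned only on its parents, which is what keeps the propositional case tractable; the first-order setting has to cope with an unknown set of objects as well.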

Introduction – Result of Model B


Related Works

IE (information extraction)
  The Message Understanding Conferences [DARPA, 1998]
Bayesian modeling
  Finding stochastically repeated patterns (motifs) in DNA sequences [Xing et al., 2003]
  Robot localization [Anguelov et al., 2002]
FOPL/RPM (Relational Prob. Model)
  A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.


Models for the Bibliography Domain – Model A

[Pasula et al. 2003]


Models for the Bibliography Domain – Model A (Cont.)

Suggest a declarative approach to identity uncertainty using a formal language
Algorithm Steps
  Generate objects/instances
  Parse and fill attributes
  Inference (approximation, MCMC)
  Cluster the identity (publication)
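
The inference and clustering steps above can be illustrated with a toy Metropolis sampler over cluster assignments. This is only a sketch under loud assumptions: the scoring function is a word-overlap stand-in invented for this example, not the probabilistic model of Pasula et al., and the citation strings, the constants 0.05 and 0.5, and all function names are hypothetical.

```python
import math
import random

def similarity(a, b):
    """Jaccard overlap of lower-cased word sets (toy stand-in for a model)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def log_score(citations, labels):
    """Toy unnormalised log-probability of a clustering: a pair placed in
    the same cluster is rewarded if similar and penalised if dissimilar."""
    total = 0.0
    for i in range(len(citations)):
        for j in range(i + 1, len(citations)):
            if labels[i] == labels[j]:
                sim = similarity(citations[i], citations[j])
                total += math.log(sim + 0.05) - math.log(0.5)
    return total

def mcmc_cluster(citations, steps=2000, seed=0):
    """Metropolis sampling over label vectors; returns the best state seen."""
    rng = random.Random(seed)
    n = len(citations)
    labels = list(range(n))               # start with every citation alone
    current = log_score(citations, labels)
    best, best_score = labels[:], current
    for _ in range(steps):
        proposal = labels[:]
        # Symmetric proposal: relabel one citation uniformly at random.
        proposal[rng.randrange(n)] = rng.randrange(n)
        new = log_score(citations, proposal)
        # Metropolis acceptance rule for a symmetric proposal.
        if new >= current or rng.random() < math.exp(new - current):
            labels, current = proposal, new
            if current > best_score:
                best, best_score = labels[:], current
    return best, best_score

# Illustrative strings only, not taken from the Citeseer datasets.
citations = [
    "Wauchope K. Eucalyptus: Integrating natural language input with a graphical user interface",
    "K. Wauchope. Eucalyptus: integrating natural language input with a graphical user interface",
    "Pasula H. Identity uncertainty and citation matching",
    "H. Pasula et al. Identity uncertainty and citation matching. NIPS",
]
best, best_score = mcmc_cluster(citations)
print(best, round(best_score, 3))
```

Citations that share a label in the returned vector are treated as referring to the same publication. Recomputing the full score at every step costs quadratic time per proposal, which is acceptable for a toy example but not for a real citation database.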


Models for the Bibliography Domain – Model A (Cont.)

Attributes using unconditional probability
  Learn several bigram models
    Letter-based models of first names, surnames, and title words
    using the following resources
      the 2000 Census data on US names
      a large A.I. BibTeX bibliography
      a hand-parsed collection of 500 citations
Attributes using conditional probability
  Using noisy channels for some attributes
    The corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames
    The parameters of the corruption models are learnt online, using stochastic EM
  Citation.parse
    It keeps track of the segmentation of Citation.text
    An author segment, a title segment, and three filler segments (one before, one after, and one in between)
  Citation.text
    Is constrained by Citation.parse, Paper.pubType, …
    These models were learned using our pre-segmented file.
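
A letter-based bigram model of the kind listed under the unconditional attributes can be sketched as follows. The add-one smoothing, the boundary symbols, the restriction to the letters a to z, and the class name `LetterBigramModel` are all assumptions made for this example; the slides do not say how the original models were smoothed. The training list is just the surnames that appear on the title slides, standing in for the Census data.

```python
import math
import string
from collections import Counter

START, END = "^", "$"

class LetterBigramModel:
    """Letter bigram model over a-z with add-one smoothing (an assumption)."""

    def __init__(self, words):
        self.alphabet = set(string.ascii_lowercase) | {END}
        self.bigrams = Counter()    # counts of (previous, current) pairs
        self.contexts = Counter()   # counts of each previous symbol
        for word in words:
            chars = [START] + list(word.lower()) + [END]
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, word):
        """Log-probability of the whole word, including the end marker."""
        chars = [START] + list(word.lower()) + [END]
        size = len(self.alphabet)
        return sum(
            math.log((self.bigrams[(prev, cur)] + 1)
                     / (self.contexts[prev] + size))
            for prev, cur in zip(chars, chars[1:])
        )

# Tiny stand-in for the real training resources.
surnames = ["russell", "marthi", "milch", "pasula", "shpitser"]
model = LetterBigramModel(surnames)
print(model.log_prob("russel"), model.log_prob("xqzjk"))
```

A misspelled but name-like string such as "russel" scores higher than a random letter sequence, which is the property the generative model relies on when it has to decide whether a text segment is plausibly a surname.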


Models for the Bibliography Domain – Model B

Citation: Publication, TitleAsCited, AuthorsAsCited, Text, Parse
Collection: Name, Type, Date, Publisher
Publisher: Name, City
Authors: Name, Area+ (Fields)
Publication: Title, Area, Type (Book/conf. …), AuthorList, Collection
Citation Groups: Type (Area, Author), Style, PublicationList, CitationList

Models for the Bibliography Domain – Model B (Cont.)

Generating objects
  The set of Author objects and the set of Collection objects are generated independently.
  The set of Publication objects is generated conditional on the Authors and Collections.
  CitationGroup objects are generated conditional on the Authors and Collections.
  Citation objects are generated from the CitationGroups.
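
The generation order above amounts to ancestral sampling: each set of objects is drawn only after the sets it depends on. The sketch below shows that ordering and nothing more. The fixed object counts, the uniform choices, the placeholder names, and the `style` values are assumptions made for illustration; in the actual model the number of objects is itself uncertain and the attributes follow learned distributions.

```python
import random
from dataclasses import dataclass

@dataclass
class Author:
    name: str

@dataclass
class Collection:
    name: str

@dataclass
class Publication:
    title: str
    authors: list
    collection: Collection

@dataclass
class CitationGroup:
    author: Author
    collection: Collection
    style: str

@dataclass
class Citation:
    group: CitationGroup
    publication: Publication

def sample_world(rng, n_authors=3, n_collections=2, n_publications=4,
                 n_groups=2, citations_per_group=3):
    # 1. Authors and Collections are generated independently.
    authors = [Author(f"author_{i}") for i in range(n_authors)]
    collections = [Collection(f"collection_{i}") for i in range(n_collections)]
    # 2. Publications are generated conditional on Authors and Collections.
    publications = [
        Publication(f"title_{i}",
                    rng.sample(authors, rng.randint(1, n_authors)),
                    rng.choice(collections))
        for i in range(n_publications)
    ]
    # 3. CitationGroups are generated conditional on Authors and Collections.
    groups = [
        CitationGroup(rng.choice(authors), rng.choice(collections),
                      rng.choice(["plain", "abbrv"]))
        for _ in range(n_groups)
    ]
    # 4. Citations are generated from the CitationGroups.
    citations = [Citation(group, rng.choice(publications))
                 for group in groups for _ in range(citations_per_group)]
    return authors, collections, publications, groups, citations

world = sample_world(random.Random(0))
print(len(world[-1]), "citations generated")
```

Inference runs this process in reverse: given only the citation text, it has to recover which publications, authors, and collections most plausibly produced it.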


Models for the Bibliography Domain – Model B (Cont.)

Fill attributes
  Author.Name
    is chosen from a mixture of a letter bigram distribution with a distribution that chooses from a set of commonly occurring names
  Publications.Title
    is generated from an n-gram model, conditioned on Publications.area
  More specific relations and conditions between attributes
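
The mixture used for Author.Name can be written as P(name) = w · P_common(name) + (1 − w) · P_letters(name). The sketch below shows only the mixing step. The weight 0.9, the name frequencies, and the function names are invented for illustration, and the letter component is a uniform per-character stand-in rather than a trained bigram model.

```python
import string

def letter_model_prob(name):
    """Stand-in for the letter bigram component: every character and the
    end marker are uniform over 26 letters plus the end symbol.
    This is NOT a trained bigram model."""
    return (1.0 / (len(string.ascii_lowercase) + 1)) ** (len(name) + 1)

def name_prob(name, common_names, weight_common=0.9,
              letter_prob=letter_model_prob):
    """Mixture of a common-name table and a letter-level model."""
    name = name.lower()
    return (weight_common * common_names.get(name, 0.0)
            + (1.0 - weight_common) * letter_prob(name))

# Made-up relative frequencies, for illustration only.
common = {"smith": 0.5, "johnson": 0.3, "chen": 0.2}

print(name_prob("Smith", common))   # dominated by the common-name table
print(name_prob("Smyth", common))   # small but non-zero via the letter model
```

The design point is that an unseen or misspelled name never receives probability zero, so a rare author does not make a citation impossible under the model.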


Experiment on Model A – Experiment Setting

Dataset
  Citeseer’s hand-matched datasets
  Each of these datasets contains several hundred citations of machine learning papers, half of them in clusters ranging in size from two to twenty-one citations
Citeseer’s phrase matching algorithm
  A greedy agglomerative clustering method based on a metric that measures the degree to which the words and phrases of any two citations overlap
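
A greedy agglomerative method of the kind used as the baseline can be sketched as follows. This is not Citeseer's actual algorithm: the Jaccard word overlap, the single-link merging rule, the 0.5 threshold, and the example strings are all assumptions chosen to keep the example short.

```python
def jaccard(a, b):
    """Word-set overlap between two citation strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def greedy_agglomerative(citations, threshold=0.5):
    """Repeatedly merge the two most similar clusters until no pair of
    clusters reaches the similarity threshold."""
    clusters = [[c] for c in citations]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(jaccard(a, b)
                          for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # j > i, so index i stays valid
    return clusters

# Illustrative strings only, not taken from the Citeseer datasets.
examples = [
    "Wauchope K. Eucalyptus: Integrating natural language input with a graphical user interface",
    "K. Wauchope. Eucalyptus: integrating natural language input with a graphical user interface",
    "Pasula H. Identity uncertainty and citation matching",
    "H. Pasula et al. Identity uncertainty and citation matching. NIPS",
]
for cluster in greedy_agglomerative(examples):
    print(len(cluster), cluster[0][:40])
```

Because merges are never undone, an early mistake cannot be corrected later, which is one motivation for comparing this baseline against a probabilistic model with sampling-based inference.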


Experiment on Model A – Experiment Result


Desiderata for a FOPL

Contains
  A probability distribution over possible worlds
  The expressive power to model the relational structure of the world
  An efficient inference algorithm
  A learning procedure which allows priors over the parameters
Has the ability
  to answer queries
  to make inferences about the existence or nonexistence of objects having particular properties
  to represent common types of compound objects
  to represent probabilistic dependencies
  to incorporate domain knowledge into the inference algorithms


Conclusions

First-order probabilistic models
  are a useful, probably necessary, component of any system that extracts complex relational information from unstructured text data
Some of the directions we plan to pursue in the future
  defining a representation language that allows such models to be specified declaratively
  scaling up the inference procedure to handle large knowledge bases


Thanks!!
