Advisor: Hsin-Hsi Chen Reporter: Chi-Hsin Yu Date: 2007.06.21
IJCAI 2003 Workshop on Learning Statistical Models from Relational Data
First-Order Probabilistic Models for Information Extraction
Bhaskara Marthi, Brian Milch, Stuart Russell
Computer Science Div., University of California
NIPS 15th, 2003
Identity Uncertainty and Citation Matching
Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser
Computer Science Div., University of California
Outline
  Introduction
  Related works
  Models for the bibliography domain
  Experiment on model A
  Desiderata for a FOPL
  Conclusions
Introduction – Citation Matching Problem
Citation matching: the problem of deciding which citations correspond to the same publication
Difficulties
  Different citation styles
  An imperfect copy of the book’s title
  Different ways to refer to an object (identity)
  Ambiguity
    “Wauchope, K. Eucalyptus: Integrating Natural language Input with a Graphical User Interface”
    Author: “Wauchope, K. Eucalyptus” or “Wauchope, K.”?
Tasks
  Parsing
  Disambiguation
  Matching
Introduction – Citation Matching Problem: Examples
Journal of Artificial Intelligence Research, or Artificial Intelligence Journal?
Introduction – First-Order Probabilistic Models
Propositional
  Formula
    Logic: A B
    Probabilistic model: P(j ∧ m ∧ a ∧ ¬b ∧ ¬e), Bayesian network
  Inference/Algorithms
    Logic: Resolution, model checking, forward chaining, DPLL, WalkSAT, …
    Probabilistic model: Bayes’ rule, summing out, smoothing, prediction, approximation (likelihood, MCMC, …), …
First-order
  Formula
    Logic: ∀x King(x) ∧ Greedy(x) ⇒ Evil(x)
    Probabilistic model: ∀x King(x) ∧ Greedy(x) ⇒ Evil(x) = 0.8766
  Inference/Algorithms
    Logic: Unification, resolution, …
    Probabilistic model: Learning, approximation, …
  System/Languages
    Logic: Prolog, rule engine (JBoss), …
    Probabilistic model: FOPL, RPM
Introduction – Result of Model B
Related Works
IE
  The Message Understanding Conferences [DARPA, 1998]
Bayesian modeling
  Finding stochastically repeated patterns (motifs) in DNA sequences [Xing et al., 2003]
  Robot localization [Anguelov et al., 2002]
FOPL/RPM (Relational Probabilistic Model)
  A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.
Models for the Bibliography Domain – Model A
[Pasula et al., 2003]
Models for the Bibliography Domain – Model A (Cont.)
Suggests a declarative approach to identity uncertainty using a formal language
Algorithm steps
  Generate objects/instances
  Parse and fill attributes
  Inference (approximation, MCMC)
  Cluster the identities (publications)
Models for the Bibliography Domain – Model A (Cont.)
Attributes using unconditional probability
  Learn several bigram models
    Letter-based models of first names, surnames, and title words
    Using the following resources
      The 2000 Census data on US names
      A large A.I. BibTeX bibliography
      A hand-parsed collection of 500 citations
Attributes using conditional probability
  Using noisy channels for some attributes
    The corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames
    The parameters of the corruption models are learnt online, using stochastic EM
  Citation.parse
    Keeps track of the segmentation of Citation.text
    An author segment, a title segment, and three filler segments (one before, one after, and one in between)
  Citation.text
    Constrained by Citation.parse, Paper.pubType, …
    These models were learned using the authors’ pre-segmented file.
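The five-segment structure tracked by Citation.parse can be illustrated with a small Python sketch that enumerates every candidate segmentation of a tokenized citation. This is only an illustration of the search space, not the parser used in the paper: the function name `enumerate_parses`, the dictionary keys, and the assumptions that the author segment precedes the title segment and that both are non-empty are all hypothetical.

```python
from itertools import combinations_with_replacement


def enumerate_parses(tokens):
    """Yield every split of tokens into filler, author, filler, title, filler.

    Assumptions of this sketch: the author segment comes before the title
    segment, both contain at least one token, and fillers may be empty.
    """
    n = len(tokens)
    # Four non-decreasing cut points a <= b <= c <= d give five segments.
    for a, b, c, d in combinations_with_replacement(range(n + 1), 4):
        if b > a and d > c:  # author and title must be non-empty
            yield {
                "filler_before": tokens[:a],
                "author": tokens[a:b],
                "filler_between": tokens[b:c],
                "title": tokens[c:d],
                "filler_after": tokens[d:],
            }


tokens = ["Wauchope,", "K.", "Eucalyptus:", "Integrating"]
parses = list(enumerate_parses(tokens))
```

Both readings of the ambiguous Wauchope citation from the introduction (author ending at "K." or at "Eucalyptus:") appear among the enumerated parses; a probabilistic model then has to score them.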
Models for the Bibliography Domain – Model B
Citation: Publication, TitleAsCited, AuthorsAsCited, Text, Parse
Collection: Name, Type, Date, Publisher
Publisher: Name, City
Authors: Name, Area+ (Fields)
Publication: Title, Area, Type (Book/conf. …), AuthorList, Collection
Citation Groups: Type (Area, Author), Style, PublicationList, CitationList
Models for the Bibliography Domain – Model B (Cont.)
Generating objects
  The set of Author objects and the set of Collection objects are generated independently.
  The set of Publication objects is generated conditional on the Authors and Collections.
  CitationGroup objects are generated conditional on the Authors and Collections.
  Citation objects are generated from the CitationGroups.
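The generation order above can be sketched as a small sampling routine in Python. Everything concrete here is an assumption made for illustration: the uniform count distributions, the field names, the `generate_world` function, and the rule that a citation group focuses on one author or one collection and lists the matching publications. The slides only specify which object sets depend on which.

```python
import random


def generate_world(rng, max_count=4):
    """Sample object sets in the dependency order listed above.

    Uniform counts and all field names are assumptions of this sketch.
    """
    # Authors and Collections are generated independently of each other.
    authors = [f"author{i}" for i in range(rng.randint(1, max_count))]
    collections = [f"collection{i}" for i in range(rng.randint(1, max_count))]

    # Publications are generated conditional on Authors and Collections.
    publications = [
        {
            "id": f"pub{i}",
            "authors": rng.sample(authors, rng.randint(1, len(authors))),
            "collection": rng.choice(collections),
        }
        for i in range(rng.randint(1, max_count))
    ]

    # CitationGroups are generated conditional on Authors and Collections.
    # Assumption: each group focuses on one author or one collection.
    groups = []
    for i in range(rng.randint(1, max_count)):
        kind = rng.choice(["author", "collection"])
        focus = rng.choice(authors if kind == "author" else collections)
        groups.append({
            "id": f"group{i}",
            "focus": focus,
            "publications": [
                p for p in publications
                if focus in p["authors"] or focus == p["collection"]
            ],
        })

    # Citations are generated from the CitationGroups.
    citations = [
        {"group": g["id"], "publication": rng.choice(g["publications"])["id"]}
        for g in groups
        if g["publications"]
        for _ in range(rng.randint(1, max_count))
    ]
    return {"authors": authors, "collections": collections,
            "publications": publications, "groups": groups,
            "citations": citations}


world = generate_world(random.Random(0))
```

Passing a seeded `random.Random` keeps the sample reproducible, which is convenient when comparing inference results against a known generated world.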
Models for the Bibliography Domain – Model B (Cont.)
Fill attributes
  Author.Name
    Chosen from a mixture of a letter bigram distribution and a distribution that chooses from a set of commonly occurring names
  Publications.Title
    Generated from an n-gram model, conditioned on Publications.area
  More specific relations and conditions between attributes
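The Author.Name mixture can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the model from the paper: the add-one smoothing, the boundary markers `^` and `$`, the mixture weight of 0.8, the toy name table, and the function names `make_letter_bigram` and `author_name_prob` are all hypothetical.

```python
from collections import Counter


def make_letter_bigram(training_names):
    """Return a function giving P(name) under a letter bigram model.

    Uses add-one smoothing over the symbols seen in training, so letters
    never seen in training only receive a rough, unnormalized score.
    """
    pair_counts, context_counts = Counter(), Counter()
    symbols = {"$"}
    for name in training_names:
        chars = ["^"] + list(name) + ["$"]
        symbols.update(chars[1:])
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[prev, cur] += 1
            context_counts[prev] += 1
    size = len(symbols)

    def prob(name):
        chars = ["^"] + list(name) + ["$"]
        p = 1.0
        for prev, cur in zip(chars, chars[1:]):
            p *= (pair_counts[prev, cur] + 1) / (context_counts[prev] + size)
        return p

    return prob


def author_name_prob(name, common_names, bigram_prob, weight_common=0.8):
    """Mixture: w * P_common(name) + (1 - w) * P_bigram(name)."""
    p_common = common_names.get(name, 0.0)
    return weight_common * p_common + (1.0 - weight_common) * bigram_prob(name)


# Made-up probabilities for a tiny set of "commonly occurring" names.
common_names = {"smith": 0.5, "johnson": 0.3, "chen": 0.2}
bigram_prob = make_letter_bigram(list(common_names))

p_known = author_name_prob("smith", common_names, bigram_prob)
p_unknown = author_name_prob("smyth", common_names, bigram_prob)
```

The bigram component is what keeps unlisted names such as "smyth" from receiving zero probability, while the common-name component concentrates mass on names that actually occur often.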
Experiment on model A – Experiment Setting
Dataset
  Citeseer’s hand-matched datasets
  Each of these datasets contains several hundred citations of machine learning papers, half of them in clusters ranging in size from two to twenty-one citations
Citeseer’s phrase matching algorithm
  A greedy agglomerative clustering method based on a metric that measures the degree to which the words and phrases of any two citations overlap
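The baseline idea of greedy agglomerative clustering over a word-overlap metric can be sketched in Python as below. The slides do not give Citeseer's actual metric, so this sketch substitutes Jaccard overlap of word sets, single-linkage cluster similarity, and a threshold of 0.5; those choices and the function names are assumptions, not Citeseer's implementation.

```python
def word_overlap(a, b):
    """Jaccard overlap between the lowercase word sets of two citations."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def greedy_cluster(citations, threshold=0.5):
    """Repeatedly merge the most similar pair of clusters.

    Cluster similarity is the best pairwise overlap (single linkage).
    Merging stops when no pair reaches the threshold.
    """
    clusters = [[c] for c in citations]
    while len(clusters) > 1:
        best, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(word_overlap(a, b)
                          for a in clusters[i] for b in clusters[j])
                if sim >= best:
                    best, best_pair = sim, (i, j)
        if best_pair is None:
            break
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


citations = [
    "Pasula H Identity uncertainty and citation matching NIPS 2003",
    "H Pasula et al Identity Uncertainty and Citation Matching 2003",
    "Pfeffer A Probabilistic reasoning for complex systems PhD thesis Stanford 2000",
]
clusters = greedy_cluster(citations)
```

With these three toy strings the two Pasula citations share most of their words and end up in one cluster, while the Pfeffer citation stays on its own.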
Experiment on model A – Experiment Result
Desiderata for a FOPL
Contains
  A probability distribution over possible worlds
  The expressive power to model the relational structure of the world
  An efficient inference algorithm
  A learning procedure which allows priors over the parameters
Has the ability
  To answer queries
  To make inferences about the existence or nonexistence of objects having particular properties
  To represent common types of compound objects
  To represent probabilistic dependencies
  To incorporate domain knowledge into the inference algorithms
Conclusions
First-order probabilistic models
  A useful, probably necessary, component of any system that extracts complex relational information from unstructured text data
Some of the directions we plan to pursue in the future
  Defining a representation language that allows such models to be specified declaratively
  Scaling up the inference procedure to handle large knowledge bases
Thanks!!