Adaptive Graphical Approach to Entity Resolution
Dmitri V. Kalashnikov
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department
University of California, Irvine
Additional information is available at http://www.ics.uci.edu/~dvk
Copyright © by Dmitri V. Kalashnikov, 2007
ACM IEEE Joint Conference on Digital Libraries 2007
2
Structure of the Talk
• Motivation
• Generic Disambiguation Framework
– High-level
• Entity Resolution Approach
– Part of the Framework
• Experiments
3
Entity Resolution & Data Cleaning
Raw Dataset(s)
...J. Smith ...
.. John Smith ...
.. Jane Smith ...
MIT
Intel Inc. ?
A "nice" regular Database
Analysis on bad data leads to wrong conclusions!
• Uncertainty
• Errors
• Missing data
4
Why do we need “Entity Resolution”?
Hi, I’m Jane Smith. I’d like to apply for a faculty position.
Wow! I am sure we will accept a strong candidate like that!
Jane Smith – Fresh Ph.D.; Tom – Recruiter
OK, let me check something quickly …
???
Publications:1. ……2. ……3. ……
Publications:1. ……2. ……3. ……
CiteSeer Rank
5
Suspicious entries
– Let’s go to the DBLP website
– which stores bibliographic entries of many CS authors
– Let’s check two people
– “A. Gupta”
– “L. Zhang”
What is the problem?
[Screenshots: CiteSeer, the top-k most cited authors; two DBLP pages]
6
Comparing raw and cleaned CiteSeer
Rank Author Location
1 (100.00%) douglas schmidt cs@wustl
2 (100.00%) rakesh agrawal almaden@ibm
3 (100.00%) hector garciamolina @
4 (100.00%) sally floyd @aciri
5 (100.00%) jennifer widom @stanford
6 (100.00%) david culler cs@berkeley
6 (100.00%) thomas henzinger eecs@berkeley
7 (100.00%) rajeev motwani @stanford
8 (100.00%) willy zwaenepoel cs@rice
9 (100.00%) van jacobson lbl@gov
10 (100.00%) rajeev alur cis@upenn
11 (100.00%) john ousterhout @pacbell
12 (100.00%) joseph halpern cs@cornell
13 (100.00%) andrew kahng @ucsd
14 (100.00%) peter stadler tbi@univie
15 (100.00%) serge abiteboul @inria
Raw CiteSeer’s Top-K Most Cited Authors
Cleaned CiteSeer’s Top-K Most Cited Authors
7
What is the lesson?
– Data should be cleaned first
– E.g., determine the (unique) real authors of publications
– Solving such challenges is not always “easy”
– This explains a large body of work on Entity Resolution
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
8
Typical Data Processing Flow
[Figure labels: Raw Data | Representation | Data Cleaning | Extraction | Analysis]
9
Two most common types of Entity Resolution
...J. Smith ...
.. John Smith ...
.. Jane Smith ...
MIT
Intel Inc.
Fuzzy lookup
– match references to objects
– list of all objects is given
– [SDM’05], [TODS’06]
Fuzzy grouping
– group references that co-refer
– [IQIS’05], [JCDL’07]
10
Structure of the Talk
• Motivation
• Generic Framework
– High-level
• Approach
– Part of the Framework
• Experiments
11
Traditional Approach to Entity Resolution
[Figure: reference X ("J. Smith") and reference Y ("Jane Smith") compared feature by feature (f2, f3, ...)]
Traditional Methods: Features and Context
s(X,Y) = f(X,Y)
Similarity = Similarity of Features
12
Key Observation: More Info is Available
A "nice" regular Database
Jane Smith
John Smith
J. Smith
=
13
Solution: Main Idea
[Figure: references X and Y, each with features f1 to f4. Traditional methods compare features and context; relationship analysis adds the paths that connect X and Y through nodes A to F in the ARG.]
New Paradigm
s(X,Y) = c(X,Y) + γ f(X,Y)
Similarity = Similarity of Features + “Connection Strength”
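The combined score on this slide can be illustrated with a small sketch. The slide does not say how c(X,Y) and f(X,Y) are computed, so everything here is an assumption made for illustration: `feature_similarity` uses plain string overlap as a stand-in for f, the connection strength is passed in as a number, and the values of `c_xy` and `gamma` are invented.

```python
from difflib import SequenceMatcher


def feature_similarity(x: str, y: str) -> float:
    """Stand-in for f(X, Y): string overlap between two references.

    This is an illustrative assumption, not the talk's feature function.
    """
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def combined_similarity(c_xy: float, f_xy: float, gamma: float = 1.0) -> float:
    """s(X, Y) = c(X, Y) + gamma * f(X, Y), following the slide's formula."""
    return c_xy + gamma * f_xy


# Hypothetical inputs: a made-up connection strength and weight.
f = feature_similarity("J. Smith", "Jane Smith")
s = combined_similarity(c_xy=0.5, f_xy=f, gamma=2.0)
```

With γ = 0 the score reduces to the connection strength alone; larger γ gives the feature term more weight.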
14
Illustrative Example
“Indirect connections”
– Suppose your co-worker’s name is “John White”
– Suppose you see on the Web, on my homepage
– My name: “Dmitri …”
– Somebody named: “John White”
– Who is this “John White”?
– From data you might establish a connection:
– “Dmitri” might be connected to more “John White”s…
[Figure: connection graph. Dmitri and <you> are both linked to JCDL’07 by “Visited” edges; <you> and John White are both linked to <your ORG> by “WorksAT” edges.]
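A minimal sketch of how such an indirect connection could be found, assuming the example is stored as an undirected edge list and searched breadth-first. The edge list is a reading of the slide's figure, and `build_adjacency` and `shortest_path` are hypothetical helpers, not part of the framework described in the talk.

```python
from collections import deque

# Edges as read from the slide's example: (node, node, relationship label).
edges = [
    ("Dmitri", "JCDL'07", "Visited"),
    ("<you>", "JCDL'07", "Visited"),
    ("<you>", "<your ORG>", "WorksAT"),
    ("John White", "<your ORG>", "WorksAT"),
]


def build_adjacency(edge_list):
    """Build an undirected adjacency map, ignoring relationship labels."""
    adj = {}
    for u, v, _label in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path(adj, source, target):
    """Breadth-first search; returns the node list of a shortest path or None."""
    if source not in adj or target not in adj:
        return None
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


path = shortest_path(build_adjacency(edges), "Dmitri", "John White")
```

Here the only path runs through the conference, the reader, and the reader's organization, which is the kind of evidence the slide suggests for deciding which “John White” is meant.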
15
Key Features of the Framework
Our goal is to create a framework with the following properties:
– solid theoretical foundation
– lookup
– domain-independent framework
– self-tuning
– scales to large datasets
– robust under uncertainty
– high disambiguation quality
16
Structure of the Talk
• Motivation
• Generic Framework – High-level
• Approach
– Part of the Framework
• Experiments
17
Approach
• Graph Creation
– Entity-Relationship Graph
• Consolidation Algorithm
– Bottom-up clustering
• Adaptiveness to data
– That is, self-tuning
– Supervised learning
• External Data
– To improve the quality further
– A theoretical possibility
– Not tested yet
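The transcript gives no detail on the consolidation step beyond “bottom-up clustering”, so the following is only a generic agglomerative sketch, not the authors' algorithm. The average-linkage rule, the `threshold` stopping criterion, and the toy similarity scores are all assumptions made for illustration.

```python
from itertools import combinations


def bottom_up_cluster(items, sim, threshold):
    """Greedy bottom-up clustering.

    Start with one cluster per item, then repeatedly merge the most
    similar pair of clusters until no pair reaches `threshold`.
    Cluster similarity is the average pairwise similarity (an assumption).
    """
    clusters = [[item] for item in items]

    def cluster_sim(a, b):
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[i].extend(clusters[j])  # i < j, so index j stays valid
        del clusters[j]
    return clusters


# Toy similarity scores, invented for this example.
scores = {
    frozenset({"J. Smith", "John Smith"}): 0.9,
    frozenset({"J. Smith", "Jane Smith"}): 0.2,
    frozenset({"John Smith", "Jane Smith"}): 0.1,
}
result = bottom_up_cluster(
    ["J. Smith", "John Smith", "Jane Smith"],
    sim=lambda x, y: scores[frozenset({x, y})],
    threshold=0.5,
)
```

In a self-tuning variant, the threshold would be learned from labeled data rather than fixed by hand.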
18
ER Graph Creation
19
Virtual Connected Subgraph (VCS)
[Figure legend. Nodes: person, publication, department, organization. Edges: similarity, regular. A VCS is highlighted.]
• VCS
– Similarity edges form VCSs
– Subgraphs in the ER graph
1. “Virtual”
– Contains only similarity edges
2. “Connected”
– A path between any 2 nodes
3. Completeness
– Adding more nodes/edges would violate (1) and (2)
• Logically, the Goal is
– Partition each VCS properly
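Under the three properties above, the VCSs are the connected components of the graph that keeps only similarity edges. A minimal sketch using union-find follows; the function name and the reference IDs `r1` to `r5` are hypothetical, and regular edges are simply left out of the input.

```python
def virtual_connected_subgraphs(similarity_edges):
    """Group nodes into VCSs: maximal sets connected by similarity edges only."""
    parent = {}

    def find(n):
        # Find the set representative, halving the path as we go.
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in similarity_edges:
        parent[find(u)] = find(v)  # union the two sets

    groups = {}
    for n in list(parent):
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


# Hypothetical similarity edges between reference nodes.
vcs_list = virtual_connected_subgraphs(
    [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
)
```

Each resulting set is one VCS; the consolidation step then decides how to partition it into groups of co-referring references.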
20
Consolidation Algorithm: Merging
21
Self-tuning via Supervised Learning
22
Self-tuning (2)
23
External Knowledge to Improve Quality
24
Structure of the Talk
• Motivation
• Generic Framework – High-level
• Approach
– Part of the Framework
• Experiments
25
Quality
“Context” is proposed in [Bhattacharya et al., DMKD’04]
The two algorithms are proposed in [Dong et al., SIGMOD’05]
26
Scalability & Efficiency
27
Impact of Random Relationships
28
Contact Information
• Info about our disambiguation project
– http://www.ics.uci.edu/~dvk
• Overall design
– Dmitri V. Kalashnikov
– dvk [at] domain
• Implementation details in JCDL’07
– Zhaoqi (Stella) Chen
– chenz [at] domain
– domain = ics.uci.edu