Post on 14-Dec-2015
Collective Relational Clustering
Indrajit Bhattacharya
Assistant Professor, Department of CSA
Indian Institute of Science
Relational Data

Recent abundance of relational ('non-iid') data
o Internet
o Social networks
o Citations in scientific literature
o Biological networks
o Telecommunication networks
o Customer shopping patterns
o …

Various applications
o Web Mining
o Online Advertising and Recommender Systems
o Bioinformatics
o Citation analysis
o Epidemiology
o Text Analysis
o …
Clustering for Relational Data

Lots of research in Statistical Relational Learning over the last decade
o Series of focused workshops in premier conferences
o Confluence of different research areas

Recent focus on unsupervised learning from relational data
o Regular papers in premier conferences
o Recent book: Relational Data Clustering: Models, Algorithms, and Applications, Bo Long, Zhongfei Zhang, Philip S. Yu, CRC Press 2009
Traditional vs Relational Clustering

Traditional clustering focuses on 'flat' data
o Cluster based on features of individual objects

Relational clustering additionally considers relations
o Heterogeneous relations across objects of different types
o Homogeneous relations across objects of the same type

Naïve solution: flatten the data, then cluster
o Loss of relational and structural information
o No influence propagation across relational chains
o Cannot discover interaction patterns across clusters

Collective relational clustering aims to cluster different data objects jointly
Early Instances of Relational Clustering

Graph Partitioning Problem
o Single-type homogeneous relational data

Co-clustering Problem
o Bi-type heterogeneous relational data

General relational clustering considers multi-type data with heterogeneous and homogeneous relationships
Talk Outline

Introduction
Motivating Application: Entity Resolution over Heterogeneous Relational Data
The Relational Clustering Problem
Quick Survey of Relational Clustering Approaches
Probabilistic Model for Structured Relations
Probabilistic Model for Heterogeneous Relations
Future Directions
Talk Outline → Motivating Application: Entity Resolution over Heterogeneous Relational Data
Application: Entity Resolution

Web data on Stephen Johnson
Application: Entity Resolution
Ind. Researcher
Professor
Media Presenter
Movie Director
Photographer
Administrator
Application: Entity Resolution

Data contains references to real-world entities
o Structured entities (People, Products, Institutions, …)
o Topics / Concepts (computer science, movies, politics, …)

Aim: Consolidate (cluster) according to entities
o Entity Resolution: Map structured references to entities
o Sense Disambiguation: Group words according to senses
o Topic Discovery: Group words according to topics or concepts
Relationships for Entity Resolution

Each document or structured record is a (co-occurrence) relation between references to persons, places, organizations, concepts, etc.
Relational Network Among Entities

[Figure: a network of entities, with several distinct Stephen Johnsons each linked to related entities and topics — Alfred Aho, Jeffrey Ullman, Bell Labs, Comp. Sc., Prog. Lang.; Mark Cross, Chris Walshaw, Univ of Greenwich, HPC; Photography, Ansel Adams; Cinema, Direction, Peter Gabriel; White House, EPA, George W. Bush, Government; Media, Music, BBC, Entertainment, Leeds University]
Using the Network for Clustering

Given the network, find the assignment of data items or references to these entities
o Collective cluster assignment

Find a "nice" network of entities with regularities in the relational structure
o Researchers collaborate with colleagues on similar topics
o People send emails to colleagues and friends
Collective Cluster Assignment: Example

[Figure: references grouped into entity clusters (Clusters 1–5 and 11–15) —
{Stephen Johnson, S Johnson, SC Jonshon}
{Alfred Aho, A Aho, A V Aho}
{Jeffrey Ullman, J. Ullman, J D Ullman}
{Bell Labs, AT&T Bell}, with context words: code generation, grammar, expression tree
{Steve Johnson, S Johnson, S P Johnson}
{Mark Cross, M Cross}
{Chris Walshaw, Chris Walsaw, C Walshaw}
{U. Greenwich, U. of GWich}, with context words: Parallelization, Structured Mesh, Code generation
Sample citation text: "…To find a minimal match cost, dynamic programming, approach of [A Aho and S Johnson, 76], is used.…"]
Regularity in a Cluster Network

Two candidate clusterings of the same references: M (M. Everett, M. G. Everett), J1 (S. Johnson, Stephen C. Johnson), A (A. Aho, Alfred V. Aho), and a second S. Johnson cluster J2.

Clustering 1:
     M  J1   A  J2
M    1   1   0   0
J1   1   1   1   0
A    0   1   1   1
J2   0   0   1   1

Clustering 2:
     M  J1   A  J2
M    1   1   0   0
J1   1   1   0   0
A    0   0   1   1
J2   0   0   1   1

Clustering 1 has better separation of attributes; Clustering 2 has fewer cluster-cluster relations.
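The "fewer cluster-cluster relations" criterion can be computed directly: given a cluster assignment and the co-occurrence relations, count the distinct pairs of clusters connected by some relation. A minimal sketch (the reference names, co-occurrence lists, and cluster labels below are hypothetical):

```python
from itertools import combinations

def cluster_edges(cooccurrences, assignment):
    """Distinct cluster-cluster relations induced by co-occurrence
    relations (e.g. co-author lists) under a given cluster assignment."""
    edges = set()
    for refs in cooccurrences:
        for r1, r2 in combinations(refs, 2):
            c1, c2 = assignment[r1], assignment[r2]
            if c1 != c2:
                edges.add(frozenset((c1, c2)))
    return edges

# Toy version of the slide's example: references M, J1, A, J2
papers = [("M", "J1"), ("A", "J2")]   # hypothetical co-occurrences
clustering1 = {"M": "c1", "J1": "c1", "A": "c2", "J2": "c3"}
clustering2 = {"M": "c1", "J1": "c1", "A": "c2", "J2": "c2"}
print(len(cluster_edges(papers, clustering1)))  # 1
print(len(cluster_edges(papers, clustering2)))  # 0
```

Clustering 2 wins on this criterion because merging the two related references into one cluster removes a cluster-cluster edge.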
Collective Relational Clustering

Goal: Given relations among data items, assign them to clusters such that the relational neighborhoods of clusters have regularities (in addition to attribute similarities within clusters)

Challenges:
o Collective / joint clustering decisions over relational neighborhoods
o Defining regularity in relational neighborhoods
o Searching over relational networks
Talk Outline → Quick Survey of Relational Clustering Approaches
Relational Clustering: Different Approaches

Greedy Agglomerative Algorithms
o Bhattacharya et al '04, Dong et al '05

Information Theoretic Methods
o Mutual Information (Dhillon et al '03)
o Information Bottleneck (Slonim & Tishby '03)
o Bregman Divergence (Merugu et al '04, Merugu et al '06)

Matrix Factorization Techniques
o SVD, BVD (Long et al '05, Long et al '06)

Graph Cuts
o Min Cut, Ratio Cut, Normalized Cut (Dhillon '01)
Relational Clustering: Probabilistic Approaches

Models for Co-clustering
o Taskar et al '01; Hofmann et al '98

Infinite Relational Model (Kemp et al '06)

Mixed Membership Relational Clustering model (Long et al '06)

Topic Model Extensions
o Correlated Topic Models (Blei et al '06)
o Grouped Cluster Model (Bhattacharya et al '06)
o Gaussian Process Topic Models (Agovic & Banerjee '10)

Markov Logic Networks (Kok & Domingos '08)

Model for Mixed Relational Data (Bhattacharya et al '08)
Talk Outline → Probabilistic Model for Structured Relations
Modeling Groups of Entities

Bell Labs Group: Alfred V Aho, Jeffrey D Ullman, Ravi Sethi, Stephen C Johnson

Parallel Processing Research Group: Mark Cross, Chris Walshaw, Kevin McManus, Stephen P Johnson, Martin Everett
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: C. Walshaw, M. Cross, M. G. Everett
P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: A. Aho, S. Johnson, J. Ullman
P6: A. Aho, R. Sethi, J. Ullman
LDA-Group Model

[Plate diagram: Dirichlet priors α and β; a group mixture θ for each of P co-occurrence relations; group label z and entity label a for each of R references r; T group multinomials Φ; A entity multinomials V]

Entity label a and group label z for each reference r
Θ: 'mixture' of groups for each co-occurrence
Φz: multinomial for choosing entity a for each group z
Va: multinomial for choosing reference r from entity a
Dirichlet priors with α and β
LDA-Group Model (illustration)

[The same plate diagram, annotated with an example: group Bell Labs → entity Stephen P Johnson → reference "S. Johnson"; groups generate the document, entities generate names]
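The generative process described above can be sketched as a toy program. This is an illustrative sketch, not the authors' implementation; the dimensions, hyperparameter values, and variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: T groups, A entities, V reference names
T, A, V = 2, 4, 6
alpha, beta = 1.0, 1.0

# Per-group entity distributions Phi_z and per-entity name distributions V_a
Phi = rng.dirichlet(beta * np.ones(A), size=T)   # Phi[z]: entities for group z
Vname = rng.dirichlet(np.ones(V), size=A)        # Vname[a]: names for entity a

def generate_cooccurrence(n_refs):
    """Generate one co-occurrence relation (e.g. an author list)."""
    theta = rng.dirichlet(alpha * np.ones(T))    # group mixture for this relation
    refs = []
    for _ in range(n_refs):
        z = rng.choice(T, p=theta)               # group label
        a = rng.choice(A, p=Phi[z])              # entity label given the group
        r = rng.choice(V, p=Vname[a])            # observed reference name
        refs.append((z, a, r))
    return refs

print(generate_cooccurrence(3))
```

Each co-occurrence draws its own mixture over groups, so references appearing together tend to come from entities of the same group, which is what lets relational evidence disambiguate names.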
Inference Using Gibbs Sampling

Approximate inference with Gibbs sampling
o Find the conditional distribution for any reference given the current groups and entities of all other references
o Sample from the conditional distribution
o Repeat over all references until convergence
When the numbers of groups and entities are known:

P(z_i = t | z_-i, a, r) ∝ (n_{d_i t}^{DT} + α) / (n_{d_i *}^{DT} + T α) × (n_{a_i t}^{AT} + β) / (n_{* t}^{AT} + A β)

P(a_i = a | z, a_-i, r) ∝ (n_{a t}^{AT} + β) / (n_{* t}^{AT} + A β) × Sim(r_i, v_a)

P(a_i = a_new | z, a_-i, r) ∝ β / (n_{* t}^{AT} + A β) × N(r_i, v_{a_new})

Here n_{d t}^{DT} counts references in co-occurrence d assigned to group t, n_{a t}^{AT} counts references of entity a assigned to group t, T and A are the numbers of groups and entities, and Sim(r_i, v_a) is the similarity of reference r_i to the hidden name v_a of entity a.
Hidden name for a new entity equally prefers all observed references
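A single collapsed Gibbs step for the group label, following the first conditional above, might look like this sketch (the count matrices and their values are hypothetical; entity sampling and the similarity term are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_group(d, a, n_DT, n_AT, alpha, beta):
    """Sample a new group label for one reference, given the counts with
    that reference removed: n_DT[d, t] = references in co-occurrence d
    assigned to group t, n_AT[a, t] = references of entity a in group t."""
    T = n_DT.shape[1]
    A = n_AT.shape[0]
    p = ((n_DT[d] + alpha) / (n_DT[d].sum() + T * alpha) *
         (n_AT[a] + beta) / (n_AT.sum(axis=0) + A * beta))
    p /= p.sum()
    return rng.choice(T, p=p)

# Tiny example: 2 groups, 3 entities, 2 co-occurrences (hypothetical counts)
n_DT = np.array([[2, 0], [0, 3]])
n_AT = np.array([[2, 0], [0, 2], [0, 1]])
t = sample_group(d=0, a=0, n_DT=n_DT, n_AT=n_AT, alpha=0.5, beta=0.5)
print(t)
```

The first factor favors groups already popular in the reference's co-occurrence; the second favors groups the reference's current entity already belongs to.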
Non-Parametric Entity Resolution

Number of entities is not a parameter
o Allow the number of entities to grow with the data

For each reference, choose any existing entity, or a new entity a_new
Faster Inference: Split-Merge Sampling

Naïve strategy reassigns data items individually

Alternative: allow clusters to merge or split

For cluster a_i, find conditional probabilities for:
1. Merging with an existing cluster a_j
2. Splitting back to the last merged clusters
3. Remaining unchanged

Sample the next state for a_i from this distribution

O(n g + e) time per iteration compared to O(n g + n e)
ER: Evaluation Datasets

CiteSeer
o 1,504 citations to machine learning papers (Lawrence et al.)
o 2,892 references to 1,165 author entities

arXiv
o 29,555 publications from High Energy Physics (KDD Cup '03)
o 58,515 references to 9,200 authors

Elsevier BioBase
o 156,156 Biology papers (IBM KDD Challenge '05)
o 831,991 author references
o Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
ER: Experimental Evaluation

LDA-ER outperforms the baselines on all datasets
o A: same entity for references with attribute similarity over a threshold
o A*: transitive closure over the decisions in A

Baselines require a threshold as parameter
o Best achievable performance over all thresholds is reported

LDA-ER does not require a similarity threshold

         CiteSeer  ArXiv  BioBase
A        0.980     0.976  0.568
A*       0.990     0.971  0.559
LDA-ER   0.993     0.981  0.645
ER: Trends in Semi-Synthetic Data

Bigger improvement with
o bigger % of ambiguous references
o more references per co-occurrence
o more neighbors per entity

[Plots: F1 of A, A*, and LDA-ER vs (a) percentage of ambiguous attributes, (b) average #references per hyper-edge, (c) average #neighbors per entity]
Talk Outline → Probabilistic Model for Heterogeneous Relations
Entity Resolution over a Document Collection

In a document collection, which names refer to the same entities?
Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases
When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb.
Lucas script seemed funny enough. It was a fairly good movie with couple of laughs. There was not much story but Ford was good.
Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job.
Jointly Modeling the Textual Content

• Words are indicative of the concept entities
• Concept entities are related to person entities
Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases
When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb.
Lucas script seemed funny enough. It was a fairly good movie with couple of laughs. There was not much story but Ford was good.
Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job.
Document words belong to two categories
o References to structured entities
o References to (unstructured) concept entities

Collectively determine clusters for both types of entities

Relational patterns over two types of entities

Simplifications for learning
o Observed domain of entities with structured attributes
o Observed relationships between domain entities and categories for constructing relational neighborhoods
Relational Clustering Over Structured and Unstructured Data

[Plate diagram: for each of N documents, a type t and a central entity e; n structured mentions a generated via attribute columns c; m words w generated from the type]
Generative Model for Documents from Structured Entities

Generate N reviews one by one:
First choose a genre, say Action
Choose an Action movie, say Indiana Jones
Generate n mentions for the movie
o Choose a movie attribute, say Actor
o Get the attribute value, say Harrison Ford
o Generate a mention for the attribute value: Harrison Ford → Ford
Generate m Action words
o adventurer, quest, justice, …

P(t): prior over genres
P(e | t): movies for genre
P(w | t): words for genre
P(c): prior over movie attributes
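The review-generation steps above can be sketched as a toy program. The genre, movie, attribute, and word lists are hypothetical stand-ins for the IMDB tables, and mention noise is modeled as a simple choice among name variants:

```python
import random

random.seed(0)

# Hypothetical toy domain standing in for the structured movie database
genres = {"Action": {"movies": {"Indiana Jones": {"Actor": "Harrison Ford",
                                                  "Writer": "George Lucas"}},
                     "words": ["adventurer", "quest", "justice"]}}
mentions = {"Harrison Ford": ["Harrison Ford", "Ford"]}  # noisy name variants

def generate_review(n_mentions=2, m_words=3):
    genre = random.choice(list(genres))                   # choose a genre t
    movie = random.choice(list(genres[genre]["movies"]))  # choose entity e given t
    attrs = genres[genre]["movies"][movie]
    review = []
    for _ in range(n_mentions):
        col = random.choice(list(attrs))                  # choose attribute c
        value = attrs[col]                                # attribute value of e
        review.append(random.choice(mentions.get(value, [value])))  # mention
    review += random.choices(genres[genre]["words"], k=m_words)     # genre words
    return movie, review

print(generate_review())
```

Because words are drawn per genre and mentions per entity, inverting this process ties document categorization (recovering t) and entity identification (recovering e) together.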
Entity Identification: Evaluation

Movie Reviews
o 12,500 reviews: first 10 reviews for each of the top 50 movies in 25 genres

Structured Movie Database from IMDB
o 26,250 movies: top 1,250 movies from 25 genres + 25,000 others
o Movie table with 7 columns, but no movie name column
o Genre + top 2 actors, actresses, directors, writers

Entity Identification Baseline
o Aggregate similarity over all mentions to score an entity for a document
o Does not use the unstructured words in the document

Document Classification Baseline
o SVM-Light with default parameters
o Uses all words in the document, including structured mentions
Ent-Id: Experimental Results on IMDB

[Plot: document categorization accuracy vs % training data, DC-Base vs JM]

Baseline catches up with the joint model only when 35% of docs are provided for training

Improvement in ent-id accuracy; significant drop in entropy over entity choices

         EI Accuracy  EI Entropy
JM       40.80%       0.67
EI-Base  38.50%       2.359
Ent-Id: Results on Semi-Synthetic Data

[Plots: document categorization accuracy and entity identification accuracy vs genre overlap p0, comparing JM against DC-Base and EI-Base; 80% training data for the baseline, none for JM]

Ent-Id improves from 38% to 60% for medium overlap, and to 70% when words clearly indicate the genre

Joint model outperforms the baseline for large overlap between genres
Future Directions

Handling uncertain relations
o Coupling with information extraction

Modeling the cluster network
o Regularization for networks

Scalable inference mechanisms

Incorporating domain knowledge and user interaction
o Semi-supervision
o Active learning
References

A Agovic and A Banerjee, Gaussian Process Topic Models, UAI 2010
S Kok and P Domingos, Extracting Semantic Networks from Text via Relational Clustering, ECML 2008
I Bhattacharya, S Godbole, and S Joshi, Structured Entity Identification and Document Categorization: Two Tasks with One Joint Model, SIGKDD 2008
I Bhattacharya and L Getoor, Collective Entity Resolution in Relational Data, ACM-TKDD, March 2007
A Banerjee, S Basu, S Merugu, Multi-Way Clustering on Relation Graphs, SIAM SDM 2007
B Long, M Zhang, P S Yu, A Probabilistic Framework for Relational Clustering, SIGKDD 2007
D Zhou, J Huang, B Schoelkopf, Learning with hypergraphs: Clustering, classification, and embedding, NIPS 2007
B Long, M Zhang, X Wu, P S Yu, Spectral Clustering for Multi-type Relational Data, ICML 2006
I Bhattacharya and L Getoor, A Latent Dirichlet Model for Unsupervised Entity Resolution, SIAM SDM 2006
X Dong, A Halevy, J Madhavan, Reference reconciliation in complex information spaces, SIGMOD 2005
I Bhattacharya and L Getoor, Iterative Record Linkage for Cleaning and Integration, SIGMOD-DMKD, 2004
B Taskar, E Segal, D Koller, Probabilistic Classification and Clustering in Relational Data, IJCAI 2001
Backup Slides
P1: “JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: “Partitioning Mapping of Unstructured Meshes to Parallel Machine Topologies”, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: “Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm”, C. Walshaw, M. Cross, M. G. Everett
P4: “Code Generation for Machines with Multiregister Operations”, Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: “Deterministic Parsing of Ambiguous Grammars”, A. Aho, S. Johnson, J. Ullman
P6: “Compilers: Principles, Techniques, and Tools”, A. Aho, R. Sethi, J. Ullman
Entity Resolution From Structured Relations

[Figure: two relational neighborhoods — one Stephen Johnson linked to Alfred Aho, Jeffrey Ullman, Bell Labs, and Prog. Lang.; another Stephen Johnson linked to Mark Cross, Chris Walshaw, Univ of Greenwich, and HPC]
LDA-ER Generative Process: Illustration

For each paper p:
1. Choose θp
2. For each author: sample z from θp, sample a from Φz, sample r from Va

Example for paper P5:
θ_P5 = [ p(G1)=0.1, p(G2)=0.9 ]
Group G1 entities (Φ_G1): Walshaw 0.2, Johnson1 0.2, McManus 0.2, Cross 0.2, Everett 0.2
Group G2 entities (Φ_G2): Ullman 0.3, Aho 0.3, Sethi 0.2, Johnson2 0.2
Sampled: z=G2, a=Aho → r="A.Aho" from V_A; z=G2, a=Ullman → r="J.Ullman" from V_U; z=G2, a=Johnson2 → r="S.Johnson" from V_J2
V_J1 = Stephen P Johnson: S C Johnson 0.04, Stephen C Johnson 0.04, S Johnson 0.90
Generating References from Entities

Entities are not directly observed
1. Hidden attribute for each entity
2. Similarity measure for pairs of attributes

This yields a distribution over attributes for each entity, e.g. for hidden attribute Stephen C Johnson:

S C Johnson  Stephen C Johnson  S Johnson  Alfred Aho  M. Cross
0.2          0.6                0.2        0.0         0.0
ER: Performance for Specific Names

Significantly larger improvements for 'ambiguous names'

Name         Best F1 for A/A*   F1 for LDA-ER
cho_h        0.80               1.00
davis_a      0.67               0.89
kim_s        0.93               0.99
kim_y        0.93               0.99
lee_h        0.88               0.99
lee_j        0.98               1.00
liu_j        0.95               0.97
sarkar_s     0.67               1.00
sato_h       0.82               0.97
sato_t       0.85               1.00
shin_h       0.69               1.00
veselov_a    0.78               1.00
yamamoto_k   0.29               1.00
yang_z       0.77               0.97
zhang_r      0.83               1.00
zhu_z        0.57               1.00
Simplifying the Problem: Entity Identification

Assume a database of entities is available
o IMDB movie database
o DBLP, PubMed paper databases
o Customer databases in companies

  Movie Name                           Actor          Writer        Genre      Rating
1 Indiana Jones and the Last Crusade   Harrison Ford  George Lucas  Adventure  Excellent
2 American Graffiti                    Harrison Ford  George Lucas  Comedy     Average
3 Star Wars: Return of the Jedi        Harrison Ford  George Lucas  Sci-Fi     Excellent
4 Fugitive                             Harrison Ford  David Twohy   Action     Good

• Not enough information to disambiguate
• Noise in entity mentions
Entity Identification: Still Difficult
Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases
When it comes to create a universe George Lucas is undisputed leader. Harrison Ford has done justice and special effects are superb.
Lucas script seemed funny enough. It was a fairly good movie with couple of laughs. There was not much story but Ford was good.
Harrison Ford the adventurer is it in yet another quest. To find his father who is in search of the Holy Grail. George Lucas has done a wonderful job.
[Figure: each review must be matched to one of —
American Graffiti: Harrison Ford, George Lucas
Indiana Jones and the Last Crusade: Harrison Ford, George Lucas
Star Wars: Return of the Jedi: Harrison Ford, George Lucas
Fugitive: Harrison Ford, David Twohy
with the three George Lucas movies ambiguous from the mentions alone]
The Intuition

Categorization and Entity Identification help each other

The classifier predicts additional attributes from the document for use in entity identification
o Classifiers for Genre, Rating, Country of the movie, …

Entity identification creates labeled data for training the classifier
o Reviews tagged with movies labeled with Genre, Rating, etc.
Problem Formulation

  Movie Name                           Actor          Writer        Genre      Rating
1 Indiana Jones and the Last Crusade   Harrison Ford  George Lucas  Adventure  Excellent
2 American Graffiti                    Harrison Ford  George Lucas  Comedy     Average
3 Star Wars: Return of the Jedi        Harrison Ford  George Lucas  Sci-Fi     Excellent
4 Fugitive                             Harrison Ford  David Twohy   Action     Good

Columns C, entities E, and a type column T

"Harrison Ford is a resourceful person who stay out of reach to the marshal. David Toohy has written some interesting plots and chases"

• Unobserved central entity for each document
• Structured mentions derived from column values
• Unstructured words determined by the type value

Problem: Find the central entity for each document and categorize the documents according to type values
Formalizing the Intuition

P(d_i | ·) = P(t_i, e_i, c_i, a_i, w_i | ·) = P(t_i) P(e_i | t_i) P(c_i) P(a_i | e_i, c_i) P(w_i | t_i)

Traditional entity identification only considers structured mentions as evidence
Traditional document categorization only considers words as evidence

Here, words suggest type values, and entities relevant for those types get priority
Mentions suggest entities, and type values relevant for those entities get priority
Unsupervised EM for Inference

Infer the hidden entity and type value from the observed words and references for each document

Initialize posteriors using entity references only

Restrict the assignment space for tractability
Objective Function

Greedy agglomerative clustering step: merge the cluster pair with maximum reduction in objective function value

Cluster-pair similarity combines a weighted attribute term and a weighted relational term (1 iff a relational edge exists between c_i and c_j):

sim(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j)

Minimize:

Σ_(c_i, c_j) w_A · sim_A(c_i, c_j) + w_R · (|N(c_i)| + |N(c_j)|)

where sim_A is the attribute similarity, w_A and w_R weight attributes and relations, and N(c) is the cluster neighborhood of c; merging clusters with common neighborhoods reduces the relational term.
Collective Relational Clustering Algorithm

1. Find similar references using 'blocking'
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into a priority queue
4. Repeat until the priority queue is empty:
5.   Find the 'closest' cluster pair
6.   Stop if similarity is below a threshold
7.   Merge to create a new cluster
8.   Update similarity for 'related' clusters

O(n k log n) algorithm with an efficient implementation
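Steps 1–8 can be sketched with a priority queue of cluster-pair similarities, using the weighted combination w_A·sim_A + w_R·sim_R from the objective-function slide. This is a simplified sketch, not the authors' O(n k log n) implementation: blocking and bootstrapping are omitted, and single-linkage attribute similarity is an assumption:

```python
import heapq

def collective_cluster(refs, sims, relations, w_A=0.5, w_R=0.5, threshold=0.1):
    """Greedy agglomerative sketch: repeatedly merge the closest cluster
    pair until the best combined similarity falls below the threshold.
    sims[(a, b)]: attribute similarity of references; relations: related pairs."""
    clusters = {r: {r} for r in refs}          # each reference starts alone

    def score(ci, cj):
        attr = max(sims.get((a, b), sims.get((b, a), 0.0))
                   for a in clusters[ci] for b in clusters[cj])
        rel = any((a, b) in relations or (b, a) in relations
                  for a in clusters[ci] for b in clusters[cj])
        return w_A * attr + w_R * (1.0 if rel else 0.0)

    heap = [(-score(i, j), i, j) for i in clusters for j in clusters if i < j]
    heapq.heapify(heap)
    while heap:
        neg, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                           # stale entry for a merged cluster
        if -neg < threshold:
            break                              # closest pair below threshold
        clusters[i] |= clusters.pop(j)         # merge into a new cluster
        for k in clusters:                     # update 'related' cluster pairs
            if k != i:
                lo, hi = min(i, k), max(i, k)
                heapq.heappush(heap, (-score(lo, hi), lo, hi))
    return list(clusters.values())

print(collective_cluster(["r1", "r2", "r3"],
                         {("r1", "r2"): 0.9}, {("r2", "r3")}))
```

In this toy run, r2 and r3 merge first through their relation, after which the attribute similarity of r1 to the merged cluster pulls all three together — the influence propagation that flattened clustering cannot do.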