Fall 2020
V. Efthymiou, V. Christophides{vefthym|christop}@csd.uoc.gr
http://www.csd.uoc.gr/~hy562
University of Crete, Fall 2020
Entity Resolution in the Web of Data
Fall 2020
Describing and Linking Entities: Entity-Centric Applications &
Knowledge Bases
Fall 2020
Why “Entities”
2
Fall 2020
Why “Entities”
3
Fall 2020
“Entities” is What a Large Part of Our Knowledge is About
4
Time in Bilbao Bilbao Population Popular schools in Bilbao
Weather in Bilbao
Bilbao places to go
Bilbao photos
Bilbao map
Fall 2020
Core Entities
5
Locations
Movies
Organizations
Persons
Fall 2020
What Is A Knowledge Base (KB)?
Comprehensive, machine-readable descriptions of real-world entities are hosted in knowledge bases (KB)
Entity names, types, attributes, relationships, provenance info
Entities are described as instances of one or several conceptual types and may be linked through relationships
The core Semantic Web data model
6
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
facts
properties/attributes values
Fall 2020
Knowledge Bases: Scope
Domain-specific Knowledge Bases
Focus on a well-defined domain:
IMDB for movies, MusicBrainz for music, GeoNames for geo, CIA World Factbook for demographics, SNOMED CT for medicine, etc.
Cross-domain Knowledge Bases
Cover a variety of knowledge across domains
Concept-centric (Intensional): Cyc, WordNet
Entity-centric (Extensional): DBpedia, Wikidata, YAGO, Freebase, Google’s Knowledge Graph, Microsoft’s Satori (Bing), Knowledge Vault
7
Knowledge Curation & Knowledge Fusion: Challenges, Models, and Applications, X. L. Dong, D. Srivastava, SIGMOD'15 Tutorial
8
Fall 2020
Knowledge Bases: History
1985 1990 2000 2005 2010
Cyc
∀x: human(x) ⇒ (∃y: mother(x,y) ∧ ∃z: father(x,z))
∀x,u,w: mother(x,u) ∧ mother(x,w) ⇒ u = w
WordNet
guitarist ⊂ {player, musician} ⊂ artist
algebraist ⊂ mathematician ⊂ scientist
Wikipedia
4.5 Mio. English articles
20 Mio. contributors
from humans, for humans
from algorithms, for machines
Adapted from Knowledge Bases in the Age of Big Data Analytics, F. Suchanek, G. Weikum, VLDB 2014 Tutorial
Knowledge Graph
9
Fall 2020
The Long Tail of Knowledge
Freebase is large, but still very incomplete:
We need automatic KB construction methods
9
Relation % unknown in Freebase
Profession 68%
Place of birth 71%
Nationality 75%
Education 91%
Spouse 92%
Parents 94%
Kevin Murphy From Big Data to Big Knowledge CIKM 2013 industry talk cikm2013.org/slides/kevin.pdf
Fall 2020
Google’s Knowledge Graph (GKG)
600M nodes (entities)
1.5K entity types
20B edges (triples)
35K properties
10
▪ The Knowledge Graph is used by Google to enhance its search engine's search results with semantic-search information
– GKG is rooted in public KBs such as Freebase, Wikipedia and the CIA World Factbook
– It is also augmented at a much larger scale, aiming at comprehensive breadth and depth of entity descriptions
Fall 2020
Google Knowledge Vault (GKV)
A. Bordes, E. Gabrilovich, Constructing and Mining Web-scale Knowledge Graphs, KDD 2014 Tutorial
11
Evaluates the trustworthiness of sources by the correctness of their factual information
Probabilistic Knowledge Fusion
Fall 2020
Stanley Kubrick
Entity descriptions
Resolved entities
Entity-Centric Processing
lived-in
Entity-centric search
Entity-centric recommendation
Entity-Centric Applications
12
entity facts
spouse
birth place
director
death place
spouse
death place
spouse
studio
entity relationships
Fall 2020
https://finnaarupnielsen.wordpress.com/tag/wikidata/
Citizen Journalism: Occupations of Persons from Panama Papers
Fall 2020
Entity-Centric Information Processing
Automated construction of entity descriptions
Information extraction: extract new entities from web/text
Link prediction: add relationships among entities
Entity integration and resolution
Knowledge base integration: instance & ontology mappings
Entity resolution: merging or splitting similar entities
Entity-centric access interfaces
Augmented search: interpret the meaning of queries using entities and compute answers based on a knowledge base
Entity-based matching: recommend new entities given an entity, a user or a query
Entity-centric summarization: of textual posts in social media
14
Fall 2020
Describing and Linking Entities: Linked Knowledge Bases &
The Web of Data
Fall 2020
The Web of Data
16
Adapted from Chris Bizer, Richard Cyganiak, Tom Heath, http://linkeddata.org/guides-and-tutorials
Web of Data
Conceptual Representation
[Diagram: entities connected by typed links; each entity is identified by URIs, which represent it through data published on the Web in formats such as spreadsheets, HTML, XML, and RDFa (semi-structured, statistical, and triple data)]
A Web of things in the world (aka entities), described by data on the Web
Fall 2020
From Open to Linked Data
17
Fall 2020
The Linked Data Principles
Anyone can publish data on the Web regarding real-world entities by respecting a minimal set of syntactic conventions:
1. Use URIs as names for things
2. Use HTTP URIs so that people and machines can look up those entities
3. When someone looks up a URI, provide a useful description
4. Include links to other URIs, so that they can discover more things
Data becomes self-describing: when applications encounter data described by an unfamiliar vocabulary, they can resolve its URIs and understand the vocabulary terms by their RDFS or OWL definitions
18
Fall 2020
Linking Entities in KBs
19
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
lmdb:director/8476
lmdb:director_name
“Stanley Kubrick”
rdf:type foaf:Person
foaf:made lmdb:film/1894
foaf:made lmdb:film/2014
foaf:made lmdb:film/2685
fbase:m.05ldxl
fbase:film.directedBy lmdb:director/8476
fbase:film_cut/runtime “136”
foaf:name “A Clockwork Orange”
dbpedia:Manhattan
rdfs:label “Manhattan”
rdf:type schema.org:Place
dbo:populationTotal
1,626,159
Fall 2020
From Data Silos to the Web of Data
Thanks to W3C standards we can support:
A uniform and universal access to information
Easy linking and re-use of information in different contexts
Network effects in adding value to information
Data interoperability at the Web scale: connecting data from diverse sources and across domains
Openness: many common things are represented in multiple sources; new sources can be discovered at run-time by following links
20
Fall 2020
The Linked Open Data (LOD) Cloud
21
Media
Government
Geographic
Publications
User-generated
Life sciences
Cross-domain
Social networking
Linguistics
1014 Knowledge Bases, 60B Triples
649 vocabularies
Last updated: 2014-08-30 http://lod-cloud.net/
Fall 2020
Linked Open Vocabularies (LOV)
http://lov.okfn.org/dataset/lov
22
Fall 2020
adapted from Suchanek & Weikum tutorial@SIGMOD 2013
LOD Cloud and KBs
23
Fall 2020
LOD KBs: Entity Statistics
24
685 entity types, 4.58 M entities, 5.83 M triples, 2,795 properties
1.5 K entity types, 46.3 M entities, 2.67 B triples, 4.5 K properties
350 K entity types, 10 M entities, 120 M triples, 100 properties
adapted from Suchanek & Weikum tutorial @ SIGMOD 2013
English version
52 entity types, 503 K entities, 6 M triples
123 entity types, 1.9 B entities, 9.6 B triples, 112 properties
Fall 2020
LOD KBs: Entity Interlinking
25
Only 56.11% of the KBs link to at least one other KB
17.36% of them link to only one other KB (typically DBpedia)
The LOD cloud diagram is sparse
top KBs (wrt. in-degree)
In-degree
DBpedia 207
geonames 141
w3.org 117
Fall 2020
Describing and Linking Entities: Description Quality &
Entity Resolution
Fall 2020
Quality of Entity Descriptions in the Web of Data
Given the open and decentralized nature of the Web, the reliability and usability of entity descriptions need to be constantly improved:
Incompleteness: real-world entities are only partially described in KBs
Redundancy: descriptions of the same real world entities usually overlap in multiple KBs
Inconsistency: real world entities may have conflicting descriptions across KBs
Incorrectness: errors can be propagated from one KB to the other due to manual copying or automated extraction/fusion techniques
29
Fall 2020
Various Forms of KBs Overlapping
Among KBs (inter-overlapping, due to common data sources)
Within the same KB (intra-overlapping, due to data extraction or curation problems); this occurs less often than inter-overlapping
e.g., dbpedia:Robert_Soloway and dbpedia:Spam_King refer to the same individual, Robert Soloway, who was nicknamed the “Spam King”
30
Overlapping descriptions are not identical, even if they stem from the same source
Fall 2020
Entity Resolution (ER)
The problem of identifying descriptions of the same real-world entity
31
Stanley Kubrick
A Clockwork Orange
somehow similar descriptions
highly similar descriptions
director
Data Variety
Fall 2020
From Highly to Somehow Similar
Highly Similar
feature many common tokens in the values of semantically related attributes
heavily interlinked
mostly using owl:sameAs predicates
usually provide complementary descriptions of the same entity in terms of facets, temporal evolution
good for fusing
typically met in central KBs
Somehow Similar
feature significantly fewer common tokens in attributes that are not always semantically related
sparsely interlinked
using various kinds of predicates
usually complementary descriptions of the same entity in terms of domains, aspects
good for linking
typically met in peripheral KBs
32
Fall 2020
How does ER Improve KB Quality?
KB Completeness: linking somehow similar descriptions will increase the coverage of entity facts and relationships
KB Conciseness: merging highly similar descriptions will reduce duplicate entity facts and relationships
KB Consistency: matching similar descriptions will enable the detection of conflicting assertions
KB Correctness: splitting complex descriptions will facilitate entity repair
33
Fall 2020
Web-Scale ER: Challenges
The two core ER problems, namely how to
effectively compute entity similarity
efficiently resolve sets of entities within or across KBs
are challenged by the
large number of KBs (~ hundreds)
large number of entity types & properties (~ thousands)
massive volume of entities (~ millions)
Large-scale, progressive, multi-type, cross-domain ER
Big Data Volume, Velocity and Variety!
34
Fall 2020
Entity Resolution – Formal Definition
Entity resolution: the problem of identifying descriptions of the same entity within or across sources
E = {e1, ..., em} is a set of entity descriptions
M : E × E → {true, false} is a match function
The resolution of entities in E results in a partition P = {p1, ..., pn} of E, such that:
1. ∀ei, ej ∈ E with M(ei, ej) = true, ∃pk ∈ P : ei, ej ∈ pk
(all the matching descriptions are in the same partition)
2. ∀pk ∈ P, ∀ei, ej ∈ pk : M(ei, ej) = true
(each partition contains only matching descriptions)
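Conditions 1 and 2 state that P consists exactly of the equivalence classes induced by M. A minimal sketch (not from the slides; the helper names are illustrative) that computes P from pairwise match decisions with union-find, assuming M behaves as the equivalence relation formalized later:

```python
# A sketch: derive the partition P of E from a pairwise match function M.

def resolve(entities, match):
    """entities: list of hashable descriptions; match: M(ei, ej) -> bool."""
    parent = {e: e for e in entities}

    def find(e):                            # representative of e's partition
        while parent[e] != e:
            parent[e] = parent[parent[e]]   # path halving
            e = parent[e]
        return e

    for i, ei in enumerate(entities):       # union every matching pair
        for ej in entities[i + 1:]:
            if match(ei, ej):
                parent[find(ei)] = find(ej)

    partition = {}
    for e in entities:                      # group descriptions by representative
        partition.setdefault(find(e), []).append(e)
    return list(partition.values())
```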
Fall 2020
Entity Resolution - Example
36
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
lmdb:director/8476
lmdb:director_name
“Stanley Kubrick”
rdf:type foaf:Person
foaf:made lmdb:film/1894
foaf:made lmdb:film/2014
foaf:made lmdb:film/2685
e1
e3
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
fbase:m.05ldxl
fbase:film.directedBy lmdb:director/8476
fbase:film_cut/runtime “136”
foaf:name “A Clockwork Orange”
e2 e4
e5
Imdb:co0041067
imbd:location GeoNames:Buckinghamshire
imbd:filmography “A Clockwork Orange”
Fall 2020
Entity Resolution - Example
Assume as input of entity resolution the set E = {e1, e2, e3, e4, e5}
A possible output P = {{e1, e3}, {e2, e4}, {e5}} indicates that:
e1, e3 refer to the same real-world person, the director Stanley Kubrick
e2, e4 represent a different entity, the movie A Clockwork Orange
e5 represents a third thing, the movie studio Pinewood
37
e1 e3
e2 e4
e5
Fall 2020
Matching Graph-structured Entities
38
dbpedia:Stanley_Kubrick
freebase:Stanley_Kubrick
dbpedia:A_Clockwork_Orange
linkedMDB:A_Clockwork_Orange
dbpedia:Katharina_Kubrick
dbpedia:The_Shining
freebase:The_Shining
freebase:Ruth_Sobotka
directs, director
director, children, director, spouse
e1
e2
e3
e4
e5 e6e7 e8
Fall 2020
Single (Pairwise) Entity Matching Based on Content Similarity
39
e1
birthPlace Manhattan
type Person
type AmericanFilmDirectors
type AmateurChessPlayers
e3
director_name “Stanley Kubrick”
type Person
directs film/1894
directs film/2014
directs film/2685
thresh = 0.5
simc: let the content similarity of two descriptions be the Jaccard similarity of their values’ token sets
simc(e1,e3) = Jaccard({Manhattan, Person, AmericanFilmDirectors, AmateurChessPlayers}, {Stanley, Kubrick, Person, 1894, 2014, 2685}) ≈ 0.1
Matching decisions are independent
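A minimal sketch of simc as defined above, with the slide’s thresh = 0.5; the dict-based description format and the helper names are assumptions for illustration:

```python
# A sketch of simc: Jaccard similarity of the token sets of two
# descriptions' values, plus a threshold-based match decision.

def tokens(description):
    """description: dict of attribute -> value; tokenize all values."""
    return {tok for value in description.values() for tok in str(value).split()}

def sim_c(ei, ej):
    ti, tj = tokens(ei), tokens(ej)
    return len(ti & tj) / len(ti | tj) if ti | tj else 0.0

e1 = {"birthPlace": "Manhattan",
      "type": "Person AmericanFilmDirectors AmateurChessPlayers"}
e3 = {"director_name": "Stanley Kubrick", "type": "Person",
      "directs": "1894 2014 2685"}   # local names only, as tokenized on the slide

print(sim_c(e1, e3))          # 1 common token out of 9: ~0.11, rounded to 0.1
print(sim_c(e1, e3) >= 0.5)   # False: e1 and e3 do not match on content alone
```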
Fall 2020
Collective (Joint) Entity Matching Based on Structural Similarity
40
dbpedia:Stanley_Kubrick
freebase:Stanley_Kubrick
dbpedia:A_Clockwork_Orange
linkedMDB:A_Clockwork_Orange
dbpedia:Katharina_Kubrick
dbpedia:The_Shining
freebase:The_Shining
freebase:Ruth_Sobotka
director, children, director, spouse
e1
e2
e3
e4
e5 e6e7 e8
One matching provides evidence for another
director directs
Fall 2020
Matching Neighborhood
41
e2
director e1
Work/runtime “136”
name “A Clockwork Orange”
e4
film.directedBy e3
film_cut/runtime “136”
name “A Clockwork Orange”
thresh = 0.5
simc(e2,e4) = Jaccard({e1, 136, A, Clockwork, Orange}, {e3, 136, A, Clockwork, Orange}) ≈ 0.66
Fall 2020
Similarity Propagation
42
dbpedia:Stanley_Kubrick
freebase:Stanley_Kubrick
dbpedia:A_Clockwork_Orange
linkedMDB:A_Clockwork_Orange
dbpedia:Katharina_Kubrick
dbpedia:The_Shining
freebase:The_Shining
freebase:Ruth_Sobotka
directs, director
director, children, director, spouse
e1
e2
e3
e4
e5 e6e7 e8
Fall 2020
Forms of ER
43
sameAs
KB1 KB2
KB1 KB2
Record Deduplication: ER with results merging
exploits transitivity of matches
Record Linkage: ER without results merging
exploits exclusivity of matches
Fall 2020
High similarity in structure
Low similarity in structure
High similarity in content
Low similarity in content
ER Forms & Similarity
44
string sim. in the values of specific atts from one relation
set sim. in the values of specific atts from two relations
att & value sim. in a network of relations
Fall 2020
ER Settings
Two kinds of entity collections as input:
Clean: redundancy-free
Dirty: contains redundant entity descriptions
Given two entity collections as input, an ER task can be:
Clean-Clean ER: given two clean, but overlapping entity collections, identify the common entity descriptions
a.k.a. the record linkage problem in databases
Dirty-Dirty ER: identify the unique entity descriptions contained in the union of two dirty input entity collections
a.k.a. the deduplication problem in databases
Dirty-Clean Entity Resolution
45
Fall 2020
What We Will Cover Next
Describing and Linking Entities
Entity-Centric Applications & Knowledge Bases
Linked Knowledge Bases & The Web of Data
Description Quality & Entity Resolution
Matching and Resolving Entities (I)
Entity Similarity (Content & Context)
Blocking Techniques (Token, Attribute & URI)
Block Post-Processing
Matching and Resolving Entities (II)
Iterative Resolution Techniques (Merging & Matching)
Progressive Resolution Techniques
Conclusions & Open Issues
46
Fall 2020
Matching and Resolving Entities (I): Entity Similarity
Fall 2020
Similar or Dissimilar?
Fall 2020
Shape Color
Size Pattern
Similarity is Domain Specific
Fall 2020
Entity Resolution - Match
Matches: sets of entity descriptions that refer to the same real-world entity; intuitively:
Matching descriptions are placed in the same partition
All the descriptions of the same partition match
A match function M() maps each pair of entity descriptions (ei, ej) to {true, false}, i.e., match or no match
M(ei, ej) = true => ei, ej are matches
M(ei, ej) = false => ei, ej are non-matches
53
Fall 2020
Match Function: Formal Properties
The match function M() introduces an equivalence relation (owl:sameAs) among entity descriptions
Reflexivity: ∀ei ∈ E, M(ei, ei) = true
Symmetry: ∀ei, ej ∈ E, M(ei, ej) = M(ej, ei)
Transitivity: ∀ei, ej, ek ∈ E, if M(ei, ej) = true and M(ej, ek) = true, then M(ei, ek) = true
Additional properties are useful when linking & repairing entities
Exclusivity (1-1 assumption): ∀ei, ek ∈ E (ek ≠ ei) and ∀ej ∈ E’, if M(ei, ej) = true, then M(ek, ej) = false
54
Fall 2020
Entity Resolution - Similarity
In practice, the match function is defined via a similarity function sim(), measuring how similar two entity descriptions are to each other, according to certain comparison criteria
Given a similarity threshold θ:
M(ei, ej) = true, if sim(ei, ej) ≥ θ
M(ei, ej) = false, otherwise
ML techniques for automatically learning similarity measures are challenged by Web-scale entity resolution [Köpcke et al. 2010]
adaptive learning techniques require training data for each domain [Tejada et al. 2002] [Bilenko et al. 2003]
active learning techniques (threshold-based Boolean functions or linear classifiers) work well with highly similar descriptions [Arasu et al. 2010]
55
Fall 2020
Entity Similarity: Example
although not identical, e2 and e4 are highly similar
57
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
fbase:m.05ldxl
fbase:film.directedBy lmdb:director/8476
fbase:film_cut/runtime “136”
foaf:name “A Clockwork Orange”
e2 e4
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
lmdb:director/8476
lmdb:director_name
“Stanley Kubrick”
rdf:type foaf:Person
foaf:made lmdb:film/1894
foaf:made lmdb:film/2014
foaf:made lmdb:film/2685
e1
e3
e1 and e3 are at best somehow similar
Fall 2020
The Role of Similarity Functions: A Pragmatic Case
Missed matching pairs of entity descriptions
True matching pairs of entity
descriptions
Pairs of entity descriptions satisfying a
pragmatic similarity function
Set of all pairs of entity
descriptions
Matching pairs of entity
descriptions
Fall 2020
The Role of Similarity Functions: The Ideal Case
Set of all pairs of entity
descriptions
Matching pairs of entity
descriptions
Pairs of entity descriptions
satisfying an idealsimilarity function
Fall 2020
Similarity Metric: Formal Description
A similarity function sim(): E × E → R is a metric if, for every ei, ej, el ∈ E, it satisfies the following conditions:
sim(ei, ej) ≥ 0 (non-negativity)
sim(ei, ei) ≥ sim(ei, ej) (self-similarity is maximal)
sim(ei, ej) = sim(ej, ei) (symmetry)
sim(ei, ej) = sim(ei, ei) = sim(ej, ej) ⟺ ei = ej (identity)
sim(ei, ej) + sim(ej, el) ≤ sim(ei, el) + sim(ej, ej) (triangle inequality)
Normalized similarity: sim(ei, ej) ∈ [0, 1]
sim(ei, ei) = 1 for an exact match
sim(ei, ej) = 0 for “completely” different ei and ej
0 < sim(ei, ej) < 1 for some approximate similarity
Any similarity metric can be transformed to a distance metric and vice versa:
sim(ei, ej) = 1 - dist(ei, ej)
60
Fall 2020
Euclidean vs Non-Euclidean Distances
A Euclidean space has some number of real-valued dimensions and dense points; there is a notion of the average of two points
A Euclidean distance is based on the locations of points in such a space (L2-norm, L1-norm)
A non-Euclidean distance is based on properties of points, but not on their “location” in space: Jaccard, Cosine, Edit, Hamming distance
61
http://dataconomy.com/implementing-the-five-most-popular-similarity-measures-in-python/
Fall 2020
Matching and Resolving Entities (I): Content-based Entity Similarity
Fall 2020
In Search of Entity Similarity Measures
Defining similarity functions that satisfy the formal properties of metric spaces is, in practice, too restrictive for non-geometric models
Two main families of similarity measures for resolving entity descriptions in the Web of data:
Content-based: mostly for measuring the string similarity of attribute values in pairs of entity descriptions (character-based, token-based)
Context-based: exploit the similarity of neighboring descriptions via different entity relationships (tree-based, graph-based)
63
Fall 2020
String Similarity Measures
Record Linkage: Similarity Measures and Algorithms, N. Koudas, S. Sarawagi, D. Srivastava, SIGMOD'06 Tutorial
Dice
Phonetic
Domain dependent (rules, data types)
Good for text-intensive descriptions
Good for names
Good for abbreviations, nicknames
64
Fall 2020
Tokenizing Strings: Separators
Forming words from a sequence of characters: surprisingly complex in English, and can be harder in other languages
General idea: separate the string into tokens using some separator
Space, hyphen, punctuation, special characters
Usually also convert to lowercase
Problems:
Sometimes hyphens should be considered either as part of the word or as a word separator: e.g., Spanish-speaking
Apostrophes can be a part of a word, a part of a possessive, or just a mistake: e.g., master's degree
Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations: e.g., F.M.I
65
Similarity measures, Felix Naumann, 11.6.2013
Fall 2020
Tokenizing Strings: n-grams
Split the string into short substrings of length n, using a sliding window over the string
n=2: bigrams
n=3: trigrams
Variation: pad with n - 1 special characters, emphasizing the beginning and end of the string
Variation: include positional information to weight similarities
Number of n-grams = |x| - n + 1
Count how many n-grams are common to both strings
66
Similarity measures, Felix Naumann, 11.6.2013
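A minimal sketch of n-gram tokenization with the padding variation (function names are illustrative):

```python
# A sketch: n-grams over a sliding window, padded with n-1 special
# characters to emphasize the beginning and end of the string.

def ngrams(s, n=3, pad="#"):
    s = pad * (n - 1) + s.lower() + pad * (n - 1)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def common_ngrams(a, b, n=3):
    """Count how many n-grams are common to both strings (set semantics)."""
    return len(set(ngrams(a, n)) & set(ngrams(b, n)))

print(ngrams("Troy", n=2))             # ['#t', 'tr', 'ro', 'oy', 'y#']
print(common_ngrams("Troy", "Troja"))  # 3: '##t', '#tr', 'tro'
```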
Fall 2020
Token-based Entity Similarity
67
name Eiffel Tower
architect Sauvestre
year 1889
location Paris
name Statue of
Liberty
architect Bartholdi Eiffel
year 1886
located NY
about Lady liberty
architect Eiffel
location NY
about Eiffel Tower
architect Sauvestre
year 1889
located Paris
name White Tower
location Thessaloniki
year-
constructed
1450
e1e2
e3
e4
e5
Jaccard(e1,e3) = 1/8, Jaccard(e1,e4) = 1, Jaccard(e1,e5) = 1/8
Jaccard(e2,e3) = 3/7, Jaccard(e2,e4) = 1/11, Jaccard(e2,e5) = 0/11
Jaccard(tokens(ei), tokens(ej)) = |tokens(ei) ∩ tokens(ej)| / |tokens(ei) ∪ tokens(ej)|
Fall 2020
Matching and Resolving Entities (I): Context-based Entity Similarity
Fall 2020
Similarity of Hierarchical Data
69
▪ Relational star / snowflake schema [Ananthakrishna et al. 2002]:
Casting: ID, Actor, Film
S1, Al Pacino, F1
S2, Al Pacino, F2
S3, Marlon Brando, F2
Movie: ID, Name, Year, Rating
F1, The Godfather, 1972, 9.2
F2, Gottvatter, The, 72
▪ Hierarchical XML data [Calado et al. 2010]:
m1 (movie): title “Troy”, year 2004, actors: Brad Pitt, Eric Bana (single branch)
m2 (movie): title “Troja”, year 04, actors: Brad Pit, Erik Bana, Brian Cox (multiple branches)
Fall 2020
DELPHI Containment Metric [Ananthakrishna et al. 2002]
Hybrid measure considering both the similarity of attribute values (tcm) and the similarity of children sets reached by foreign keys (fkcm)
Similarity of attribute values:
Divide tuples into tokens → token sets TS
Compute the edit distance between the tokens of different sets
Determine the weight of each token via inverse document frequency (IDF)
Thus the IDF of a rare term is high, whereas the IDF of a frequent term is likely to be low
The token containment metric tcm measures which fraction of one tuple T is covered by another tuple T’ in a relation of interest Ri
70
Fall 2020
DELPHI Containment Metric [Ananthakrishna et al. 2002]
Similarity of children sets:
The children set CS of a tuple T includes all tuples of a relation Rj referencing T from a relation Ri by means of a foreign key
The foreign-key containment metric (fkcm) measures to what extent the children set of a tuple T is covered by the children set of a tuple T’
71
Fall 2020
DELPHI Containment Metric [Ananthakrishna et al. 2002]
Combining tcm and fkcm:
Both tcm and fkcm are assigned an IDF weight
Use a classification function: pos(x) = 1 if x > 0, -1 otherwise
Threshold for tcm: s1
Threshold for fkcm: s2
Classify a pairwise comparison between T and T’ using pos( w1·pos(tcm(T,T’) - s1) + w2·pos(fkcm(T,T’) - s2) )
• If the final result equals 1, then match, otherwise non-match
72
Fall 2020
DELPHI Containment Metric: Example
73
1. Token sets:
TS(F1) = {The, Godfather, 1972, 9.2}
TS(F2) = {Gottvatter, The, 72}
2. Attribute value similarities (via edit distance):
The ≈ The, Godfather ≈ Gottvatter, 1972 ≈ 72
3. Weights: for simplification, we assume all tokens have equal weight
4. Token containment metric:
tcm(F1,F2) = 3/4, tcm(F2,F1) = 1
5. Children co-occurrence:
fkcm(F1,F2) = 1, fkcm(F2,F1) = 1/2
6. Combination of both metrics (s1 = s2 = 0.5, weights = 1):
pos( pos(3/4 - 0.5) + pos(1 - 0.5) ) = 1
→ F1 and F2 match
Casting: ID, Actor, Film
S1, Al Pacino, F1
S2, Al Pacino, F2
S3, Marlon Brando, F2
Movie: ID, Name, Year, Rating
F1, The Godfather, 1972, 9.2
F2, Gottvatter, The, 72
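A minimal sketch of the classification step, reproducing the numbers of the example above (helper names are illustrative):

```python
# A sketch: combine tcm and fkcm with thresholds s1, s2 and weights w1, w2.

def pos(x):
    return 1 if x > 0 else -1

def delphi_match(tcm, fkcm, s1=0.5, s2=0.5, w1=1.0, w2=1.0):
    """Match iff pos(w1*pos(tcm - s1) + w2*pos(fkcm - s2)) equals 1."""
    return pos(w1 * pos(tcm - s1) + w2 * pos(fkcm - s2)) == 1

print(delphi_match(3 / 4, 1))   # True -> F1 and F2 match, as in step 6
```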
Fall 2020
Graph-Context Similarity
Assign a score s_xy to each pair of entities x and y: two descriptions are similar if their structural neighborhoods are similar (hiding attribute values)
Local similarity indices:
Common neighbors (CN): s_xy = |N(x) ∩ N(y)|, where N(x) is the set of neighbors of x
Jaccard, i.e., normalized common neighbors (NCN): s_xy = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|
Adamic-Adar (AA): s_xy = Σ_{z ∈ N(x) ∩ N(y)} 1/log(k_z), where k_z is the degree of node z; 1/log(k_z) counts the uniqueness of z, which is inversely proportional to its number of neighbors
74
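A minimal sketch of the three local indices over a neighbor map N (node -> set of neighbors); the data is illustrative:

```python
# A sketch of the local similarity indices; math.log is the natural log.
import math

def cn(N, x, y):                     # Common Neighbors
    return len(N[x] & N[y])

def ncn(N, x, y):                    # Jaccard: Normalized Common Neighbors
    union = N[x] | N[y]
    return len(N[x] & N[y]) / len(union) if union else 0.0

def aa(N, x, y):                     # Adamic-Adar: rare neighbors count more
    return sum(1.0 / math.log(len(N[z]))
               for z in N[x] & N[y] if len(N[z]) > 1)

N = {"x": {"u", "v"}, "y": {"u", "v", "w"},
     "u": {"x", "y"}, "v": {"x", "y"}, "w": {"y"}}
print(cn(N, "x", "y"), ncn(N, "x", "y"), aa(N, "x", "y"))  # 2, 2/3, ~2.885
```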
Fall 2020
Content & Structure Similarity: SiGMa [Lacoste-Julien et al 2013]
Even though entities i and j may have no tokens in common, the fact that several of their respective neighbors are matched together is strong evidence that i and j should be matched together. Use neighbors for scoring and suggesting candidate pairs.
78
Fall 2020
SiGMa Neighbors Similarity
Compatible neighbors Nij: a neighbor k of i being matched to a compatible neighbor l of j should encourage i to be matched to j
Nij = {(k, l) : (i, r, k) ∈ KB1 and (j, s, l) ∈ KB2 and relationship r is matched to s}
79
Fall 2020
In Search of Entity Similarity Measures
For highly similar entities, content similarity (i.e., their attribute values) is sufficient
For somehow similar entities, we also need to consider the similarity of the structured context of entities, in an iterative way; identifying the most discriminating attributes and relationships is helpful
An orthogonal issue is the schematic discrepancy of the attributes and relationships employed in the entity descriptions whose hybrid similarity is assessed
Simple: either ontological commitments exist or schematic mappings are provided by the users
Complex: assess the similarity of attributes and relationships based on the similarity of their names or values
81
Fall 2020
Matching and Resolving Entities (I): Blocking Techniques
Fall 2020
Typical ER Workflow
Good for high Volume
83
Blocking: group similar enough entity descriptions; O(n) if hash-based
Matching: recognize pairs of similar entity descriptions; O(n2) over n entity descriptions, or O(b·d2) for b blocks of d descriptions each (on average) after blocking
Blocking reduces the comparisons leading to non-matching entities
Experiment over 9M entity descriptions in a cluster of 15 VMs:
ER workflow without blocking: >200 hrs
ER workflow with blocking: 11 hrs
Fall 2020
Token Blocking [Papadakis et al. 2011]
Assume two clean sets KB1, KB2 of entity descriptions, free of intra-overlapping (Clean-Clean ER)
Each distinct token ti in the values of entity descriptions in KB1 ∪ KB2 corresponds to a block
Each block contains all entity descriptions sharing the corresponding token
Pairs originating from the same (clean) KB are not compared
Token blocking offers a brute-force method for comparing descriptions even if they are highly heterogeneous
The same pair of descriptions is contained in many blocks (redundant comparisons)
Many dissimilar pairs are put in the same block (unnecessary comparisons)
84
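A minimal sketch of token blocking for Clean-Clean ER (the dict-based KB format and helper names are assumptions):

```python
# A sketch: an inverted index from each distinct value token to the
# descriptions containing it; pairs from the same clean KB are not compared.
from collections import defaultdict

def token_blocks(kb1, kb2):
    """kb1, kb2: dicts of entity id -> description (attribute -> value)."""
    blocks = defaultdict(lambda: (set(), set()))
    for side, kb in enumerate((kb1, kb2)):
        for eid, desc in kb.items():
            for value in desc.values():
                for tok in str(value).split():
                    blocks[tok][side].add(eid)
    # keep only blocks that yield at least one cross-KB comparison
    return {t: b for t, b in blocks.items() if b[0] and b[1]}

def candidate_pairs(blocks):
    """The same pair can appear in many blocks: redundant comparisons."""
    return {(i, j) for left, right in blocks.values() for i in left for j in right}
```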
Fall 2020
Token Blocking Example
Redundant comparisons for (e1, e4), contained in 5 different blocks!
85
name Eiffel Tower
architect Sauvestre
year 1889
location Paris
name Statue of
Liberty
architect Bartholdi Eiffel
year 1886
located NY
about Lady liberty
architect Eiffel
location NY
about Eiffel Tower
architect Sauvestre
year 1889
located Paris
name White Tower
location Thessaloniki
year-
constructed
1450
Eiffel
e1, e2,
e3, e4
Tower
e1, e4,
e5
Statue
e2
Liberty
e2, e3
White
e5
1889
e1, e4
Bartholdi
e2
NY
e2, e3
Sauvestre
e1, e4
Paris
e1, e4
1886
e2
1450
e5
Lady
e3
Thessaloniki
e5
Generated
Blocks
e1
e2
e3
e4
e5
Actually, an inverted index
Fall 2020
Token Blocking Example
Unnecessary comparisons between (e1, e3), (e2, e4), (e1, e5)
86
name Eiffel Tower
architect Sauvestre
year 1889
location Paris
name Statue of
Liberty
architect Bartholdi Eiffel
year 1886
located NY
about Lady liberty
architect Eiffel
location NY
about Eiffel Tower
architect Sauvestre
year 1889
located Paris
name White Tower
location Thessaloniki
year-
constructed
1450
Eiffel
e1, e2,
e3, e4
Tower
e1, e4,
e5
Liberty
e2, e3
1889
e1, e4
NY
e2, e3
Sauvestre
e1, e4
Paris
e1, e4
Generated
Blocks
e1
e2
e3
e4
e5
Actually, an inverted index
Fall 2020
Attribute Clustering Blocking [Papadakis et al. 2013]
Token blocking totally ignores the semantics of attributes
When attribute mappings are not known, attribute clustering considers the similarity of attributes
Attribute similarity is computed w.r.t. the string similarities of their values
Two main steps (sketched below):
Similar attributes are placed together in non-overlapping clusters
Token blocking is performed on the descriptions of each cluster
87
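A minimal sketch of the two steps (the value-based attribute similarity is Jaccard over token sets; the simple one-directional linking is an assumption for illustration):

```python
# A sketch: (1) link each KB1 attribute to its most similar KB2 attribute,
# (2) take connected components of the links as attribute clusters; token
# blocking is then run inside each cluster.
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def attribute_clusters(kb1, kb2):
    vals = defaultdict(set)                      # attribute -> value tokens
    for kb in (kb1, kb2):
        for desc in kb.values():
            for attr, value in desc.items():
                vals[attr] |= set(str(value).split())
    attrs1 = {a for d in kb1.values() for a in d}
    attrs2 = {a for d in kb2.values() for a in d}
    links = []
    for a1 in attrs1:                            # most similar KB2 attribute
        best = max(attrs2, key=lambda a2: jaccard(vals[a1], vals[a2]))
        if jaccard(vals[a1], vals[best]) > 0:
            links.append((a1, best))
    clusters = []                                # connected components
    for a, b in links:
        hit = [c for c in clusters if a in c or b in c]
        clusters = [c for c in clusters if c not in hit] + \
                   [set().union({a, b}, *hit)]
    return clusters
```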
Fall 2020
Attribute Clustering Blocking: Example
Finding the attribute of KB2 that is the most similar to “about” of KB1, whose values are {Eiffel, Tower, Statue, Liberty, Auguste, Bartholdi, Joan}, compared to (with Jaccard similarity):
values of work: {Lady, Liberty, Eiffel, Tower, Bartholdi, Fountain} → Jaccard = 4/9
values of artist: {Bartholdi} → Jaccard = 1/8
values of location: {NY, Paris, Washington, D.C.} → Jaccard = 0
values of year-constructed: {1889, 1876} → Jaccard = 0
89
about Eiffel Tower
architect Sauvestre
year 1889
located Paris e11
about Statue of Liberty
architect Bartholdi Eiffel
year 1886
located NY e12
about AugusteBartholdi
born 1834 e13
about Joan Tower
born 1938e14
work Lady Liberty
artist Bartholdi
location NYe15
work Eiffel Tower
year-constructed
1889
location Paris
e16
work Bartholdi Fountain
year-constructed
1876
location Washington D.C.
e17
Fall 2020
Attribute Clustering Blocking: Example
90
about Eiffel Tower
architect Sauvestre
year 1889
located Paris e11
about Statue of Liberty
architect Bartholdi Eiffel
year 1886
located NY e12
about AugusteBartholdi
born 1834 e13
about Joan Tower
born 1938e14
work Lady Liberty
artist Bartholdi
location NYe15
work Eiffel Tower
year-constructed
1889
location Paris
e16
work Bartholdi Fountain
year-constructed
1876
location Washington D.C.
e17
KB1: about, architect, year, born, located
KB2: work, artist, year-constructed, location
Fall 2020
Attribute Clustering Blocking: Example
Compute the transitive closure of the generated attribute pairs; connected attributes form clusters
Example: pairs (about, work), (work, about), (artist, architect), (architect, work)
91
KB1: about, architect, year, born, located
KB2: work, artist, year-constructed, location
C1: about, work, architect, artist
C2: year, year-constructed, born
C3: location, located
Fall 2020
Token Blocking in Each Attribute Cluster
C3.NY
e12, e15
92
about Eiffel Tower
architect Sauvestre
year 1889
located Paris e11
about Statue of Liberty
architect Bartholdi Eiffel
year 1886
located NY e12
about AugusteBartholdi
born 1834 e13
about Joan Tower
born 1938e14
work Lady Liberty
artist Bartholdi
location NYe15
work Eiffel Tower
year-constructed
1889
location Paris
e16
work Bartholdi Fountain
year-constructed
1876
location Washington D.C.
e17
C1: about, work, architect, artist
C2: year, year-constructed, born
C3: location, located
→ compare Lady Liberty to Auguste Bartholdi
C1.Tower
e11, e14, e16
C1.Bartholdi
e12, e13, e15, e17
Fall 2020
Other Blocking Techniques
Similarity join algorithms: rely on an inverted index built from the most discriminating (non-frequent) tokens of the descriptions and the prefix-filtering principle [Chaudhuri et al., 2006] (sketched below)
prefix filtering is effective only when the similarity threshold is extremely high
Frequent itemsets blocking: builds blocks for sets of tokens that frequently co-occur in entity descriptions
may significantly reduce the number of candidate pairs
may also significantly increase missed matches, especially between descriptions with few common tokens
Multidimensional blocking: constructs a collection of blocks for each similarity function used to resolve entities and aggregates them into a single multidimensional collection, taking into account the similarities of descriptions that share blocks
96
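A minimal sketch of the prefix-filtering principle for a Jaccard threshold t (the global token order is assumed to be given as a frequency map):

```python
# A sketch: two token sets can only reach Jaccard similarity t if their
# prefixes of length |x| - ceil(t*|x|) + 1, taken in a global order of
# ascending token frequency, share at least one token.
import math

def prefix(toks, t, freq):
    ordered = sorted(toks, key=lambda tok: freq[tok])   # rarest tokens first
    k = len(ordered) - math.ceil(t * len(ordered)) + 1
    return set(ordered[:k])

def may_match(x, y, t, freq):
    """Cheap filter: False guarantees Jaccard(x, y) < t."""
    return bool(prefix(x, t, freq) & prefix(y, t, freq))
```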
Fall 2020
Placing Entities in the Same Block
97
Method: Criterion
Token Blocking [Papadakis et al., 2011]: the descriptions have a common token in their values
Attribute Clustering Blocking [Papadakis et al., 2013]: the descriptions have a common token in the values of attributes that have similar values overall
Prefix-Infix(-Suffix) [Papadakis et al., 2012]: the descriptions have a common token in their literal values, or a common URI infix
ppjoin+ [Vernica et al., 2010]: the descriptions have a common token in their p first tokens (sorted in ascending frequency order)
Frequent itemsets [Kenig and Gal, 2013]: the descriptions have frequently co-occurring tokens in their values
Fall 2020
Blocking Algorithms Comparison
98
Methods compared (the original table classifies each as partitioning vs. overlapping, hash-based vs. sort-based, schema-agnostic or not, with a positive/negative/neutral assessment; the markers are not preserved in the transcript):
Standard blocking [Fellegi & Sunter 69, Kolb et al., 12b]
Q-grams [Gravano et al., 01]
Suffixes [Aizawa & Oyama, 05]
Sorted neighborhood [Hernàndez & Stolfo 95, Kolb et al., 12a]
Adaptive sorted neighborhood [Yan et al., 07]
Canopy clustering [McCallum et al., 00]
Token blocking [Papadakis et al., 11]
Attribute clustering [Papadakis et al., 13]
Prefix-infix(-suffix) [Papadakis et al., 12]
ppjoin+ [Vernica et al., 10]
Frequent itemsets [Kenig & Gal, 13]
Fall 2020
Matching and Resolving Entities (I): Block Post-Processing
Fall 2020
Meta-blocking: Improving the Efficiency of Blocking
101
Eliminate redundant comparisons & reduce unnecessary comparisons
Blocking → Meta-blocking → Matching
Fall 2020
Meta-blocking – The Blocking Graph
102
Eiffel
e1, e4, e2, e3
Tower
e1, e4, e5
Liberty
e2, e3
Paris
e1, e4
NY
e2, e3
1889
e1, e4
e1
e5
e2
e4
e3
1
3
11
1
5
1
1
edge weights = #common blocks
e1
e5
e2
e4
e3
14 comparisons to identify 2 matches e1-e4 and e2-e3
2 comparisons to identify 2 matches
Blocks (ToB):
Blocking graph (Nodes: entity descriptions, Edges: common block)
Pruned blocking graph (discard edges with weight below avg.: 1.75)
Prune edges to discard unnecessary comparisons between non-matches based on positive overlapping evidence
Sauvestre
e1, e4
Fall 2020
Edge Weighting and Pruning
Weighting schemes (how to weight the edges):
Common Blocks (CBS): wi,j = |Bi,j|
Jaccard (JS): wi,j = |Bi,j| / (|Bi| + |Bj| - |Bi,j|)
Enhanced CBS (ECBS): wi,j = CBS · log(|B|/|Bi|) · log(|B|/|Bj|)
Pruning methods (which edges to prune):
WEP: keep edges with weight above the average
CEP: keep the top-K edges overall
WNP: keep, for each node, the edges with weight above a local average
CNP: keep, for each node, its top-K edges
103
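A minimal sketch of meta-blocking with the CBS weighting scheme and WEP pruning, on the blocks of the example above (helper names are illustrative):

```python
# A sketch: weigh each entity pair by its number of common blocks (CBS),
# then keep only the edges whose weight is above the average (WEP).
from collections import defaultdict
from itertools import combinations

def wep(blocks):
    """blocks: dict of block key -> set of entity ids."""
    cbs = defaultdict(int)
    for entities in blocks.values():
        for pair in combinations(sorted(entities), 2):
            cbs[pair] += 1                       # w_ij = |B_ij|
    avg = sum(cbs.values()) / len(cbs)           # 1.75 in the example
    return {pair for pair, w in cbs.items() if w > avg}

blocks = {"Eiffel": {"e1", "e2", "e3", "e4"}, "Tower": {"e1", "e4", "e5"},
          "Liberty": {"e2", "e3"}, "1889": {"e1", "e4"}, "NY": {"e2", "e3"},
          "Sauvestre": {"e1", "e4"}, "Paris": {"e1", "e4"}}
print(wep(blocks))   # {('e1', 'e4'), ('e2', 'e3')}: 2 comparisons, 2 matches
```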
Fall 2020
Matching and Resolving Entities (II): Iterative Resolution Techniques
Fall 2020
Central vs. Peripheral KBs
105
Fall 2020
In Search of Similarity Evidence in KBs [Efthymiou et al. 2016, 2015]
Attribute-based comparisons: unique attributes (e.g., rdfs:label) provide strong evidence
>90% of matching pairs have >80% overlap similarity in the values of rdfs:label
Content-based comparisons:
central KBs: 3-4 common tokens in entity values
peripheral KBs: 1-2 common tokens in entity values
blocking algorithms miss up to 30% of matches in peripheral KBs
Relationship-based comparisons: matching neighbors provide positive evidence
>92% of pairs with at least one matching neighbor are matches in most KBs
some types of relationships provide strong negative evidence
dissimilar values for wasBornIn indicate a non-matching pair
106
Fall 2020
Types of Missed Matches (FNs)
107
▪ Type A: missed due to a third, matching description (transitivity); applicable to identify matches within a KB
▪ Type B: revealed by matches of their neighbors; can identify matches both within a KB and across different KBs
directs
is directed
Fall 2020
Iterative ER
108
Blocking → Matching
Iterative ER: identify new matches based on partial results, either of matches or of merges
Increases the number of matching entities
▪ Good for high Variety
Fall 2020
Iterative ER Approaches
Merging-based: new matches can be found by exploiting the merged (more complete) descriptions of previously identified matches
Idea: ER resembles a database self-join operation (of the initial set of descriptions with itself)
But: no knowledge about which descriptions may match, so all pairs of descriptions need to be compared
Matching-based: if descriptions related to entity ei match descriptions related to ej, then ei and ej are likely to match
Idea: ER resembles a graph traversal problem in which similarity is propagated until a fixed point is reached
Use positive or negative evidence to prioritize similarity re-computation
109
Fall 2020
Matching and Resolving Entities (II): Merging-based Iterative Resolution
Fall 2020
Merging-based ER– Formal Definition
Let E = {e1, ..., em} be a set of entity descriptions,
M : E × E → {true, false} be a match function, and
⊕ : E × E → E be a partial merge function
The merging-based (generic) resolution of entities in E is a set of descriptions E’, such that:
1. ∀ei, ej ∈ E with M(ei, ej) = true, ∃ek ∈ E’: ⊕(ei, ej) ≤ ek
2. ∀el ∈ E, ∃ek ∈ E’: el ≤ ek
3. no strict subset of E’ satisfies conditions 1 and 2
where el ≤ ek means that ek is at least as informative as el, regarding the same real-world entity
Note: E’ ⊆ Ē (the merge closure of E), i.e., E’ cannot produce more than E
Conditions 1 and 2: E’ produces at least all the information of E
Condition 3: minimal solution
111
Fall 2020
Match and Merge Functions: ICAR Properties [Benjelloun et al., 2009]
Idempotence: a description always matches itself, and merging it with itself yields the same description: ∀ei ∈ E, if M(ei, ei) = true then ⊕(ei, ei) = ei
Commutativity: the direction of match and merge is irrelevant: ∀ei, ej ∈ E, M(ei, ej) = M(ej, ei), and if they match then ⊕(ei, ej) = ⊕(ej, ei)
Associativity: the order of merges is irrelevant: ∀ei, ej, ek ∈ E, if ⊕(⊕(ei, ej), ek) and ⊕(ei, ⊕(ej, ek)) exist, then ⊕(⊕(ei, ej), ek) = ⊕(ei, ⊕(ej, ek))
Representativity: merging does not lose matches; no “negative evidence”: if ek = ⊕(ei, ej), then ∀el ∈ E such that M(ei, el) = true, also M(ek, el) = true
Transitivity is not assumed!
112
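A small sketch of the union class of match and merge (introduced on a later slide), which satisfies the ICAR properties; descriptions are modeled here as sets of (attribute, value) pairs:

```python
# A sketch: union-match requires at least one common (attribute, value)
# pair; union-merge keeps all pairs, so merging never loses information.

def union_match(ei, ej):
    return bool(ei & ej)

def union_merge(ei, ej):
    return ei | ej

e1 = frozenset({("name", "S. Kubrick"), ("birthPlace", "Manhattan")})
e2 = frozenset({("name", "Stanley Kubrick"), ("birthPlace", "Manhattan")})
# Idempotence: union_match(e1, e1) is True and union_merge(e1, e1) == e1
# Representativity: anything matching e1 also matches union_merge(e1, e2),
# since all of e1's pairs survive the merge
print(union_match(e1, e2), union_merge(e1, e2) == e1 | e2)   # True True
```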
Fall 2020
Merge Domination & Monotonicity
When the match and merge functions satisfy the ICAR properties, there is a natural (partial) domination order of descriptions
Given two descriptions e1 and e2, we say that e1 is merge-dominated by e2, denoted e1 ≤ e2, if M(e1, e2) = true and ⊕(e1, e2) = e2
e1 does not add information
▪ Merged descriptions always dominate the ones they were derived from:
∀e1, e2 ∈ E such that M(e1, e2) = true, it holds that e1 ≤ ⊕(e1, e2) and e2 ≤ ⊕(e1, e2)
The match function is monotonic: ∀e1, e2, e3 ∈ E, if e1 ≤ e2 and M(e1, e3) = true, then M(e2, e3) = true
The merge function is monotonic:
∀e1, e2, e3 ∈ E, if e1 ≤ e2 and M(e1, e3) = true, then ⊕(e1, e3) ≤ ⊕(e2, e3)
∀e1, e2, e3 ∈ E, if e1 ≤ e3, e2 ≤ e3 and M(e1, e2) = true, then ⊕(e1, e2) ≤ e3
113
Generic Entity Resolution with Swoosh, F. Naumann, 20.6.2013
Fall 2020
Ordering Match & Merge to Get the Right Answer?
114
e1
e3
M(e2,e3) = true, e23 = ⊕(e2,e3)
M(e1,e23) = true
KB1:SKBRK
KB1:name “S. Kubrick”
KB1:birthPlace KB1:Manhattan
KB1:deathPlace KB1:UnitedKingdom
KB1:diedIn 1999
KB1:Stanley_Kubrick
KB1:birthPlace KB1:Manhattan
KB1:bornIn 1928-7-26
KB1:parents KB1:Gertrude Kubrick
KB1:parents KB1:Jacques Leonard Kubrick
rdf:type yago:AmericanFilmDirectors
KB2:SKubrick
foaf:name “Stanley Kubrick”
KB2:place_of_birth KB2:MNHT
rdf:type foaf:Person
KB2:activeYearsEndYear 7/3/1999
KB2:directed KB2:A_Clockwork_Orange
KB12:Stanley_Kubrick
birthPlace Manhattan
bornIn 1928-7-26
parents Gertrude Kubrick
parents Jacques Leonard Kubrick
rdf:type AmericanFilmDirectors
activeYearsEndYear 7/3/1999
directed A_Clockwork_Orange
e2
Fall 2020
R-Swoosh [Benjelloun et al., 2009]
Assumes ICAR and merge domination
Idea 1: if M(e1, e2) = true, we can remove e1 and e2 and keep only ⊕(e1, e2)
Whatever would match e1 or e2 now also matches ⊕(e1, e2) (representativity and associativity)
Idea 2: removal of dominated descriptions is not necessary as a last step of the algorithm
Assume e1 and e2 appear in the final answer, where e1 ≤ e2
Then M(e1, e2) = true and ⊕(e1, e2) = e2
Thus the comparison of e1 and e2 should have generated the merged description e2, and e1 should have been eliminated
115
Generic Entity Resolution with Swoosh, F. Naumann, 20.6.2013
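A minimal sketch of R-Swoosh (the match and merge functions are parameters, e.g., the union class sketched earlier):

```python
# A sketch: compare each description against the output set E'; on a
# match, remove both and push the merged description back into E.

def r_swoosh(E, match, merge):
    E, E_out = list(E), []
    while E:
        e = E.pop()
        buddy = next((x for x in E_out if match(e, x)), None)
        if buddy is None:
            E_out.append(e)               # no match found: e joins the output
        else:
            E_out.remove(buddy)           # Idea 1: drop e and its match,
            E.append(merge(e, buddy))     # keep only the merged description
    return E_out

# e.g., r_swoosh([e1, e2], union_match, union_merge) -> [e1 | e2]
```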
Fall 2020
R-Swoosh Example
116
e1
e2
e3
E E '
e2
e3
e1
E E '
e23 e1
E E '
e123
E E '
e3
e1
e2
E E '
e123
E E '(a) (b) (c)
(d) (e) (f)
Fall 2020
ER with ICAR properties
The generic ER process is guaranteed to be finite
Entity descriptions can be matched and merged in any order
Dominated entity descriptions can be discarded at any time
The union class of match and merge:
Union-merge: all values are kept in merged descriptions
Union-match: at least one of the values in common
Commutativity of the match and merge functions does not always hold for highly heterogeneous descriptions
117
Fall 2020
Matching and Resolving Entities (II): Matching-based Iterative Resolution
Fall 2020
Matching-based Iterative ER
Pair-wise ER: matching decisions are made independently (deduplication, linkage)
Collective ER: matching decisions depend on other matching decisions, according to positive and negative evidence (general constraints)
Similarity propagation approaches (more scalable):
Dependency graphs of matching decisions
Collective relational clustering
Probabilistic approaches (scalability is an open issue):
Generative models (for acyclic dependencies between match decisions)
Undirected models based on Markov networks (sometimes with a first-order logic syntax)
Hybrids of constraints and probabilistic models
119
Fall 2020
Similarity Propagation Approaches
A graph structure encodes the similarity between entity descriptions and the matching decisions; matching is iteratively assessed by propagating similarity values
The details of how the graph is constructed and traversed, and how (content and context) similarity is computed, vary
Similarity-propagation ER: the match function is re-computed at each iteration step by considering previous matching decisions:
Mn(ei, ej) = true, if simn-1(ei, ej) ≥ θ
Mn(ei, ej) = false, if simn-1(ei, ej) ≤ θ’
Mn(ei, ej) = undecided, otherwise
Total similarity: sim(ei, ej) = a·simc(ei, ej) + (1-a)·simnbr(nbr(ei), nbr(ej)),
where nbr(e) denotes the (in, out) neighborhood nodes of e
120
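A minimal sketch of such a propagation loop (the thresholds, the weight a, and the use of the best neighbor-pair similarity are assumptions for illustration; the update is a contraction in the neighbor term, so the loop converges):

```python
# A sketch: re-compute pair similarities from content similarity and the
# best similarity among neighbor pairs, until a fixed point is reached.

def propagate(pairs, sim_c, nbr, a=0.7, theta=0.8, theta2=0.3, eps=1e-6):
    """pairs: list of (ei, ej); sim_c: content similarity; nbr(e): neighbors."""
    sim = {p: sim_c(*p) for p in pairs}
    changed = True
    while changed:
        changed = False
        for (ei, ej) in pairs:
            nbr_sim = max((sim.get((x, y), 0.0)
                           for x in nbr(ei) for y in nbr(ej)), default=0.0)
            new = a * sim_c(ei, ej) + (1 - a) * nbr_sim
            if abs(new - sim[(ei, ej)]) > eps:
                sim[(ei, ej)], changed = new, True
    return {p: ("match" if s >= theta else
                "non-match" if s <= theta2 else "undecided")
            for p, s in sim.items()}
```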
Fall 2020
Maintaining the Order of Comparisons
In similarity-propagation approaches, the order of comparisons is dynamic
Graph traversal is usually supported by a priority queue (PQ) on the similarity scores of nodes
As entities are resolved, the PQ is updated to maximize effectiveness and reduce re-comparisons
Different strategies of order maintenance:
Based on heuristics:
type of nodes and edge direction [Dong et al. 2005]
degree of nodes [Weis & Naumann 2006]
edge weights [Kalashnikov & Mehrotra 2006]
Triggered by recent matches [Böhm et al. 2012, Lacoste-Julien et al. 2013]
Lazy maintenance [Herschel et al. 2012, Altowim et al. 2014]
121
Fall 2020
Dependency Graph [Dong et al., 2005]
Works on an entity graph constructed from the relational records:
nodes represent similarity comparisons between pairs of records and/or their attribute values (real-valued)
edges represent dependencies between matching decisions, based on the matching of the associated nodes (Boolean-valued)
A matching decision is taken when the real-valued similarity score (between 0 and 1) of a node is above a threshold θ
If it exceeds the threshold, it is marked as match, otherwise as undecided; if no more neighbors are undecided, it is marked as non-match
Idea 1: consider richer matching evidence
Idea 2: propagate similarity between matching decisions
Idea 3: gradually enrich references by merging attribute values (Swoosh-style)
122
Fall 2020
Dependency Graph: Example
Let E be a set of entity descriptions
A node v = {ei, ej}, where ei, ej ∈ E, i ≠ j
An edge e = (va, vb) from va = {eai, eaj} to vb = {ebi, ebj} implies ebi, ebj ∈ values(eai) ∪ values(eaj)
Directed edges are used when the dependency holds in only one direction; dependencies can be stated by domain experts, or inferred from the data semantics (e.g., keys/foreign keys, RDF properties)
Include only nodes whose two entities have the potential to be similar
123
v1
v2
v3 v4
Fall 2020
Consider Richer Matching Evidence [Dong et al., 2005]
Positive evidence (i.e., constraints for match nodes) is captured by the Boolean similarity of neighborhood nodes:
Strong Boolean: resolution implies the resolution of the neighbor
E.g., if two movies are matched, then their directors must also be matched
Weak Boolean: no direct implication
E.g., the similarity of two movies increases as their rdfs:labels become highly similar
Negative evidence (i.e., constraints for non-match nodes) is verified after similarity propagation is performed, and inconsistencies are fixed
124
Fall 2020
Similarity Propagation [Dong et al., 2005]
Similarity function for node u: sim(u) = Srv + Ssb + Swb
Srv: from real-valued neighbors (decision-tree shape)
Ssb: from strong-Boolean-valued neighbors
Swb: from weak-Boolean-valued neighbors
When a node is matched, the similarity scores of its neighbors are re-computed
The process converges if:
the similarity score is monotone in the similarity values of the neighbors
neighbor similarities are re-computed only if the similarity increase is not too small
125
Fall 2020
Traversing the ER Graph
126
Initially all nodes are active and placed in the PQ; a node is processed before its out-neighbors
h
a b
d
c
f
g
e
PQ
f
a
b
c
h
d
e
g
Fall 2020
PQ
b
c
h
d
e
g
Traversing the ER Graph
127
h
a b
d
c
f
g
e
merged node
inactive node
Fall 2020
PQ
c
h
d
e
g
Traversing the ER Graph
128
h
a b
d
c
f
g
e
Fall 2020
PQ
h
d
e
g
Traversing the ER Graph
129
h
a b
d
c
f
g
e
Fall 2020
PQ
d
e
g
Traversing the ER Graph
130
h
a b
d
c
f
g
e
Fall 2020
PQ
g
b
e
Traversing the ER Graph
131
h
a b
d
c
f
g
e
Fall 2020
PQ
b
e
Traversing the ER Graph
132
h
a b
d
c
f
g
e
Fall 2020
PQ
e
c
Traversing the ER Graph
133
h
a b
d
c
f
g
e
Fall 2020
PQ
c
Traversing the ER Graph
134
h
a b
d
c
f
g
e
Fall 2020
PQ
Traversing the ER Graph
135
h
a b
d
c
f
g
e
Fall 2020
Linda [Böhm et al. 2012]
Works on an entity graph constructed from the RDF descriptions; exploits the unique mapping constraint between two KBs
Key Idea: the more matching neighbors via similar relationships two descriptions have, the more likely it is that they match
String similarity of the literal values of entities: checked once
Contextual similarity of the graph neighbors: checked iteratively
Two square matrices (|E| × |E|) are used:
X captures the identified matches (binary values)
Y captures the pair-wise similarities (real values)
Initialization: common neighbors & string similarity of literals
Updates: use the newly identified matches of X, until the priority queue (extracted from Y) becomes empty:
Get the pair (ei, ej) with the highest similarity: match by default!
Update X: matches of ei are also matches of ej
Update the similarity of the nodes influenced by the new matches
136
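A minimal sketch of Linda’s main loop (the neighbor-update and similarity re-computation functions are parameters; helper names are illustrative):

```python
# A sketch: the head of the PQ is a match by default; the 1-1 constraint
# discards competing pairs; influenced pairs are re-scored and re-queued.
import heapq

def linda(pq, influenced, recompute):
    """pq: list of (-similarity, ei, ej); influenced(ei, ej): pairs whose
    similarity depends on the new match; recompute: new similarity."""
    heapq.heapify(pq)
    matched, matches = set(), []
    while pq:
        _, ei, ej = heapq.heappop(pq)
        if ei in matched or ej in matched:
            continue                        # unique mapping (1-1) constraint
        matches.append((ei, ej))            # head of the PQ: match by default
        matched |= {ei, ej}
        for (ek, el) in influenced(ei, ej): # update the influenced pairs
            heapq.heappush(pq, (-recompute(ek, el), ek, el))
    return matches
```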
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 0 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
137
PQ
e1 – e4
e2 – e4
e1 – e3
e5 – e3
e2 – e3
…
e3 e4
e2 e1
e5
A priority queue, derived by an initial similarity computation between all pairs, based on their attribute values
Linda Algorithm Example
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
138
PQ
e1 – e4
e2 – e4
e1 – e3
e5 – e3
e2 – e3
…
e3 e4
e2 e1
e5
Linda Algorithm Example
the head of PQ is a match by default
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
139
PQ
e2 – e4
e1 – e3
e2 – e3
e5 – e3
…
e3 e4
e2 e1
e5
unique mapping constraint (1-1 Assumption)
similarity re-computation, based on the matching neighbors and the names of the links to them
Linda Algorithm Example
directs
directs
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 1 0 0
e3 1 0 0
e4 1 0
e5 1
140
PQ
e2 – e3
e5 – e3
…
e3 e4
e2 e1
e5
directs
directs
Linda Algorithm Example
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 1 0 0
e3 1 0 0
e4 1 0
e5 1
141
PQ
e5 – e3
…
e3 e4
e2 e1
e5
directs
directs
stops when PQ is empty
Linda Algorithm Example
unique mapping constraint (1-1 Assumption)
Fall 2020
Linda Distributed Version
Matches e1 e2 e3 e4 e5
e1 1 0 0 0 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
142
PQ
e1 – e4
e2 – e4
e1 – e3
e2 – e3
…
e3 e4
e2 e1
e5
knows
knows
PQ
e5 – e3
e5 – e4
…
Node 1
Node 2
Node 1
Node 2
The node to hold the PQ entry (a,b) and the respective vertex neighbors is determined by a modulo operation of the first component (a)
Fall 2020
Matching and Resolving Entities (II): Progressive Resolution Techniques
Fall 2020
Progressive ER
144
Blocking → Planning → Matching → Update
Optimization: maximize the benefit (number or type of matches) for a given cost (number of comparisons, disk/cloud accesses)
Progressive ER: estimates which part of the data to resolve next and adapts this decision in a pay-as-you-go fashion
▪ Good for high Velocity
Fall 2020
Progressive Approach to Relational ER [Altowim et al. 2014]
Key Idea: divides the ER process into several windows and generates a resolution plan for each window
– specifies which blocks and entity pairs within these blocks will be resolved during the plan-execution phase of a window
– associates with each identified pair the order in which to apply the similarity functions on the attributes of the two entities
Lazy resolution strategy to resolve pairs with the smallest cost
Unlike single-entity-type resolution, block-based prioritization is significantly more important when resolving multiple types
145
Fall 2020
Relational Progressive ER: Example
146
A1 M1
A2 D2
D1
M2
v1
v2
v3
v4 v8
v7
v6
v5 v9
v10
v11
v12
Altowim et al. Progressive Approach to Relational ER VLDB 2014
Actors Movies Directors
Blocks with entities of the same type
A1 M1
v1
v2
v3 v7
v6
v5
Fall 2020
Matching Probability Estimation
The nodes in the ER graph influencing a pair of entities are interpreted as common causes of the same effect
The probability that a node is processed at each round depends on the number of cause nodes that are matches and/or on the percentage of matching nodes in the block of the effect node
147
Noisy-OR model: not required to know all the causes
Effect: the node vi considered for matching
Causes: (a) the influencing matching nodes xj, and (b) the fraction of matching pairs in the block of vi
Altowim et al., Progressive Approach to Relational ER, VLDB 2014
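A minimal sketch of the noisy-OR combination (the leak term and the probabilities are illustrative):

```python
# A sketch: each matching cause j fails to trigger the effect with
# probability (1 - p_j); a leak term covers the causes we do not observe.

def noisy_or(cause_probs, leak=0.05):
    """P(effect) = 1 - (1 - leak) * prod(1 - p_j)."""
    q = 1.0 - leak
    for p in cause_probs:
        q *= 1.0 - p
    return 1.0 - q

# e.g., two matching neighbor nodes plus the block's matching fraction:
print(noisy_or([0.3, 0.2, 0.1]))   # ~0.52
```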
Fall 2020
Heuristics for Estimating Node Benefit
148
A2 D2M2
A1 M1 D1
v1
v2
v3 v7
v6
v5
v4 v8
v9
v10
v11
v12
[Figure: direct benefit = fraction of matching pairs in the block (e.g., v1, v2 in M1); indirect benefit = match/non-match influence between the blocks A and M]
The Impact is estimated for k nodes of the same entity type as the average number of their directly dependent unresolved nodes
Altowim et al., Progressive Approach to Relational ER, VLDB 2014
Fall 2020
Traversing the ER Graph (in Blocks)
[ER graph: nodes a, b, c, d, e, f, g, h with scores 0.1-0.3, grouped into blocks F1, A1, D1, M1, M2, D2]
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Initially, a block of the entity type with the highest out-degree (impact) is selected (at random), since P(vi) is the same for all nodes (of the same type)
Impact(directors) = 3, Impact(actors) = 2, Impact(movies) = 1, Impact(family) = 1; initial P(vi) = δ
s(F1) = s(e) = 0.1 + 0.1·1 = 0.2
s(A1) = s(c) + s(h) = (0.1 + 0.1·2)·2 = 0.6
s(M1) = s(a) + s(d) = (δ + δ·1)·2 = 4δ
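A minimal sketch of the block-scoring rule, reproducing s(A1) from the figure (helper names are illustrative):

```python
# A sketch: s(B) = sum over vi in B of P(vi) + P(vi) * Impact(vi).

def block_score(nodes, P, impact):
    return sum(P[v] + P[v] * impact[v] for v in nodes)

P = {"c": 0.1, "h": 0.1}
impact = {"c": 2, "h": 2}                   # actor nodes have Impact = 2
print(block_score(["c", "h"], P, impact))   # 0.6 = s(A1)
```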
Fall 2020
ab
d
cf
g
e
h
0.3
0.1
0.1
0.2
0.3
0.30.2
0.1
F1
A1
D1M1
M2
Then, update the scores, load the block with the highest score and resolve its nodes
D2
s(Α1)=s(c)+s(h)=(0.1+0.1*2)*2=0.6
s(M1)=s(a)+s(d)= (δ+δ*1)*2=4δ
s(F1)=s(e)= 0.1+0.1*1=0.2
Traversing the ER Graph (in Blocks)
Impact(directors) = 3,Impact(actors) = 2,Impact(movies) = 1Impact(family) = 1Initial P(vi) = δ
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Fall 2020
ab
d
cf
g
e
h
0.3
0.1
0.1
0.2
0.3
0.30.2
0.1
F1
A1
D1M1
M2
Then, update the scores, load the block with the highest score and resolve its nodes
D2
s(M1)=s(a)+s(d)= (δ+δ*1)+ (0.2+0.2*1)=2δ+0.4
s(F1)=s(e)= 0.1+0.1*1=0.2
Traversing the ER Graph (in Blocks)
Impact(directors) = 3,Impact(actors) = 2,Impact(movies) = 1Impact(family) = 1Initial P(vi) = δ
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Fall 2020
ab
d
cf
g
e
h
0.3
0.1
0.1
0.2
0.3
0.30.2
0.1
F1
A1
D1M1
M2
Then, update the scores, load the block with the highest score and resolve its nodes
D2
s(D2)=s(g)= δ+δ*3=4δ
s(M2)=s(f)= δ+δ*1=2δ
s(F1)=s(e)= 0.1+0.1*1=0.2
Traversing the ER Graph (in Blocks)
Impact(directors) = 3,Impact(actors) = 2,Impact(movies) = 1Impact(family) = 1Initial P(vi) = δ
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Fall 2020
Comparison of ER Graph Traversals
Property: [Dong et al. 2005] / [Böhm et al. 2012] / [Altowim et al. 2014]
Seed node selection: any node with no in-edges / the node with the highest content similarity / any node of the entity type with the highest out-degree
Matching condition: similarity is above a threshold value / the highest score (head of the PQ) / a (black-box) resolve function returns true
Similarity/evidence update (PQ): weighted sum of the in-neighbors’ similarities / weighted sum of the matching in-/out-neighbors’ similarities / leaky noisy-OR of the matching in-neighbors
Backtracking: (markers not preserved in the transcript)
Traversal policy: topological sort / advance in the PQ the neighbors (with the most similar relationships) of the recent match / sort the PQ based on #in-matches and #out-neighbors
Fall 2020
Web-scale Progressive ER: Ongoing Work
Relational progressive ER algorithms assume that the more entity pairs are correctly identified, the higher the quality of the result is expected to be
We are interested in characterizing the quality of the resolved pairs:
# of real-world entities resolved (shallow strategy): entity-centric search (entity coverage)
# of real-world entity graphs resolved (deep strategy): entity-centric recommendation (relationship completeness)
# of descriptions resolved for the same real-world entity (tail strategy): web-scale knowledge curation (attribute completeness)
154
Fall 2020
Matching and Resolving Entities (II): Challenges and Open Issues
Fall 2020
Challenges and Open Issues
Tight coupling of blocking with iterative matching/merging:
better control of block characteristics w.r.t. the entity similarity subsequently used (see [J. Fisher et al. 2015])
Progressive ER with quality guarantees:
guarantees (e.g., coverage) regarding the quality of matches/merges w.r.t. subsequent entity-centric services and data analysis tasks
ER for Big Data:
algorithms for high-Velocity (see [D. Firmani et al. 2016]), high-Variety, and high-Volume entity descriptions (see [Q. Wang et al. 2015] [L. Kolb et al. 2012])
Large-scale ER testbeds:
real-world ground-truth datasets for different match types and open-source ER platforms (see [Efthymiou et al. 2015, 2016])
156
Fall 2020
Challenges and Open Issues
Crowdsourced ER: reduce the crowdsourcing cost for obtaining ground truth (see [Chai et al. 2016] [Gokhale et al. 2014] [Wang et al. 2012])
Temporal ER: resolve evolving entity descriptions and analyse the history of descriptions (see [Dong & Tan 2015])
Uncertain ER: consider confidence scores when resolving certain & uncertain entity descriptions (see [Gal 2014] [Demartini et al. 2013])
Privacy-aware ER: trade-off between entity obfuscation techniques and the quality of ER results (see [Whang & Garcia-Molina 2013])
157
Fall 2020
http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=823
Fall 2020
The Minoan ER Framework
159
http://csd.uoc.gr/~vefthym/minoanER
Fall 2020
License
These slides are made available under a Creative Commons Attribution-ShareAlike license (CC BY-SA 3.0): http://creativecommons.org/licenses/by-sa/3.0/
You can share and remix this work, provided that you keep the attribution to the original authors intact, and that, if you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one
160
Fall 2020
Questions?
161