Fall 2020
V. Efthymiou, V. Christophides{vefthym|christop}@csd.uoc.gr
http://www.csd.uoc.gr/~hy562
University of Crete, Fall 2020
Entity Resolution in the Web of Data
Fall 2020
Describing and Linking Entities: Entity-Centric Applications &
Knowledge Bases
Fall 2020
Why “Entities”
2
Fall 2020
Why “Entities”
3
Fall 2020
“Entities” is What a Large Part of Our Knowledge is About
4
Time in Bilbao Bilbao Population Popular schools in Bilbao
Weather in Bilbao
Bilbao places to go
Bilbao photos
Bilbao map
Fall 2020
Core Entities
5
Locations
Movies
Organizations
Persons
Fall 2020
What Is A Knowledge Base (KB)?
Comprehensive, machine-readable descriptions of real-world entities are hosted in knowledge bases (KB)
Entity names, types, attributes, relationships, provenance info
Entities are described as instances of one or several conceptual types and may be linked through relationships
The core Semantic Web data model
6
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
facts
properties/attributes values
Fall 2020
Knowledge Bases: Scope
Domain-specific Knowledge Bases
Focus on a well-defined domain:
IMDB for movies, MusicBrainz for music, GeoNames for geo, CIA World Factbook for demographics, SNOMED CT for medicine, etc.
Cross-domain Knowledge Bases
Cover a variety of knowledge across domains
Concept-centric (Intensional): Cyc, WordNet
Entity-centric (Extensional): DBpedia, Wikidata, YAGO, Freebase, Google’s Knowledge Graph, Microsoft’s Satori (Bing), Knowledge Vault
7
Knowledge Curation & Knowledge Fusion: Challenges, Models, and Applications, X. L. Dong, D. Srivastava, SIGMOD'15 Tutorial
8
Fall 2020
Knowledge Bases: History
1985 1990 2000 2005 2010
Cyc
∀x: human(x) ⇒ (∃y: mother(x,y) ∧ ∃z: father(x,z))
∀x,u,w: mother(x,u) ∧ mother(x,w) ⇒ u = w
WordNet
guitarist ⊂ {player, musician} ⊂ artist
algebraist ⊂ mathematician ⊂ scientist
Wikipedia
4.5 Mio. English articles
20 Mio. contributors
from humans, for humans
from algorithms, for machines
Adapted from Knowledge Bases in the Age of Big Data Analytics, F. Suchanek, G. Weikum, VLDB 2014 Tutorial
Knowledge Graph
9
Fall 2020
The Long Tail of Knowledge
Freebase is large, but still very incomplete:
We need automatic KB construction methods
9
Relation % unknown in Freebase
Profession 68%
Place of birth 71%
Nationality 75%
Education 91%
Spouse 92%
Parents 94%
Kevin Murphy From Big Data to Big Knowledge CIKM 2013 industry talk cikm2013.org/slides/kevin.pdf
Fall 2020
Google’s Knowledge Graph (GKG)
600M nodes (entities)
1.5K entity types
20B edges (triples)
35K properties
10
▪ The Knowledge Graph is used by Google to enhance its search engine's search results with semantic-search information
– GKG is rooted in public KBs such as Freebase, Wikipedia and the CIA World Factbook
– It is also augmented at a much larger scale, aiming at comprehensive breadth and depth of entity descriptions
Fall 2020
Google Knowledge Vault (GKV)
A. Bordes, E. Gabrilovich, Constructing and Mining Web-scale Knowledge Graphs, KDD 2014 Tutorial
11
Evaluates the trustworthiness of sources by the correctness of their factual information
Probabilistic Knowledge Fusion
Fall 2020
Stanley Kubrick
Entity descriptions
Resolved entities
Entity-Centric Processing
lived-in
Entity-centric search
Entity-centric recommendation
Entity-Centric Applications
12
entity facts
spouse
birth place
director
death place
spouse
death place
spouse
studio
entity relationships
Fall 2020
https://finnaarupnielsen.wordpress.com/tag/wikidata/
Citizen Journalism: Occupations of Persons from Panama Papers
Fall 2020
Entity-Centric Information Processing
Automated construction of entity descriptions
Information extraction: extract new entities from web/text
Link prediction: add relationships among entities
Entity integration and resolution
Knowledge base integration: instance & ontology mappings
Entity resolution: merging or splitting similar entities
Entity-centric access interfaces
Augmented search: interpret the meaning of queries using entities and compute answers based on a knowledge base
Entity-based matching: recommend new entities given an entity, a user or a query
Entity-centric summarization: of textual posts in social media
14
Fall 2020
Describing and Linking Entities: Linked Knowledge Bases &
The Web of Data
Fall 2020
The Web of Data
16
Adapted from Chris Bizer, Richard Cyganiak, Tom Heath, http://linkeddata.org/guides-and-tutorials
Web of Data
Conceptual Representation
[Diagram: entities connected by typed links; each entity is identified by URIs, which represent it through data published on the Web in formats such as spreadsheets, HTML, XML, and RDFa (semi-structured, statistical, and triple data)]
A Web of things in the world (aka entities), described by data on the Web
Fall 2020
From Open to Linked Data
17
Fall 2020
The Linked Data Principles
Anyone can publish data on the Web regarding real-world entities by respecting a minimal set of syntactic conventions:
1. Use URIs as names for things
2. Use HTTP URIs so that people and machines can look up those entities
3. When someone looks up a URI, provide a useful description
4. Include links to other URIs, so that they can discover more things
Data becomes self-describing: when applications encounter data described by an unfamiliar vocabulary, they can resolve its URIs and understand the vocabulary terms by their RDFS or OWL definitions
18
Fall 2020
Linking Entities in KBs
19
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
lmdb:director/8476
lmdb:director_name
“Stanley Kubrick”
rdf:type foaf:Person
foaf:made lmdb:film/1894
foaf:made lmdb:film/2014
foaf:made lmdb:film/2685
fbase:m.05ldxl
fbase:film.directedBy lmdb:director/8476
fbase:film_cut/runtime “136”
foaf:name “A Clockwork Orange”
dbpedia:Manhattan
rdfs:label “Manhattan”
rdf:type schema.org:Place
dbo:populationTotal
1,626,159
Fall 2020
From Data Silos to the Web of Data
Thanks to W3C standards we can support:
A uniform and universal access to information
Easy linking and re-use of information in different contexts
Network effects in adding value to information
Data interoperability at the Web scale: connecting data from diverse sources and across domains
Openness: many common things are represented in multiple sources; new sources can be discovered at run-time by following links
20
Fall 2020
The Linked Open Data (LOD) Cloud
21
Media
Government
Geographic
Publications
User-generated
Life sciences
Cross-domain
Social networking
Linguistics
1014 Knowledge Bases, 60B Triples
649 vocabularies
Last updated: 2014-08-30 http://lod-cloud.net/
Fall 2020
Linked Open Vocabularies (LOV)
http://lov.okfn.org/dataset/lov
22
Fall 2020
adapted from Suchanek & Weikum tutorial@SIGMOD 2013
LOD Cloud and KBs
23
Fall 2020
LOD KBs: Entity Statistics
24
685 entity types, 4.58 M entities, 5.83 M triples, 2,795 properties
1.5 K entity types, 46.3 M entities, 2.67 B triples, 4.5 K properties
350 K entity types, 10 M entities, 120 M triples, 100 properties
adapted from Suchanek & Weikum tutorial @ SIGMOD 2013
English version
52 entity types, 503 K entities, 6 M triples
123 entity types, 1.9 B entities, 9.6 B triples, 112 properties
Fall 2020
LOD KBs: Entity Interlinking
25
Only 56.11% of the KBs link to at least one other KB
17.36% of them link to only one other KB (typically DBpedia)
The LOD cloud diagram is sparse
top KBs (wrt. in-degree)
In-degree
DBpedia 207
geonames 141
w3.org 117
Fall 2020
Describing and Linking Entities: Description Quality &
Entity Resolution
Fall 2020
Quality of Entity Descriptions in the Web of Data
Given the open and decentralized nature of the Web, the reliability and usability of entity descriptions need to be constantly improved:
Incompleteness: real-world entities are only partially described in KBs
Redundancy: descriptions of the same real world entities usually overlap in multiple KBs
Inconsistency: real world entities may have conflicting descriptions across KBs
Incorrectness: errors can be propagated from one KB to the other due to manual copying or automated extraction/fusion techniques
29
Fall 2020
Various Forms of KBs Overlapping
Among KBs (inter-overlapping, due to common data sources)
Within the same KB (intra-overlapping, due to data extraction or curation problems); this occurs less often than inter-overlapping
e.g., dbpedia:Robert_Soloway and dbpedia:Spam_King refer to the same individual, Robert Soloway, who was nicknamed the “Spam King”
30
Overlapping descriptions are not identical, even if they stem from the same source
Fall 2020
Entity Resolution (ER)
The problem of identifying descriptions of the same real-world entity
31
Stanley Kubrick
A Clockwork Orange
somehow similar descriptions
highly similar descriptions
director
Data Variety
Fall 2020
From Highly to Somehow Similar
Highly Similar
feature many common tokens in the values of semantically related attributes
heavily interlinked
mostly using owl:sameAs predicates
usually provide complementary descriptions of the same entity in terms of facets, temporal evolution
good for fusing
typically met in central KBs
Somehow Similar
feature significantly fewer common tokens in attributes that are not always semantically related
sparsely interlinked
using various kinds of predicates
usually complementary descriptions of the same entity in terms of domains, aspects
good for linking
typically met in peripheral KBs
32
Fall 2020
How does ER Improve KB Quality?
KB Completeness: linking somehow similar descriptions will increase the coverage of entity facts and relationships
KB Conciseness: merging highly similar descriptions will reduce duplicate entity facts and relationships
KB Consistency: matching similar descriptions will enable the detection of conflicting assertions
KB Correctness: splitting complex descriptions will facilitate entity repair
33
Fall 2020
Web-Scale ER: Challenges
The two core ER problems, namely how to
effectively compute entity similarity
efficiently resolve sets of entities within or across KBs
are challenged by the
large number of KBs (~ hundreds)
large number of entity types & properties (~ thousands)
massive volume of entities (~ millions)
Large-scale, progressive, multi-type, cross-domain ER
Big Data Volume, Velocity and Variety!
34
Fall 2020
Entity Resolution – Formal Definition
Entity resolution: the problem of identifying descriptions of the same entity within or across sources
E = {e1, ..., em} is a set of entity descriptions
M : E × E → {true, false} is a match function
The resolution of entities in E results in a partition P = {p1, ..., pn} of E, such that:
1. ∀ei, ej ∈ E with M(ei, ej) = true, ∃pk ∈ P : ei, ej ∈ pk
(all the matching descriptions are in the same partition)
2. ∀pk ∈ P, ∀ei, ej ∈ pk : M(ei, ej) = true
(each partition contains only matching descriptions)
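Conditions 1 and 2 state that P consists exactly of the equivalence classes induced by M. A minimal sketch (not from the slides; the helper names are illustrative) that computes P from pairwise match decisions with union-find, assuming M behaves as the equivalence relation formalized later:

```python
# A sketch: derive the partition P of E from a pairwise match function M.

def resolve(entities, match):
    """entities: list of hashable descriptions; match: M(ei, ej) -> bool."""
    parent = {e: e for e in entities}

    def find(e):                            # representative of e's partition
        while parent[e] != e:
            parent[e] = parent[parent[e]]   # path halving
            e = parent[e]
        return e

    for i, ei in enumerate(entities):       # union every matching pair
        for ej in entities[i + 1:]:
            if match(ei, ej):
                parent[find(ei)] = find(ej)

    partition = {}
    for e in entities:                      # group descriptions by representative
        partition.setdefault(find(e), []).append(e)
    return list(partition.values())
```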
Fall 2020
Entity Resolution - Example
36
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
lmdb:director/8476
lmdb:director_name
“Stanley Kubrick”
rdf:type foaf:Person
foaf:made lmdb:film/1894
foaf:made lmdb:film/2014
foaf:made lmdb:film/2685
e1
e3
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
fbase:m.05ldxl
fbase:film.directedBy lmdb:director/8476
fbase:film_cut/runtime “136”
foaf:name “A Clockwork Orange”
e2 e4
e5
Imdb:co0041067
imbd:location GeoNames:Buckinghamshire
imbd:filmography “A Clockwork Orange”
Fall 2020
Entity Resolution - Example
Assume as input of entity resolution the set E = {e1, e2, e3, e4, e5}
A possible output P = {{e1, e3}, {e2, e4}, {e5}} indicates that:
e1, e3 refer to the same real-world person, the director Stanley Kubrick
e2, e4 represent a different entity, the movie A Clockwork Orange
e5 represents a third thing, the movie studio Pinewood
37
e1 e3
e2 e4
e5
Fall 2020
Matching Graph-structured Entities
38
dbpedia:Stanley_Kubrick
freebase:Stanley_Kubrick
dbpedia:A_Clockwork_Orange
linkedMDB:A_Clockwork_Orange
dbpedia:Katharina_Kubrick
dbpedia:The_Shining
freebase:The_Shining
freebase:Ruth_Sobotka
directs, director
director, children, director, spouse
e1
e2
e3
e4
e5 e6e7 e8
Fall 2020
Single (Pairwise) Entity Matching Based on Content Similarity
39
e1
birthPlace Manhattan
type Person
type AmericanFilmDirectors
type AmateurChessPlayers
e3
director_name “Stanley Kubrick”
type Person
directs film/1894
directs film/2014
directs film/2685
thresh = 0.5
simc: let the content similarity of two descriptions be the Jaccard similarity of their values’ token sets
simc(e1,e3) = Jaccard({Manhattan, Person, AmericanFilmDirectors, AmateurChessPlayers}, {Stanley, Kubrick, Person, 1894, 2014, 2685}) ≈ 0.1
Matching decisions are independent
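A minimal sketch of simc as defined above, with the slide’s thresh = 0.5; the dict-based description format and the helper names are assumptions for illustration:

```python
# A sketch of simc: Jaccard similarity of the token sets of two
# descriptions' values, plus a threshold-based match decision.

def tokens(description):
    """description: dict of attribute -> value; tokenize all values."""
    return {tok for value in description.values() for tok in str(value).split()}

def sim_c(ei, ej):
    ti, tj = tokens(ei), tokens(ej)
    return len(ti & tj) / len(ti | tj) if ti | tj else 0.0

e1 = {"birthPlace": "Manhattan",
      "type": "Person AmericanFilmDirectors AmateurChessPlayers"}
e3 = {"director_name": "Stanley Kubrick", "type": "Person",
      "directs": "1894 2014 2685"}   # local names only, as tokenized on the slide

print(sim_c(e1, e3))          # 1 common token out of 9: ~0.11, rounded to 0.1
print(sim_c(e1, e3) >= 0.5)   # False: e1 and e3 do not match on content alone
```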
Fall 2020
Collective (Joint) Entity Matching Based on Structural Similarity
40
dbpedia:Stanley_Kubrick
freebase:Stanley_Kubrick
dbpedia:A_Clockwork_Orange
linkedMDB:A_Clockwork_Orange
dbpedia:Katharina_Kubrick
dbpedia:The_Shining
freebase:The_Shining
freebase:Ruth_Sobotka
director, children, director, spouse
e1
e2
e3
e4
e5 e6e7 e8
One matching provides evidence for another
director directs
Fall 2020
Matching Neighborhood
41
e2
director e1
Work/runtime “136”
name “A Clockwork Orange”
e4
film.directedBy e3
film_cut/runtime “136”
name “A Clockwork Orange”
thresh = 0.5
simc(e2,e4) = Jaccard({e1, 136, A, Clockwork, Orange}, {e3, 136, A, Clockwork, Orange}) ≈ 0.66
Fall 2020
Similarity Propagation
42
dbpedia:Stanley_Kubrick
freebase:Stanley_Kubrick
dbpedia:A_Clockwork_Orange
linkedMDB:A_Clockwork_Orange
dbpedia:Katharina_Kubrick
dbpedia:The_Shining
freebase:The_Shining
freebase:Ruth_Sobotka
directs, director
director, children, director, spouse
e1
e2
e3
e4
e5 e6e7 e8
Fall 2020
Forms of ER
43
sameAs
KB1 KB2
KB1 KB2
Record Deduplication: ER with results merging
exploits transitivity of matches
Record Linkage: ER without results merging
exploits exclusivity of matches
Fall 2020
High similarity in structure
Low similarity in structure
High similarity in content
Low similarity in content
ER Forms & Similarity
44
string sim. in the values of specific atts from one relation
set sim. in the values of specific atts from two relations
att & value sim. in a network of relations
Fall 2020
ER Settings
Two kinds of entity collections as input:
Clean: redundancy-free
Dirty: contains redundant entity descriptions
Given two entity collections as input, an ER task can be:
Clean-Clean ER: given two clean, but overlapping entity collections, identify the common entity descriptions
a.k.a. the record linkage problem in databases
Dirty-Dirty ER: identify the unique entity descriptions contained in the union of two dirty input entity collections
a.k.a. the deduplication problem in databases
Dirty-Clean Entity Resolution
45
Fall 2020
What We Will Cover Next
Describing and Linking Entities
Entity-Centric Applications & Knowledge Bases
Linked Knowledge Bases & The Web of Data
Description Quality & Entity Resolution
Matching and Resolving Entities (I)
Entity Similarity (Content & Context)
Blocking Techniques (Token, Attribute & URI)
Block Post-Processing
Matching and Resolving Entities (II)
Iterative Resolution Techniques (Merging & Matching)
Progressive Resolution Techniques
Conclusions & Open Issues
46
Fall 2020
Matching and Resolving Entities (I): Entity Similarity
Fall 2020
Similar or Dissimilar?
Fall 2020
Shape Color
Size Pattern
Similarity is Domain Specific
Fall 2020
Entity Resolution - Match
Matches: sets of entity descriptions that refer to the same real-world entity; intuitively:
Matching descriptions are placed in the same partition
All the descriptions of the same partition match
A match function M() maps each pair of entity descriptions (ei, ej) to {true, false}, i.e., match or no match
M(ei, ej) = true => ei, ej are matches
M(ei, ej) = false => ei, ej are non-matches
53
Fall 2020
Match Function: Formal Properties
The match function M() introduces an equivalence relation (owl:sameAs) among entity descriptions
Reflexivity: ∀ei ∈ E, M(ei, ei) = true
Symmetry: ∀ei, ej ∈ E, M(ei, ej) = M(ej, ei)
Transitivity: ∀ei, ej, ek ∈ E, if M(ei, ej) = true and M(ej, ek) = true, then M(ei, ek) = true
Additional properties are useful when linking & repairing entities
Exclusivity (1-1 assumption): ∀ei, ek ∈ E (ek ≠ ei) and ∀ej ∈ E’, if M(ei, ej) = true, then M(ek, ej) = false
54
Fall 2020
Entity Resolution - Similarity
In practice, the match function is defined via a similarity function sim(), measuring how similar two entity descriptions are to each other, according to certain comparison criteria
Given a similarity threshold θ:
M(ei, ej) = true, if sim(ei, ej) ≥ θ
M(ei, ej) = false, otherwise
ML techniques for automatically learning similarity measures are challenged by Web-scale entity resolution [Köpcke et al. 2010]
adaptive learning techniques require training data for each domain [Tejada et al. 2002] [Bilenko et al. 2003]
active learning techniques (threshold-based Boolean functions or linear classifiers) work well with highly similar descriptions [Arasu et al. 2010]
55
Fall 2020
Entity Similarity: Example
although not identical, e2 and e4 are highly similar
57
dbpedia:A_Clockwork_Orange_(film)
dbo:director dbpedia:Stanley_Kubrick
dbo:Work/runtime
“136”
foaf:name “A Clockwork Orange”
fbase:m.05ldxl
fbase:film.directedBy lmdb:director/8476
fbase:film_cut/runtime “136”
foaf:name “A Clockwork Orange”
e2 e4
dbpedia:Stanley_Kubrick
dbo:birthPlace
dbpedia:Manhattan
rdf:type foaf:Person
rdf:type yago:AmericanFilmDirectors
rdf:type yago:AmateurChessPlayers
lmdb:director/8476
lmdb:director_name
“Stanley Kubrick”
rdf:type foaf:Person
foaf:made lmdb:film/1894
foaf:made lmdb:film/2014
foaf:made lmdb:film/2685
e1
e3
e1 and e3 are at best somehow similar
Fall 2020
The Role of Similarity Functions: A Pragmatic Case
Missed matching pairs of entity descriptions
True matching pairs of entity
descriptions
Pairs of entity descriptions satisfying a
pragmatic similarity function
Set of all pairs of entity
descriptions
Matching pairs of entity
descriptions
Fall 2020
The Role of Similarity Functions: The Ideal Case
Set of all pairs of entity
descriptions
Matching pairs of entity
descriptions
Pairs of entity descriptions
satisfying an idealsimilarity function
Fall 2020
Similarity Metric: Formal Description
A similarity function sim(): E × E → R is a metric if, for every ei, ej, el ∈ E, it satisfies the following conditions:
sim(ei, ej) ≥ 0 (non-negativity)
sim(ei, ei) ≥ sim(ei, ej) (self-similarity is maximal)
sim(ei, ej) = sim(ej, ei) (symmetry)
sim(ei, ej) = sim(ei, ei) = sim(ej, ej) ⟺ ei = ej (identity)
sim(ei, ej) + sim(ej, el) ≤ sim(ei, el) + sim(ej, ej) (triangle inequality)
Normalized similarity: sim(ei, ej) ∈ [0, 1]
sim(ei, ei) = 1 for an exact match
sim(ei, ej) = 0 for “completely” different ei and ej
0 < sim(ei, ej) < 1 for some approximate similarity
Any similarity metric can be transformed to a distance metric and vice versa:
sim(ei, ej) = 1 - dist(ei, ej)
60
Fall 2020
Euclidean vs Non-Euclidean Distances
A Euclidean space has some number of real-valued dimensions and dense points; there is a notion of the average of two points
A Euclidean distance is based on the locations of points in such a space (L2-norm, L1-norm)
A non-Euclidean distance is based on properties of points, but not on their “location” in space: Jaccard, Cosine, Edit, Hamming distance
61
http://dataconomy.com/implementing-the-five-most-popular-similarity-measures-in-python/
Fall 2020
Matching and Resolving Entities (I): Content-based Entity Similarity
Fall 2020
In Search of Entity Similarity Measures
Defining similarity functions that satisfy the formal properties of metric spaces is, in practice, too restrictive for non-geometric models
Two main families of similarity measures for resolving entity descriptions in the Web of data:
Content-based: mostly for measuring the string similarity of attribute values in pairs of entity descriptions (character-based, token-based)
Context-based: exploit the similarity of neighboring descriptions via different entity relationships (tree-based, graph-based)
63
Fall 2020
String Similarity Measures
Record Linkage: Similarity Measures and Algorithms, N. Koudas, S. Sarawagi, D. Srivastava, SIGMOD'06 Tutorial
Dice
Phonetic
Domain dependent (rules, data types)
Good for text-intensive descriptions
Good for names
Good for abbreviations, nicknames
64
Fall 2020
Tokenizing Strings: Separators
Forming words from a sequence of characters: surprisingly complex in English, and can be harder in other languages
General idea: separate the string into tokens using some separator
Space, hyphen, punctuation, special characters
Usually also convert to lowercase
Problems:
Sometimes hyphens should be considered either as part of the word or as a word separator: e.g., Spanish-speaking
Apostrophes can be a part of a word, a part of a possessive, or just a mistake: e.g., master's degree
Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations: e.g., F.M.I
65
Similarity measures, Felix Naumann, 11.6.2013
Fall 2020
Tokenizing Strings: n-grams
Split the string into short substrings of length n, using a sliding window over the string
n=2: bigrams
n=3: trigrams
Variation: pad with n - 1 special characters, emphasizing the beginning and end of the string
Variation: include positional information to weight similarities
Number of n-grams = |x| - n + 1
Count how many n-grams are common to both strings
66
Similarity measures, Felix Naumann, 11.6.2013
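A minimal sketch of n-gram tokenization with the padding variation (function names are illustrative):

```python
# A sketch: n-grams over a sliding window, padded with n-1 special
# characters to emphasize the beginning and end of the string.

def ngrams(s, n=3, pad="#"):
    s = pad * (n - 1) + s.lower() + pad * (n - 1)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def common_ngrams(a, b, n=3):
    """Count how many n-grams are common to both strings (set semantics)."""
    return len(set(ngrams(a, n)) & set(ngrams(b, n)))

print(ngrams("Troy", n=2))             # ['#t', 'tr', 'ro', 'oy', 'y#']
print(common_ngrams("Troy", "Troja"))  # 3: '##t', '#tr', 'tro'
```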
Fall 2020
Token-based Entity Similarity
67
name Eiffel Tower
architect Sauvestre
year 1889
location Paris
name Statue of
Liberty
architect Bartholdi Eiffel
year 1886
located NY
about Lady liberty
architect Eiffel
location NY
about Eiffel Tower
architect Sauvestre
year 1889
located Paris
name White Tower
location Thessaloniki
year-
constructed
1450
e1e2
e3
e4
e5
Jaccard(e1,e3) = 1/8, Jaccard(e1,e4) = 1, Jaccard(e1,e5) = 1/8
Jaccard(e2,e3) = 3/7, Jaccard(e2,e4) = 1/11, Jaccard(e2,e5) = 0/11
Jaccard(tokens(ei), tokens(ej)) = |tokens(ei) ∩ tokens(ej)| / |tokens(ei) ∪ tokens(ej)|
Fall 2020
Matching and Resolving Entities (I): Context-based Entity Similarity
Fall 2020
Similarity of Hierarchical Data
69
▪ Relational star / snowflake schema [Ananthakrishna et al. 2002]:
Casting: ID, Actor, Film
S1, Al Pacino, F1
S2, Al Pacino, F2
S3, Marlon Brando, F2
Movie: ID, Name, Year, Rating
F1, The Godfather, 1972, 9.2
F2, Gottvatter, The, 72
▪ Hierarchical XML data [Calado et al. 2010]:
m1 (movie): title “Troy”, year 2004, actors: Brad Pitt, Eric Bana (single branch)
m2 (movie): title “Troja”, year 04, actors: Brad Pit, Erik Bana, Brian Cox (multiple branches)
Fall 2020
DELPHI Containment Metric [Ananthakrishna et al. 2002]
Hybrid measure considering both the similarity of attribute values (tcm) and the similarity of children sets reached by foreign keys (fkcm)
Similarity of attribute values:
Divide tuples into tokens → token sets TS
Compute the edit distance between the tokens of different sets
Determine the weight of each token via inverse document frequency (IDF)
Thus the IDF of a rare term is high, whereas the IDF of a frequent term is likely to be low
The token containment metric tcm measures which fraction of one tuple T is covered by another tuple T’ in a relation of interest Ri
70
Fall 2020
DELPHI Containment Metric [Ananthakrishna et al. 2002]
Similarity of children sets:
The children set CS of a tuple T includes all tuples of a relation Rj referencing T from a relation Ri by means of a foreign key
The foreign-key containment metric (fkcm) measures to what extent the children set of a tuple T is covered by the children set of a tuple T’
71
Fall 2020
DELPHI Containment Metric [Ananthakrishna et al. 2002]
Combining tcm and fkcm:
Both tcm and fkcm are assigned an IDF weight
Use a classification function: pos(x) = 1 if x > 0, -1 otherwise
Threshold for tcm: s1
Threshold for fkcm: s2
Classify a pairwise comparison between T and T’ using pos( w1·pos(tcm(T,T’) - s1) + w2·pos(fkcm(T,T’) - s2) )
• If the final result equals 1, then match, otherwise non-match
72
Fall 2020
DELPHI Containment Metric: Example
73
1. Token sets:
TS(F1) = {The, Godfather, 1972, 9.2}
TS(F2) = {Gottvatter, The, 72}
2. Attribute value similarities (via edit distance):
The ≈ The, Godfather ≈ Gottvatter, 1972 ≈ 72
3. Weights: for simplification, we assume all tokens have equal weight
4. Token containment metric:
tcm(F1,F2) = 3/4, tcm(F2,F1) = 1
5. Children co-occurrence:
fkcm(F1,F2) = 1, fkcm(F2,F1) = 1/2
6. Combination of both metrics (s1 = s2 = 0.5, weights = 1):
pos( pos(3/4 - 0.5) + pos(1 - 0.5) ) = 1
→ F1 and F2 match
Casting: ID, Actor, Film
S1, Al Pacino, F1
S2, Al Pacino, F2
S3, Marlon Brando, F2
Movie: ID, Name, Year, Rating
F1, The Godfather, 1972, 9.2
F2, Gottvatter, The, 72
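A minimal sketch of the classification step, reproducing the numbers of the example above (helper names are illustrative):

```python
# A sketch: combine tcm and fkcm with thresholds s1, s2 and weights w1, w2.

def pos(x):
    return 1 if x > 0 else -1

def delphi_match(tcm, fkcm, s1=0.5, s2=0.5, w1=1.0, w2=1.0):
    """Match iff pos(w1*pos(tcm - s1) + w2*pos(fkcm - s2)) equals 1."""
    return pos(w1 * pos(tcm - s1) + w2 * pos(fkcm - s2)) == 1

print(delphi_match(3 / 4, 1))   # True -> F1 and F2 match, as in step 6
```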
Fall 2020
Graph-Context Similarity
Assign a score s_xy to each pair of entities x and y: two descriptions are similar if their structural neighborhoods are similar (hiding attribute values)
Local similarity indices:
Common neighbors (CN): s_xy = |N(x) ∩ N(y)|, where N(x) is the set of neighbors of x
Jaccard, i.e., normalized common neighbors (NCN): s_xy = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|
Adamic-Adar (AA): s_xy = Σ_{z ∈ N(x) ∩ N(y)} 1/log(k_z), where k_z is the degree of node z; 1/log(k_z) counts the uniqueness of z, which is inversely proportional to its number of neighbors
74
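A minimal sketch of the three local indices over a neighbor map N (node -> set of neighbors); the data is illustrative:

```python
# A sketch of the local similarity indices; math.log is the natural log.
import math

def cn(N, x, y):                     # Common Neighbors
    return len(N[x] & N[y])

def ncn(N, x, y):                    # Jaccard: Normalized Common Neighbors
    union = N[x] | N[y]
    return len(N[x] & N[y]) / len(union) if union else 0.0

def aa(N, x, y):                     # Adamic-Adar: rare neighbors count more
    return sum(1.0 / math.log(len(N[z]))
               for z in N[x] & N[y] if len(N[z]) > 1)

N = {"x": {"u", "v"}, "y": {"u", "v", "w"},
     "u": {"x", "y"}, "v": {"x", "y"}, "w": {"y"}}
print(cn(N, "x", "y"), ncn(N, "x", "y"), aa(N, "x", "y"))  # 2, 2/3, ~2.885
```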
Fall 2020
Content & Structure Similarity: SiGMa [Lacoste-Julien et al 2013]
Even though entities i and j may have no tokens in common, the fact that several of their respective neighbors are matched together is strong evidence that i and j should be matched together. Use neighbors for scoring and suggesting candidate pairs.
78
Fall 2020
SiGMa Neighbors Similarity
Compatible neighbors Nij: a neighbor k of i being matched to a compatible neighbor l of j should encourage i to be matched to j
Nij = {(k, l) : (i, r, k) ∈ KB1 and (j, s, l) ∈ KB2 and relationship r is matched to s}
79
Fall 2020
In Search of Entity Similarity Measures
For highly similar entities, content similarity (i.e., their attribute values) is sufficient
For somehow similar entities, we also need to consider the similarity of the structured context of entities, in an iterative way; identifying the most discriminating attributes and relationships is helpful
An orthogonal issue is the schematic discrepancy of the attributes and relationships employed in the entity descriptions whose hybrid similarity is assessed
Simple: either ontological commitments exist or schematic mappings are provided by the users
Complex: assess the similarity of attributes and relationships based on the similarity of their names or values
81
Fall 2020
Matching and Resolving Entities (I): Blocking Techniques
Fall 2020
Typical ER Workflow
Good for high Volume
83
Blocking: group similar enough entity descriptions; O(n) if hash-based
Matching: recognize pairs of similar entity descriptions; O(n2) over n entity descriptions, or O(b·d2) for b blocks of d descriptions each (on average) after blocking
Blocking reduces the comparisons leading to non-matching entities
Experiment over 9M entity descriptions in a cluster of 15 VMs:
ER workflow without blocking: >200 hrs
ER workflow with blocking: 11 hrs
Fall 2020
Token Blocking [Papadakis et al. 2011]
Assume two clean sets KB1, KB2 of entity descriptions, free of intra-overlapping (Clean-Clean ER)
Each distinct token ti in the values of entity descriptions in KB1 ∪ KB2 corresponds to a block
Each block contains all entity descriptions sharing the corresponding token
Pairs originating from the same (clean) KB are not compared
Token blocking offers a brute-force method for comparing descriptions even if they are highly heterogeneous
The same pair of descriptions is contained in many blocks (redundant comparisons)
Many dissimilar pairs are put in the same block (unnecessary comparisons)
84
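A minimal sketch of token blocking for Clean-Clean ER (the dict-based KB format and helper names are assumptions):

```python
# A sketch: an inverted index from each distinct value token to the
# descriptions containing it; pairs from the same clean KB are not compared.
from collections import defaultdict

def token_blocks(kb1, kb2):
    """kb1, kb2: dicts of entity id -> description (attribute -> value)."""
    blocks = defaultdict(lambda: (set(), set()))
    for side, kb in enumerate((kb1, kb2)):
        for eid, desc in kb.items():
            for value in desc.values():
                for tok in str(value).split():
                    blocks[tok][side].add(eid)
    # keep only blocks that yield at least one cross-KB comparison
    return {t: b for t, b in blocks.items() if b[0] and b[1]}

def candidate_pairs(blocks):
    """The same pair can appear in many blocks: redundant comparisons."""
    return {(i, j) for left, right in blocks.values() for i in left for j in right}
```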
Fall 2020
Token Blocking Example
Redundant comparisons for (e1, e4), contained in 5 different blocks!
85
name Eiffel Tower
architect Sauvestre
year 1889
location Paris
name Statue of
Liberty
architect Bartholdi Eiffel
year 1886
located NY
about Lady liberty
architect Eiffel
location NY
about Eiffel Tower
architect Sauvestre
year 1889
located Paris
name White Tower
location Thessaloniki
year-
constructed
1450
Eiffel
e1, e2,
e3, e4
Tower
e1, e4,
e5
Statue
e2
Liberty
e2, e3
White
e5
1889
e1, e4
Bartholdi
e2
NY
e2, e3
Sauvestre
e1, e4
Paris
e1, e4
1886
e2
1450
e5
Lady
e3
Thessaloniki
e5
Generated
Blocks
e1
e2
e3
e4
e5
Actually, an inverted index
Fall 2020
Token Blocking Example
Unnecessary comparisons between (e1, e3), (e2, e4), (e1, e5)
86
name Eiffel Tower
architect Sauvestre
year 1889
location Paris
name Statue of
Liberty
architect Bartholdi Eiffel
year 1886
located NY
about Lady liberty
architect Eiffel
location NY
about Eiffel Tower
architect Sauvestre
year 1889
located Paris
name White Tower
location Thessaloniki
year-
constructed
1450
Eiffel
e1, e2,
e3, e4
Tower
e1, e4,
e5
Liberty
e2, e3
1889
e1, e4
NY
e2, e3
Sauvestre
e1, e4
Paris
e1, e4
Generated
Blocks
e1
e2
e3
e4
e5
Actually, an inverted index
Fall 2020
Attribute Clustering Blocking [Papadakis et al. 2013]
Token blocking totally ignores the semantics of attributes
When attribute mappings are not known, attribute clustering considers the similarity of attributes
Attribute similarity is computed w.r.t. the string similarities of their values
Two main steps (sketched below):
Similar attributes are placed together in non-overlapping clusters
Token blocking is performed on the descriptions of each cluster
87
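A minimal sketch of the two steps (the value-based attribute similarity is Jaccard over token sets; the simple one-directional linking is an assumption for illustration):

```python
# A sketch: (1) link each KB1 attribute to its most similar KB2 attribute,
# (2) take connected components of the links as attribute clusters; token
# blocking is then run inside each cluster.
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def attribute_clusters(kb1, kb2):
    vals = defaultdict(set)                      # attribute -> value tokens
    for kb in (kb1, kb2):
        for desc in kb.values():
            for attr, value in desc.items():
                vals[attr] |= set(str(value).split())
    attrs1 = {a for d in kb1.values() for a in d}
    attrs2 = {a for d in kb2.values() for a in d}
    links = []
    for a1 in attrs1:                            # most similar KB2 attribute
        best = max(attrs2, key=lambda a2: jaccard(vals[a1], vals[a2]))
        if jaccard(vals[a1], vals[best]) > 0:
            links.append((a1, best))
    clusters = []                                # connected components
    for a, b in links:
        hit = [c for c in clusters if a in c or b in c]
        clusters = [c for c in clusters if c not in hit] + \
                   [set().union({a, b}, *hit)]
    return clusters
```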
Fall 2020
Attribute Clustering Blocking: Example
Finding the attribute of KB2 that is the most similar to “about” of KB1, whose values are {Eiffel, Tower, Statue, Liberty, Auguste, Bartholdi, Joan}, compared to (with Jaccard similarity):
values of work: {Lady, Liberty, Eiffel, Tower, Bartholdi, Fountain} → Jaccard = 4/9
values of artist: {Bartholdi} → Jaccard = 1/8
values of location: {NY, Paris, Washington, D.C.} → Jaccard = 0
values of year-constructed: {1889, 1876} → Jaccard = 0
89
about Eiffel Tower
architect Sauvestre
year 1889
located Paris e11
about Statue of Liberty
architect Bartholdi Eiffel
year 1886
located NY e12
about AugusteBartholdi
born 1834 e13
about Joan Tower
born 1938e14
work Lady Liberty
artist Bartholdi
location NYe15
work Eiffel Tower
year-constructed
1889
location Paris
e16
work Bartholdi Fountain
year-constructed
1876
location Washington D.C.
e17
Fall 2020
Attribute Clustering Blocking: Example
90
about Eiffel Tower
architect Sauvestre
year 1889
located Paris e11
about Statue of Liberty
architect Bartholdi Eiffel
year 1886
located NY e12
about AugusteBartholdi
born 1834 e13
about Joan Tower
born 1938e14
work Lady Liberty
artist Bartholdi
location NYe15
work Eiffel Tower
year-constructed
1889
location Paris
e16
work Bartholdi Fountain
year-constructed
1876
location Washington D.C.
e17
KB1: about, architect, year, born, located
KB2: work, artist, year-constructed, location
Fall 2020
Attribute Clustering Blocking: Example
Compute the transitive closure of the generated attribute pairs; connected attributes form clusters
Example: pairs (about, work), (work, about), (artist, architect), (architect, work)
91
KB1: about, architect, year, born, located
KB2: work, artist, year-constructed, location
C1: about, work, architect, artist
C2: year, year-constructed, born
C3: location, located
Fall 2020
Token Blocking in Each Attribute Cluster
C3.NY
e12, e15
92
about Eiffel Tower
architect Sauvestre
year 1889
located Paris e11
about Statue of Liberty
architect Bartholdi Eiffel
year 1886
located NY e12
about AugusteBartholdi
born 1834 e13
about Joan Tower
born 1938e14
work Lady Liberty
artist Bartholdi
location NYe15
work Eiffel Tower
year-constructed
1889
location Paris
e16
work Bartholdi Fountain
year-constructed
1876
location Washington D.C.
e17
C1: about, work, architect, artist
C2: year, year-constructed, born
C3: location, located
→ compare Lady Liberty to Auguste Bartholdi
C1.Tower
e11, e14, e16
C1.Bartholdi
e12, e13, e15, e17
Fall 2020
Other Blocking Techniques
Similarity join algorithms: rely on an inverted index built from the most discriminating (non-frequent) tokens of the descriptions and the prefix-filtering principle [Chaudhuri et al., 2006] (sketched below)
prefix filtering is effective only when the similarity threshold is extremely high
Frequent itemsets blocking: builds blocks for sets of tokens that frequently co-occur in entity descriptions
may significantly reduce the number of candidate pairs
may also significantly increase missed matches, especially between descriptions with few common tokens
Multidimensional blocking: constructs a collection of blocks for each similarity function used to resolve entities and aggregates them into a single multidimensional collection, taking into account the similarities of descriptions that share blocks
96
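A minimal sketch of the prefix-filtering principle for a Jaccard threshold t (the global token order is assumed to be given as a frequency map):

```python
# A sketch: two token sets can only reach Jaccard similarity t if their
# prefixes of length |x| - ceil(t*|x|) + 1, taken in a global order of
# ascending token frequency, share at least one token.
import math

def prefix(toks, t, freq):
    ordered = sorted(toks, key=lambda tok: freq[tok])   # rarest tokens first
    k = len(ordered) - math.ceil(t * len(ordered)) + 1
    return set(ordered[:k])

def may_match(x, y, t, freq):
    """Cheap filter: False guarantees Jaccard(x, y) < t."""
    return bool(prefix(x, t, freq) & prefix(y, t, freq))
```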
Fall 2020
Placing Entities in the Same Block
97
Method: Criterion
Token Blocking [Papadakis et al., 2011]: the descriptions have a common token in their values
Attribute Clustering Blocking [Papadakis et al., 2013]: the descriptions have a common token in the values of attributes that have similar values overall
Prefix-Infix(-Suffix) [Papadakis et al., 2012]: the descriptions have a common token in their literal values, or a common URI infix
ppjoin+ [Vernica et al., 2010]: the descriptions have a common token in their p first tokens (sorted in ascending frequency order)
Frequent itemsets [Kenig and Gal, 2013]: the descriptions have frequently co-occurring tokens in their values
Fall 2020
Blocking Algorithms Comparison
98
Methods compared (the original table classifies each as partitioning vs. overlapping, hash-based vs. sort-based, schema-agnostic or not, with a positive/negative/neutral assessment; the markers are not preserved in the transcript):
Standard blocking [Fellegi & Sunter 69, Kolb et al., 12b]
Q-grams [Gravano et al., 01]
Suffixes [Aizawa & Oyama, 05]
Sorted neighborhood [Hernàndez & Stolfo 95, Kolb et al., 12a]
Adaptive sorted neighborhood [Yan et al., 07]
Canopy clustering [McCallum et al., 00]
Token blocking [Papadakis et al., 11]
Attribute clustering [Papadakis et al., 13]
Prefix-infix(-suffix) [Papadakis et al., 12]
ppjoin+ [Vernica et al., 10]
Frequent itemsets [Kenig & Gal, 13]
Fall 2020
Matching and Resolving Entities (I): Block Post-Processing
Fall 2020
Meta-blocking: Improving the Efficiency of Blocking
101
Eliminate redundant comparisons & reduce unnecessary comparisons
Blocking → Meta-blocking → Matching
Fall 2020
Meta-blocking – The Blocking Graph
102
Eiffel
e1, e4, e2, e3
Tower
e1, e4, e5
Liberty
e2, e3
Paris
e1, e4
NY
e2, e3
1889
e1, e4
e1
e5
e2
e4
e3
1
3
11
1
5
1
1
edge weights = #common blocks
e1
e5
e2
e4
e3
14 comparisons to identify 2 matches e1-e4 and e2-e3
2 comparisons to identify 2 matches
Blocks (ToB):
Blocking graph (Nodes: entity descriptions, Edges: common block)
Pruned blocking graph (discard edges with weight below avg.: 1.75)
Prune edges to discard unnecessary comparisons between non-matches based on positive overlapping evidence
Sauvestre
e1, e4
Fall 2020
Edge Weighting and Pruning
Weighting schemes (how to weight the edges):
Common Blocks (CBS): wi,j = |Bi,j|
Jaccard (JS): wi,j = |Bi,j| / (|Bi| + |Bj| - |Bi,j|)
Enhanced CBS (ECBS): wi,j = CBS · log(|B|/|Bi|) · log(|B|/|Bj|)
Pruning methods (which edges to prune):
WEP: keep edges with weight above the average
CEP: keep the top-K edges overall
WNP: keep, for each node, the edges with weight above a local average
CNP: keep, for each node, its top-K edges
103
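A minimal sketch of meta-blocking with the CBS weighting scheme and WEP pruning, on the blocks of the example above (helper names are illustrative):

```python
# A sketch: weigh each entity pair by its number of common blocks (CBS),
# then keep only the edges whose weight is above the average (WEP).
from collections import defaultdict
from itertools import combinations

def wep(blocks):
    """blocks: dict of block key -> set of entity ids."""
    cbs = defaultdict(int)
    for entities in blocks.values():
        for pair in combinations(sorted(entities), 2):
            cbs[pair] += 1                       # w_ij = |B_ij|
    avg = sum(cbs.values()) / len(cbs)           # 1.75 in the example
    return {pair for pair, w in cbs.items() if w > avg}

blocks = {"Eiffel": {"e1", "e2", "e3", "e4"}, "Tower": {"e1", "e4", "e5"},
          "Liberty": {"e2", "e3"}, "1889": {"e1", "e4"}, "NY": {"e2", "e3"},
          "Sauvestre": {"e1", "e4"}, "Paris": {"e1", "e4"}}
print(wep(blocks))   # {('e1', 'e4'), ('e2', 'e3')}: 2 comparisons, 2 matches
```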
Fall 2020
Matching and Resolving Entities (II): Iterative Resolution Techniques
Fall 2020
Central vs. Peripheral KBs
105
Fall 2020
In Search of Similarity Evidence in KBs [Efthymiou et al. 2016, 2015]
Attribute-based comparisons: unique attributes (e.g., rdfs:label) provide strong evidence
>90% of matching pairs have >80% overlap similarity in the values of rdfs:label
Content-based comparisons:
central KBs: 3-4 common tokens in entity values
peripheral KBs: 1-2 common tokens in entity values
blocking algorithms miss up to 30% of matches in peripheral KBs
Relationship-based comparisons: matching neighbors provide positive evidence
>92% of pairs with at least one matching neighbor are matches in most KBs
some types of relationships provide strong negative evidence
dissimilar values for wasBornIn indicate a non-matching pair
106
Fall 2020
Types of Missed Matches (FNs)
107
▪ Type A: missed due to a third, matching description (transitivity); applicable to identify matches within a KB
▪ Type B: revealed by matches of their neighbors; can identify matches both within a KB and across different KBs
directs
is directed
Fall 2020
Iterative ER
108
Blocking → Matching
Iterative ER: identify new matches based on partial results, either of matches or of merges
Increases the number of matching entities
▪ Good for high Variety
Fall 2020
Iterative ER Approaches
Merging-based: new matches can be found by exploiting the merged (more complete) descriptions of previously identified matches
Idea: ER resembles a database self-join operation (of the initial set of descriptions with itself)
But: no knowledge about which descriptions may match, so all pairs of descriptions need to be compared
Matching-based: if descriptions related to entity ei match descriptions related to ej, then ei and ej are likely to match
Idea: ER resembles a graph traversal problem in which similarity is propagated until a fixed point is reached
Use positive or negative evidence to prioritize similarity re-computation
109
Fall 2020
Matching and Resolving Entities (II): Merging-based Iterative Resolution
Fall 2020
Merging-based ER– Formal Definition
Let E = {e1, ..., em} be a set of entity descriptions,
M : E × E → {true, false} be a match function, and
⊕ : E × E → E be a partial merge function
The merging-based (generic) resolution of entities in E is a set of descriptions E’, such that:
1. ∀ei, ej ∈ E with M(ei, ej) = true, ∃ek ∈ E’: ⊕(ei, ej) ≤ ek
2. ∀el ∈ E, ∃ek ∈ E’: el ≤ ek
3. no strict subset of E’ satisfies conditions 1 and 2
where el ≤ ek means that ek is at least as informative as el, regarding the same real-world entity
Note: E’ ⊆ Ē (the merge closure of E), i.e., E’ cannot produce more than E
Conditions 1 and 2: E’ produces at least all the information of E
Condition 3: minimal solution
111
Fall 2020
Match and Merge Functions: ICAR Properties [Benjelloun et al., 2009]
Idempotence: a description always matches itself, and merging it with itself yields the same description: ∀ei ∈ E, if M(ei, ei) = true then ⊕(ei, ei) = ei
Commutativity: the direction of match and merge is irrelevant: ∀ei, ej ∈ E, M(ei, ej) = M(ej, ei), and if they match then ⊕(ei, ej) = ⊕(ej, ei)
Associativity: the order of merges is irrelevant: ∀ei, ej, ek ∈ E, if ⊕(⊕(ei, ej), ek) and ⊕(ei, ⊕(ej, ek)) exist, then ⊕(⊕(ei, ej), ek) = ⊕(ei, ⊕(ej, ek))
Representativity: merging does not lose matches; no “negative evidence”: if ek = ⊕(ei, ej), then ∀el ∈ E such that M(ei, el) = true, also M(ek, el) = true
Transitivity is not assumed!
112
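A small sketch of the union class of match and merge (introduced on a later slide), which satisfies the ICAR properties; descriptions are modeled here as sets of (attribute, value) pairs:

```python
# A sketch: union-match requires at least one common (attribute, value)
# pair; union-merge keeps all pairs, so merging never loses information.

def union_match(ei, ej):
    return bool(ei & ej)

def union_merge(ei, ej):
    return ei | ej

e1 = frozenset({("name", "S. Kubrick"), ("birthPlace", "Manhattan")})
e2 = frozenset({("name", "Stanley Kubrick"), ("birthPlace", "Manhattan")})
# Idempotence: union_match(e1, e1) is True and union_merge(e1, e1) == e1
# Representativity: anything matching e1 also matches union_merge(e1, e2),
# since all of e1's pairs survive the merge
print(union_match(e1, e2), union_merge(e1, e2) == e1 | e2)   # True True
```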
Fall 2020
Merge Domination & Monotonicity
When the match and merge functions satisfy the ICAR properties, there is a natural (partial) domination order of descriptions
Given two descriptions e1 and e2, we say that e1 is merge-dominated by e2, denoted e1 ≤ e2, if M(e1, e2) = true and ⊕(e1, e2) = e2
e1 does not add information
▪ Merged descriptions always dominate the ones they were derived from:
∀e1, e2 ∈ E such that M(e1, e2) = true, it holds that e1 ≤ ⊕(e1, e2) and e2 ≤ ⊕(e1, e2)
The match function is monotonic: ∀e1, e2, e3 ∈ E, if e1 ≤ e2 and M(e1, e3) = true, then M(e2, e3) = true
The merge function is monotonic:
∀e1, e2, e3 ∈ E, if e1 ≤ e2 and M(e1, e3) = true, then ⊕(e1, e3) ≤ ⊕(e2, e3)
∀e1, e2, e3 ∈ E, if e1 ≤ e3, e2 ≤ e3 and M(e1, e2) = true, then ⊕(e1, e2) ≤ e3
113
Generic Entity Resolution with Swoosh, F. Naumann, 20.6.2013
Fall 2020
Ordering Match & Merge to Get the Right Answer?
114
e1
e3
M(e2,e3) = true, e23 = ⊕(e2,e3)
M(e1,e23) = true
KB1:SKBRK
KB1:name “S. Kubrick”
KB1:birthPlace KB1:Manhattan
KB1:deathPlace KB1:UnitedKingdom
KB1:diedIn 1999
KB1:Stanley_Kubrick
KB1:birthPlace KB1:Manhattan
KB1:bornIn 1928-7-26
KB1:parents KB1:Gertrude Kubrick
KB1:parents KB1:Jacques Leonard Kubrick
rdf:type yago:AmericanFilmDirectors
KB2:SKubrick
foaf:name “Stanley Kubrick”
KB2:place_of_birth KB2:MNHT
rdf:type foaf:Person
KB2:activeYearsEndYear 7/3/1999
KB2:directed KB2:A_Clockwork_Orange
KB12:Stanley_Kubrick
birthPlace Manhattan
bornIn 1928-7-26
parents Gertrude Kubrick
parents Jacques Leonard Kubrick
rdf:type AmericanFilmDirectors
activeYearsEndYear 7/3/1999
directed A_Clockwork_Orange
e2
Fall 2020
R-Swoosh [Benjelloun et al., 2009]
Assumes ICAR and merge domination
Idea 1: if M(e1, e2) = true, we can remove e1 and e2 and keep only ⊕(e1, e2)
Whatever would match e1 or e2 now also matches ⊕(e1, e2) (representativity and associativity)
Idea 2: removal of dominated descriptions is not necessary as a last step of the algorithm
Assume e1 and e2 appear in the final answer, where e1 ≤ e2
Then M(e1, e2) = true and ⊕(e1, e2) = e2
Thus the comparison of e1 and e2 should have generated the merged description e2, and e1 should have been eliminated
115
Generic Entity Resolution with Swoosh, F. Naumann, 20.6.2013
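A minimal sketch of R-Swoosh (the match and merge functions are parameters, e.g., the union class sketched earlier):

```python
# A sketch: compare each description against the output set E'; on a
# match, remove both and push the merged description back into E.

def r_swoosh(E, match, merge):
    E, E_out = list(E), []
    while E:
        e = E.pop()
        buddy = next((x for x in E_out if match(e, x)), None)
        if buddy is None:
            E_out.append(e)               # no match found: e joins the output
        else:
            E_out.remove(buddy)           # Idea 1: drop e and its match,
            E.append(merge(e, buddy))     # keep only the merged description
    return E_out

# e.g., r_swoosh([e1, e2], union_match, union_merge) -> [e1 | e2]
```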
Fall 2020
R-Swoosh Example
116
e1
e2
e3
E E '
e2
e3
e1
E E '
e23 e1
E E '
e123
E E '
e3
e1
e2
E E '
e123
E E '(a) (b) (c)
(d) (e) (f)
Fall 2020
ER with ICAR properties
The generic ER process is guaranteed to be finite
Entity descriptions can be matched and merged in any order
Dominated entity descriptions can be discarded at any time
The union class of match and merge:
Union-merge: all values are kept in merged descriptions
Union-match: at least one of the values in common
Commutativity of the match and merge functions does not always hold for highly heterogeneous descriptions
117
Fall 2020
Matching and Resolving Entities (II): Matching-based Iterative Resolution
Fall 2020
Matching-based Iterative ER
Pair-wise ER: matching decisions are made independently (deduplication, linkage)
Collective ER: matching decisions depend on other matching decisions, according to positive and negative evidence (general constraints)
Similarity propagation approaches (more scalable):
Dependency graphs of matching decisions
Collective relational clustering
Probabilistic approaches (scalability is an open issue):
Generative models (for acyclic dependencies between match decisions)
Undirected models based on Markov networks (sometimes with a first-order logic syntax)
Hybrids of constraints and probabilistic models
119
Fall 2020
Similarity Propagation Approaches
A graph structure encodes the similarity between entity descriptions and the matching decisions; matching is iteratively assessed by propagating similarity values
The details of how the graph is constructed and traversed, and how (content and context) similarity is computed, vary
Similarity-propagation ER: the match function is re-computed at each iteration step by considering previous matching decisions:
Mn(ei, ej) = true, if simn-1(ei, ej) ≥ θ
Mn(ei, ej) = false, if simn-1(ei, ej) ≤ θ’
Mn(ei, ej) = undecided, otherwise
Total similarity: sim(ei, ej) = a·simc(ei, ej) + (1-a)·simnbr(nbr(ei), nbr(ej)),
where nbr(e) denotes the (in, out) neighborhood nodes of e
120
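A minimal sketch of such a propagation loop (the thresholds, the weight a, and the use of the best neighbor-pair similarity are assumptions for illustration; the update is a contraction in the neighbor term, so the loop converges):

```python
# A sketch: re-compute pair similarities from content similarity and the
# best similarity among neighbor pairs, until a fixed point is reached.

def propagate(pairs, sim_c, nbr, a=0.7, theta=0.8, theta2=0.3, eps=1e-6):
    """pairs: list of (ei, ej); sim_c: content similarity; nbr(e): neighbors."""
    sim = {p: sim_c(*p) for p in pairs}
    changed = True
    while changed:
        changed = False
        for (ei, ej) in pairs:
            nbr_sim = max((sim.get((x, y), 0.0)
                           for x in nbr(ei) for y in nbr(ej)), default=0.0)
            new = a * sim_c(ei, ej) + (1 - a) * nbr_sim
            if abs(new - sim[(ei, ej)]) > eps:
                sim[(ei, ej)], changed = new, True
    return {p: ("match" if s >= theta else
                "non-match" if s <= theta2 else "undecided")
            for p, s in sim.items()}
```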
Fall 2020
Maintaining the Order of Comparisons
In similarity-propagation approaches, the order of comparisons is dynamic
Graph traversal is usually supported by a priority queue (PQ) on the similarity scores of nodes
As entities are resolved, the PQ is updated to maximize effectiveness and reduce re-comparisons
Different strategies of order maintenance:
Based on heuristics:
type of nodes and edge direction [Dong et al. 2005]
degree of nodes [Weis & Naumann 2006]
edge weights [Kalashnikov & Mehrotra 2006]
Triggered by recent matches [Böhm et al. 2012, Lacoste-Julien et al. 2013]
Lazy maintenance [Herschel et al. 2012, Altowim et al. 2014]
121
Fall 2020
Dependency Graph [Dong et al., 2005]
Works on an entity graph constructed from the relational records:
nodes represent similarity comparisons between pairs of records and/or their attribute values (real-valued)
edges represent dependencies between matching decisions, based on the matching of the associated nodes (Boolean-valued)
A matching decision is taken when the real-valued similarity score (between 0 and 1) of a node is above a threshold θ
If it exceeds the threshold, it is marked as match, otherwise as undecided; if no more neighbors are undecided, it is marked as non-match
Idea 1: consider richer matching evidence
Idea 2: propagate similarity between matching decisions
Idea 3: gradually enrich references by merging attribute values (Swoosh-style)
122
Fall 2020
Dependency Graph: Example
Let E be a set of entity descriptions
A node v = {ei, ej}, where ei, ej ∈ E, i ≠ j
An edge e = (va, vb) from va = {eai, eaj} to vb = {ebi, ebj} implies ebi, ebj ∈ values(eai) ∪ values(eaj)
Directed edges are used when the dependency holds in only one direction; dependencies can be stated by domain experts, or inferred from the data semantics (e.g., keys/foreign keys, RDF properties)
Include only nodes whose two entities have the potential to be similar
123
v1
v2
v3 v4
Fall 2020
Consider Richer Matching Evidence [Dong et al., 2005]
Positive evidence (i.e., constraints for match nodes) is captured by the Boolean similarity of neighborhood nodes:
Strong Boolean: resolution implies the resolution of the neighbor
E.g., if two movies are matched, then their directors must also be matched
Weak Boolean: no direct implication
E.g., the similarity of two movies increases as their rdfs:labels become highly similar
Negative evidence (i.e., constraints for non-match nodes) is verified after similarity propagation is performed, and inconsistencies are fixed
124
Fall 2020
Similarity Propagation [Dong et al., 2005]
Similarity function for node u: sim(u) = Srv + Ssb + Swb
Srv: from real-valued neighbors (decision-tree shape)
Ssb: from strong-Boolean-valued neighbors
Swb: from weak-Boolean-valued neighbors
When a node is matched, the similarity scores of its neighbors are re-computed
The process converges if:
the similarity score is monotone in the similarity values of the neighbors
neighbor similarities are re-computed only if the similarity increase is not too small
125
Fall 2020
Traversing the ER Graph
126
Initially all nodes are active and placed in the PQ; a node is processed before its out-neighbors
h
a b
d
c
f
g
e
PQ
f
a
b
c
h
d
e
g
Fall 2020
PQ
b
c
h
d
e
g
Traversing the ER Graph
127
h
a b
d
c
f
g
e
merged node
inactive node
Fall 2020
PQ
c
h
d
e
g
Traversing the ER Graph
128
h
a b
d
c
f
g
e
Fall 2020
PQ
h
d
e
g
Traversing the ER Graph
129
h
a b
d
c
f
g
e
Fall 2020
PQ
d
e
g
Traversing the ER Graph
130
h
a b
d
c
f
g
e
Fall 2020
PQ
g
b
e
Traversing the ER Graph
131
h
a b
d
c
f
g
e
Fall 2020
PQ
b
e
Traversing the ER Graph
132
h
a b
d
c
f
g
e
Fall 2020
PQ
e
c
Traversing the ER Graph
133
h
a b
d
c
f
g
e
Fall 2020
PQ
c
Traversing the ER Graph
134
h
a b
d
c
f
g
e
Fall 2020
PQ
Traversing the ER Graph
135
h
a b
d
c
f
g
e
Fall 2020
Linda [Böhm et al. 2012]
Works on an entity graph constructed from the RDF descriptions; exploits the unique mapping constraint between two KBs
Key Idea: the more matching neighbors via similar relationships two descriptions have, the more likely it is that they match
String similarity of the literal values of entities: checked once
Contextual similarity of the graph neighbors: checked iteratively
Two square matrices (|E| × |E|) are used:
X captures the identified matches (binary values)
Y captures the pair-wise similarities (real values)
Initialization: common neighbors & string similarity of literals
Updates: use the newly identified matches of X, until the priority queue (extracted from Y) becomes empty:
Get the pair (ei, ej) with the highest similarity: match by default!
Update X: matches of ei are also matches of ej
Update the similarity of the nodes influenced by the new matches
136
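A minimal sketch of Linda’s main loop (the neighbor-update and similarity re-computation functions are parameters; helper names are illustrative):

```python
# A sketch: the head of the PQ is a match by default; the 1-1 constraint
# discards competing pairs; influenced pairs are re-scored and re-queued.
import heapq

def linda(pq, influenced, recompute):
    """pq: list of (-similarity, ei, ej); influenced(ei, ej): pairs whose
    similarity depends on the new match; recompute: new similarity."""
    heapq.heapify(pq)
    matched, matches = set(), []
    while pq:
        _, ei, ej = heapq.heappop(pq)
        if ei in matched or ej in matched:
            continue                        # unique mapping (1-1) constraint
        matches.append((ei, ej))            # head of the PQ: match by default
        matched |= {ei, ej}
        for (ek, el) in influenced(ei, ej): # update the influenced pairs
            heapq.heappush(pq, (-recompute(ek, el), ek, el))
    return matches
```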
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 0 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
137
PQ
e1 – e4
e2 – e4
e1 – e3
e5 – e3
e2 – e3
…
e3 e4
e2 e1
e5
A priority queue, derived by an initial similarity computation between all pairs, based on their attribute values
Linda Algorithm Example
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
138
PQ
e1 – e4
e2 – e4
e1 – e3
e5 – e3
e2 – e3
…
e3 e4
e2 e1
e5
Linda Algorithm Example
the head of PQ is a match by default
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
139
PQ
e2 – e4
e1 – e3
e2 – e3
e5 – e3
…
e3 e4
e2 e1
e5
unique mapping constraint (1-1 Assumption)
similarity re-computation, based on the matching neighbors and the names of the links to them
Linda Algorithm Example
directs
directs
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 1 0 0
e3 1 0 0
e4 1 0
e5 1
140
PQ
e2 – e3
e5 – e3
…
e3 e4
e2 e1
e5
directs
directs
Linda Algorithm Example
Fall 2020
Matches e1 e2 e3 e4 e5
e1 1 0 0 1 0
e2 1 1 0 0
e3 1 0 0
e4 1 0
e5 1
141
PQ
e5 – e3
…
e3 e4
e2 e1
e5
directs
directs
stops when PQ is empty
Linda Algorithm Example
unique mapping constraint (1-1 Assumption)
Fall 2020
Linda Distributed Version
Matches e1 e2 e3 e4 e5
e1 1 0 0 0 0
e2 1 0 0 0
e3 1 0 0
e4 1 0
e5 1
142
PQ
e1 – e4
e2 – e4
e1 – e3
e2 – e3
…
e3 e4
e2 e1
e5
knows
knows
PQ
e5 – e3
e5 – e4
…
Node 1
Node 2
Node 1
Node 2
The node to hold the PQ entry (a,b) and the respective vertex neighbors is determined by a modulo operation of the first component (a)
Fall 2020
Matching and Resolving Entities (II): Progressive Resolution Techniques
Fall 2020
Progressive ER
144
Blocking → Planning → Matching → Update
Optimization: maximize the benefit (number or type of matches) for a given cost (number of comparisons, disk/cloud accesses)
Progressive ER: estimates which part of the data to resolve next and adapts this decision in a pay-as-you-go fashion
▪ Good for high Velocity
Fall 2020
Progressive Approach to Relational ER [Altowim et al. 2014]
Key Idea: divides the ER process into several windows and generates a resolution plan for each window
– specifies which blocks and entity pairs within these blocks will be resolved during the plan-execution phase of a window
– associates with each identified pair the order in which to apply the similarity functions on the attributes of the two entities
Lazy resolution strategy to resolve pairs with the smallest cost
Unlike single-entity-type resolution, block-based prioritization is significantly more important when resolving multiple types
145
Fall 2020
Relational Progressive ER: Example
146
A1 M1
A2 D2
D1
M2
v1
v2
v3
v4 v8
v7
v6
v5 v9
v10
v11
v12
Altowim et al. Progressive Approach to Relational ER VLDB 2014
Actors Movies Directors
Blocks with entities of the same type
A1 M1
v1
v2
v3 v7
v6
v5
Fall 2020
Matching Probability Estimation
The nodes in the ER graph influencing a pair of entities are interpreted as common causes of the same effect
The probability that a node is processed at each round depends on the number of cause nodes that are matches and/or on the percentage of matching nodes in the block of the effect node
147
Noisy-OR model: not required to know all the causes
Effect: the node vi considered for matching
Causes: (a) the influencing matching nodes xj, and (b) the fraction of matching pairs in the block of vi
Altowim et al., Progressive Approach to Relational ER, VLDB 2014
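A minimal sketch of the noisy-OR combination (the leak term and the probabilities are illustrative):

```python
# A sketch: each matching cause j fails to trigger the effect with
# probability (1 - p_j); a leak term covers the causes we do not observe.

def noisy_or(cause_probs, leak=0.05):
    """P(effect) = 1 - (1 - leak) * prod(1 - p_j)."""
    q = 1.0 - leak
    for p in cause_probs:
        q *= 1.0 - p
    return 1.0 - q

# e.g., two matching neighbor nodes plus the block's matching fraction:
print(noisy_or([0.3, 0.2, 0.1]))   # ~0.52
```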
Fall 2020
Heuristics for Estimating Node Benefit
148
A2 D2M2
A1 M1 D1
v1
v2
v3 v7
v6
v5
v4 v8
v9
v10
v11
v12
[Figure: direct benefit = fraction of matching pairs in the block (e.g., v1, v2 in M1); indirect benefit = match/non-match influence between the blocks A and M]
The Impact is estimated for k nodes of the same entity type as the average number of their directly dependent unresolved nodes
Altowim et al., Progressive Approach to Relational ER, VLDB 2014
Fall 2020
Traversing the ER Graph (in Blocks)
[ER graph: nodes a, b, c, d, e, f, g, h with scores 0.1-0.3, grouped into blocks F1, A1, D1, M1, M2, D2]
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Initially, a block of the entity type with the highest out-degree (impact) is selected (at random), since P(vi) is the same for all nodes (of the same type)
Impact(directors) = 3, Impact(actors) = 2, Impact(movies) = 1, Impact(family) = 1; initial P(vi) = δ
s(F1) = s(e) = 0.1 + 0.1·1 = 0.2
s(A1) = s(c) + s(h) = (0.1 + 0.1·2)·2 = 0.6
s(M1) = s(a) + s(d) = (δ + δ·1)·2 = 4δ
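A minimal sketch of the block-scoring rule, reproducing s(A1) from the figure (helper names are illustrative):

```python
# A sketch: s(B) = sum over vi in B of P(vi) + P(vi) * Impact(vi).

def block_score(nodes, P, impact):
    return sum(P[v] + P[v] * impact[v] for v in nodes)

P = {"c": 0.1, "h": 0.1}
impact = {"c": 2, "h": 2}                   # actor nodes have Impact = 2
print(block_score(["c", "h"], P, impact))   # 0.6 = s(A1)
```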
Fall 2020
ab
d
cf
g
e
h
0.3
0.1
0.1
0.2
0.3
0.30.2
0.1
F1
A1
D1M1
M2
Then, update the scores, load the block with the highest score and resolve its nodes
D2
s(Α1)=s(c)+s(h)=(0.1+0.1*2)*2=0.6
s(M1)=s(a)+s(d)= (δ+δ*1)*2=4δ
s(F1)=s(e)= 0.1+0.1*1=0.2
Traversing the ER Graph (in Blocks)
Impact(directors) = 3,Impact(actors) = 2,Impact(movies) = 1Impact(family) = 1Initial P(vi) = δ
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Fall 2020
ab
d
cf
g
e
h
0.3
0.1
0.1
0.2
0.3
0.30.2
0.1
F1
A1
D1M1
M2
Then, update the scores, load the block with the highest score and resolve its nodes
D2
s(M1)=s(a)+s(d)= (δ+δ*1)+ (0.2+0.2*1)=2δ+0.4
s(F1)=s(e)= 0.1+0.1*1=0.2
Traversing the ER Graph (in Blocks)
Impact(directors) = 3,Impact(actors) = 2,Impact(movies) = 1Impact(family) = 1Initial P(vi) = δ
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Fall 2020
ab
d
cf
g
e
h
0.3
0.1
0.1
0.2
0.3
0.30.2
0.1
F1
A1
D1M1
M2
Then, update the scores, load the block with the highest score and resolve its nodes
D2
s(D2)=s(g)= δ+δ*3=4δ
s(M2)=s(f)= δ+δ*1=2δ
s(F1)=s(e)= 0.1+0.1*1=0.2
Traversing the ER Graph (in Blocks)
Impact(directors) = 3,Impact(actors) = 2,Impact(movies) = 1Impact(family) = 1Initial P(vi) = δ
Assign to blocks the sum of their nodes’ scores: s(B) = Σ_{vi∈B} ( P(vi) + P(vi) · Impact(vi) )
Fall 2020
Comparison of ER Graph Traversals
Property: [Dong et al. 2005] / [Böhm et al. 2012] / [Altowim et al. 2014]
Seed node selection: any node with no in-edges / the node with the highest content similarity / any node of the entity type with the highest out-degree
Matching condition: similarity is above a threshold value / the highest score (head of the PQ) / a (black-box) resolve function returns true
Similarity/evidence update (PQ): weighted sum of the in-neighbors’ similarities / weighted sum of the matching in-/out-neighbors’ similarities / leaky noisy-OR of the matching in-neighbors
Backtracking: (markers not preserved in the transcript)
Traversal policy: topological sort / advance in the PQ the neighbors (with the most similar relationships) of the recent match / sort the PQ based on #in-matches and #out-neighbors
Fall 2020
Web-scale Progressive ER: Ongoing Work
Relational progressive ER algorithms assume that the more entity pairs are correctly identified, the higher the quality of the result is expected to be
We are interested in characterizing the quality of the resolved pairs:
# of real-world entities resolved (shallow strategy): entity-centric search (entity coverage)
# of real-world entity graphs resolved (deep strategy): entity-centric recommendation (relationship completeness)
# of descriptions resolved for the same real-world entity (tail strategy): web-scale knowledge curation (attribute completeness)
154
Fall 2020
Matching and Resolving Entities (II): Challenges and Open Issues
Fall 2020
Challenges and Open Issues
Tight coupling of blocking with iterative matching/merging:
better control of block characteristics w.r.t. the entity similarity subsequently used (see [J. Fisher et al. 2015])
Progressive ER with quality guarantees:
guarantees (e.g., coverage) regarding the quality of matches/merges w.r.t. subsequent entity-centric services and data analysis tasks
ER for Big Data:
algorithms for high-Velocity (see [D. Firmani et al. 2016]), high-Variety, and high-Volume entity descriptions (see [Q. Wang et al. 2015] [L. Kolb et al. 2012])
Large-scale ER testbeds:
real-world ground-truth datasets for different match types and open-source ER platforms (see [Efthymiou et al. 2015, 2016])
156
Fall 2020
Challenges and Open Issues
Crowdsourced ER: reduce the crowdsourcing cost for obtaining ground truth (see [Chai et al. 2016] [Gokhale et al. 2014] [Wang et al. 2012])
Temporal ER: resolve evolving entity descriptions and analyse the history of descriptions (see [Dong & Tan 2015])
Uncertain ER: consider confidence scores when resolving certain & uncertain entity descriptions (see [Gal 2014] [Demartini et al. 2013])
Privacy-aware ER: trade-off between entity obfuscation techniques and the quality of ER results (see [Whang & Garcia-Molina 2013])
157
Fall 2020
http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=823
Fall 2020
The Minoan ER Framework
159
http://csd.uoc.gr/~vefthym/minoanER
Fall 2020
License
These slides are made available under a Creative Commons Attribution-ShareAlike license (CC BY-SA 3.0): http://creativecommons.org/licenses/by-sa/3.0/
You can share and remix this work, provided that you keep the attribution to the original authors intact, and that, if you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one
160
Fall 2020
Questions?
161