A Hitchhiker’s guide to OntologyYAGO ElvisPedia 64. Match classes, entities, & relations...
Transcript of A Hitchhiker’s guide to OntologyYAGO ElvisPedia 64. Match classes, entities, & relations...
A Hitchhiker’s guideto Ontology
Fabian M. SuchanekTelecom ParisTech UniversityParis, France
me>13intro>19
1
permanent
money freedom
2003:
2005:
2008:
Fabian M. Suchanek
22
money
permanent
freedom
2003: BSc in Cognitive ScienceOsnabruck University/DE
2005: MSc in Computer ScienceSaarland University/DE
2008: PhD in Computer Science
Max Planck Institute/DE
Fabian M. Suchanek
33
2003: BSc in Cognitive ScienceOsnabruck University/DE
2005: MSc in Computer ScienceSaarland University/DE
2008: PhD in Computer Science
Max Planck Institute/DE
money
permanent
freedom
Fabian M. Suchanek
44
2009: PostDoc at Microsoft ResearchSilicon Valley/US
permanent
money freedom
Fabian M. Suchanek
55
2009: PostDoc at Microsoft ResearchSilicon Valley/US
permanent
money freedom
Fabian M. Suchanek
66
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
money
permanent
freedom
Fabian M. Suchanek
77
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
money
permanent
freedom
Fabian M. Suchanek
88
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
2012: Research group leader
Max Planck Institute/DE
permanent
money freedom
Fabian M. Suchanek
99
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
2012: Research group leader
Max Planck Institute/DE
permanent
money freedom
Fabian M. Suchanek
1010
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
2012: Research group leader
Max Planck Institute/DE2013: Associate Professor
Telecom ParisTech/FRmoney
permanent
freedom
Fabian M. Suchanek
1111
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
2012: Research group leader
Max Planck Institute/DE2013: Associate Professor
Telecom ParisTech/FRmoney
permanent
freedom
Fabian M. Suchanek
1212
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
2012: Research group leader
Max Planck Institute/DE2013: Associate Professor
Telecom ParisTech/FR2016: Full Professor
Telecom ParisTech/FR
money
permanent
freedom
Fabian M. Suchanek
1313
2009: PostDoc at Microsoft ResearchSilicon Valley/US
2010: PostDocINRIA Saclay/FR
2012: Research group leader
Max Planck Institute/DE2013: Associate Professor
Telecom ParisTech/FR2016: Full Professor
Telecom ParisTech/FR
money
permanent
freedom
Fabian M. Suchanek
1414
15
I am an Elvis Fan!
1961
15
Old Elvis from Sachsmedia
At Elvis’ ranchin New Zealand
Ranch is Murchison Lodge in New Zealand 17
Meeting Elvis Presley
1961
2015
17
18
Knowledge Cards
Elvis Presley
18
19
Knowledge Cards
Elvis Presley
???
19
For us, a knowledge base (“KB”, “ontology”) is a graph, wherethe nodes are entities and the edges are relations.
20
Knowledge Bases
type
born 1935
Singer
Person
subclassOf
(We do not distinguish T-Box and A-Box.)
20
Miningknowledge
bases->
DeterminingIncompleteness->
A ∧B ⇒ C
Protectingknowledge bases->
Creativly extendingknowledge bases ->
Constructingknowledge bases
21
Singer
type
born
Knowledge Base Life Cycle
1935
21
Elvis Presley was one
of the best blah blahblub blah don’t readthis, listen to thespeaker! blah blah blahblubl blah you are stillreading this! blah blah
blah blah blabbel blah
Born: 1935In: Tupelo...
Categories:Rock&Roll, American Singers,Academy Award winners...
Extracting from Wikipedia
22
Elvis Presley
ElvisPresley
AmericanSinger
1935 Tupelo
USAAcAward
type
bornIn
locatedInwon
born
22
See it online
Intermediate extractors• clean facts• deduplicate facts and entities• check consistency
Extractor
GeoNames Extractor
Extractor
Extractor
23
Creating a large knowledge base
ensuring high quality (95%)
Extractor
Extractor
ExtractorWordNet
See it locally23
Intermediate extractors• clean facts• deduplicate facts and entities• check consistency
See it online See it locally
multilingual>29
Extractor
GeoNames
WordNet
Extractor
Extractor
Extractor
24
Creating a large knowledge base
ensuring high quality (95%)
Extractor
Extractor
Extractor
24
Extractor
25
Integrating multilingual Wikis
was born
e nato
est ne
Extractor
25
Extractor
26
was born
e nato
est ne
map?
Extractor
Integrating multilingual Wikis
26
Extractor
27
was born
e nato
est ne
?Extractor
Extractor
Integrating multilingual Wikis
27
<Elvis,USA><Mao,China>
<Elvis,USA><Mao,China>
28
bornIn
est ne
Integrating multilingual Wikis
28
<Elvis,USA><Mao,China>
<Elvis,USA><Mao,China>
est ne= bornIn
P (S1 ⊇ S2) > θ
29
bornIn
est ne
[CIDR 2015]
?
Integrating multilingual Wikis
29
<Elvis,USA><Mao,China>
<Elvis,USA><Mao,China>
est ne= bornIn
P (S1 ⊇ S2) > θ
30
bornIn
est ne
[CIDR 2015]
?
Integrating multilingual Wikis
30
Example: YAGO about Elvis
3131
Wikipedia + WordNettime and space10 languages
100 relations100m facts10m entities95% accuracyused by DBpedia
and IBM Watson
http://yago-knowledge.org
Caveat:focus onprecision!
canonic&IBEX>42 32
YAGO: a large knowledge base
[WWW2007,JWS2008,WWW2011 demo,AIJ2013,WWW2013 demo,CIDR2015]
canonic>37
soon open-source!
32
ElvisPresley
Mr.Presley
Mr.Obama
ElvisHurricane
Canonicalizing New Entities
33
Elvis?
??
??
?
?
33
Use hierarchical agglomerative clustering
• TF-IDF token overlap• Triple overlap• String Similarity
•Word overlap in source docs• Type overlap• Overlap of co-mentions
ElvisHurricane
ElvisPresley
Mr.Presley
Mr.Obama
details>36
Canonicalizing New Entities
34
Elvis
34
• TF-IDF token overlap• Triple overlap• String Similarity
•Word overlap in source docs• Type overlap• Overlap of co-mentions
MachineLearning
Use hierarchical agglomerative clustering
ElvisHurricane
ElvisPresley
Mr.Presley
Mr.Obama
Canonicalizing New Entities
35
Elvis
35
• TF-IDF token overlap• Triple overlap• String Similarity
•Word overlap in source docs• Type overlap• Overlap of co-mentions
MachineLearning
MachineLearning
Use hierarchical agglomerative clustering
ElvisHurricane
ElvisPresley
Mr.Presley
Mr.Obama
Canonicalizing New Entities
36
Elvis
36
• TF-IDF token overlap• Triple overlap• String Similarity
•Word overlap in source docs• Type overlap• Overlap of co-mentions
MachineLearning
MachineLearning
Use hierarchical agglomerative clustering
ElvisHurricane
ElvisPresley
Mr.Presley
Mr.Obama
Canonicalizing New Entities
37
Elvis
[CIKM 2014]
37
ElvisHurricane
ElvisPresley
Mr.Presley
Mr.Obama
IBEX>42
Canonicalizing New Entities
38
Elvis
[CIKM 2014]
38
39
Goal: Harvest entities from the Web
39
c© Whittleseacitylac
Unique identifiers can be verified by a checksum.They exist for products, books, documents, chemicals,...
40
IBEX: Collect unique ids
887128476661
40
c© Whittleseacitylac 41
IBEX: Collect unique ids
id name
1234 Puma PowerTech Blaze1234 Please choose
1234 Puma Shoe1234 Puma PowerTech Blaze
5678Sony Cybershot TS1005678Please choose
... ...
URLu1u1u2u2u3u3...
41
Found• 13m email addresses
with their name• 235K chemical products
• 1.4m books• 1.1m products... with an accuracy of 73%-96%
Analyzed
• Global trade flow• frequent email providers• frequent people names and more
All data available online athttp://resources.mpi-inf.mpg.de/d5/ibex/
42
IBEX: analyses
[WebDB 2015]
42
Miningknowledge
bases->
DeterminingIncompleteness->
A ∧B ⇒ C
Protectingknowledge bases->
Creativly extendingknowledge bases ->
Constructingknowledge bases
43
Singer
type
born
Knowledge Base Life Cycle
1935
43
married(x, y) ∧ hasChild(x, z)⇒ hasChild(y, z)
PCA>49
Rule Mining finds patterns
married
hasChild hasChild
4444
But: Rule mining needs counter examplesand RDF ontologies are positive only
married(x, y) ∧ hasChild(x, z)⇒ hasChild(y, z)
PCA>49
Rule Mining finds patterns
45
hasChild hasChild
married
45
Assumption:If we know r(x,y1),..., r(x,yn),then all other r(x,z) are false.
Partial Completeness Assumption
46
hasChild
46
Assumption:If we know r(x,y1),..., r(x,yn),then all other r(x,z) are false.
Partial Completeness Assumption
47
hasChild
47
Assumption:If we know r(x,y1),..., r(x,yn),then all other r(x,z) are false.
conf>49
Partial Completeness Assumption
48
hasChild
48
~B ⇒ r(x, y)
married(z, x) ∧ hasChild(z, y)⇒ hasChild(x, y)
conf( ~B ⇒ r(x, y)) = #x,y: ~B∧r(x,y)#x,y:∃y′: ~B∧r(x,y′)
# of parent-child
relations that wepredict correctly
in the KB
# of parent-child
relations that wepredict and where
a child is known
Partial Completeness Assumption
4949
Caveat: rules cannotpredict the unknownwith high precision
AMIE is based on an efficient in-memory databaseimplementation.
r(x, y) ∧ r′(z, y)⇒ r”(x, z)
AMIE finds rules in ontologies
AMIE
(5min)
5050
type(x, pope)⇒diedIn(x,Rome)
AMIE finds rules in ontologies
51
[WWW 2013, VLDB journal 2015]
AMIE
(5min)
>Rosa
51
Matching heterogeneous KBs
52
USA
bornInCountryUSA
TupelobornInCity locatedIn
52
1. Coalesce the KBs
53
USAbornInCountry
bornInCityTupelo
locatedIn
53
bornInCity(x, y) ∧ locatedIn(y, z)⇒ bornInCountry(x, z)
Caveat:Spurious
correlations
2. Mine rules
54
USAbornInCountry
AMIE
“ROSA rule”
TupelobornInCity locatedIn
54
⇒ bornInCountry(x, z)bornInCity(x, y) ∧ locatedIn(y, z)
ROSA rules match ontologies
55
[AKBC 2013]
“ROSA rule”
55
Miningknowledge
bases->
DeterminingIncompleteness->
A ∧B ⇒ C
Protectingknowledge bases->
Creativly extendingknowledge bases ->
Constructingknowledge bases
56
Singer
type
born
Knowledge Base Life Cycle
1935
56
Incompleteness
59
marriedTo
c© Girl Happy
?
marriedTo
Given a subject s anda relation r, do we knowall o with r(s, o) ?
59
Signals for Incompleteness
60
marriedTo
?
marriedTo
Closed World AssumptionPartial Completeness AssumptionPopularity oracle
AMIE oracle: Learn rules such asmoreThan1(x, hasParent)⇒ complete(x, hasParent)
No-change oracleStar-pattern oracle
Class-oracle
>Details
60
Signals for Incompleteness (F1)
61* = biased training sample61
Signals for Incompleteness
62
marriedTo
marriedTo
AMIE can predict incompleteness
• bornIn: 100% F1-measure• diedIn: 96%• directed: 100%• graduatedFrom: 87%
• hasChild: 78%• isMarriedTo: 46%... and more.
[WSDM 2017]
>paris
62
If incomplete, use another KB
63
birthDate
1935
type
RockSinger
married
1935-01-08
born
Singer
type
plays
YAGO ElvisPedia
63
Wo is the spouseof the guitar
player?
No Links => No Use
64
birthDate
1935
type
RockSinger
married
1935-01-08
born
Singer
type
plays
YAGO ElvisPedia
64
Match classes, entities, & relations
subClassOf
sameAs
subPropertyOf
65
birthDate
1935
type
RockSinger
married
1935-01-08
born
Singer
type
plays
YAGO ElvisPedia >details
65
Match classes, entities, & relations
”Elvis” ”Elvis”
name label
6666
1. Match literals
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
6767
1. Match literals
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
6868
2. Assume small equivalence of all relations
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
6969
2. Assume small equivalence of all relations
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
7070
What about matching the entities?
What does it mean that both Elvises share the same name?
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
71>fun
71
ifun(r, y) = 1#x:r(x,y)
ifun(r) = HMyifun(r, y)
Inverse Functionality
”Elvis”
name
1935
born
ifun(name, Elvis)= 1 / 2ifun(born, 1935)= 1/10
ifun(name)= 0.9ifun(born)= 0.1
72>fun
72
ifun(r) = #x:∃y:r(x,y)#x,y:r(x,y)
# of subjectsdivided by
# of factsifun(r) = HMyifun(r, y)
Inverse Functionality
”Elvis”
name
1935
born
7373
3. If subjects share a relation that is highlyinverse functional, and the object is matched,then match the subjects.
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
7474
4. If relations share many pairs,
increase their match
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
7575
4. If relations share many pairs,
increase their match
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
7676
5. Iterate
Caveat: Convergence proof only in subcase
P (r1 ⊆ r2) = 42φγ∑...P (e1 = e2)...
P (e1 ≡ e2) =∏1
42αβ...P (r1 ⊆ r2)...
Match classes, entities, & relations
”Elvis”
name label
”Elvis”
7777
6. Compute class subsumption
P (c1 ⊆ c2) = arcsin(4.1125)× P (e1 ≡ e2)× ...
Match classes, entities, & relations
singer AmericanSinger
type type
7878
Caveat: Class alignmentworks with “only” 85%precision due tospurious correlation.
6. Compute class subsumption
Match classes, entities, & relations
AmericanSingersinger
type type
7979
PARIS matches DBpedia & YAGO
• in 2 hours• with 90% accuracy
PARIS:match entities,classes,relations
80
[VLDB 2012, APWeb 2014 invited paper]
80
Miningknowledge
bases->
DeterminingIncompleteness->
A ∧B ⇒ C
Protectingknowledge bases->
Creativly extendingknowledge bases ->
Constructingknowledge bases
81
Singer
type
born
Knowledge Base Life Cycle
1935
81
People may “steal” from other ontologies without giving due credit.Most ontologies have licenses that require attribution.
stolen
Plagiarism
YAGO
cc
82
EvilPedia
82
By adding a few fake facts to the source ontology,one can prove theft in the target ontology.
Additive Watermarking
[WWW 2012 demo]83
stolen
YAGOEvilPedia
83
One can also prove theft by selectivelyremoving facts from the source ontology.
details>76
details&DIVINA>99
Subtractive Watermarking
[ISWC 2011]84
YAGO
stolen
EvilPedia
84
Subtractive Watermarking
85
YAGO
85
ratio of overlap
ratio of covered holes
≈
An Innocent KB
86
YAGOElvisPedia
86
high overlap
no holes covered
but
Probability that thishappens by chance
overlap ratio
number of holes
α :
n :
∼ (1− α)n
DIVINA>99
A Stolen KB
87
YAGO
Dieb-i-pedia
(Argument by Mauro Sozio; original paper uses chi square test)87
VectorPortalc©Ethan Hill 88
Protecting the user
Mat Honan
CC by-sa88
VectorPortalc©Ethan Hill 89
Protecting the user
Mat Honan
CC by-sa89
VectorPortalc©Ethan Hill
resetpasswordby credit card
90
Protecting the user
Mat Honan
CC by-sa90
VectorPortalc©Ethan Hill
resetpasswordby credit card
resetpasswordby backup email
91
Protecting the user
Mat Honan
CC by-sa91
VectorPortalc©Ethan Hill
resetpasswordby credit card
resetpasswordby backup email
resetpasswordby backup email
92
Protecting the user
Mat Honan
CC by-sa Mat’s story92
VectorPortalc©Ethan Hill
resetpasswordby credit card
resetpasswordby backup email
resetpasswordby backup email
Owen’s story
Two-factorauthentication
93
Protecting the user
Mat Honan
CC by-sa Mat’s story93
CC by-sa VectorPortalc©Ethan Hill
resetpasswordby credit card
resetpasswordby backup email
resetpasswordby backup email
Two-factorauthentication
Security: How many keys does a hacker need?Safety: How many keys can I afford to lose?Protection: How many keys to destroy my data?
94
Protecting the user
Mat Honan
Owen’s storyMat’s story94
gmail-destructiongmail-access
gmail-2ndfactor
gmail-
trusteddevice
phone-
authen-tication
phone-app
phone-
SMS
mac-access
dell-access
phone-possession
phone-password
phone-
simPin
gmail-password
gmail-1stfactor
gmail-recovery
code
yahoo-
data
95
Account Dependenciesgmail-data
95
https://suchanek.name/programs/divina96
DIVINADiscovering Vulnerabilities of Internet Accounts
[WWW 2015 demo]
96
Miningknowledge
bases->
DeterminingIncompleteness->
A ∧B ⇒ C
Protectingknowledge bases->
Creativly extendingknowledge bases ->
Constructingknowledge bases
97
Singer
type
born
Knowledge Base Life Cycle
1935
97
98
Combinatorial Creativity
c© BetterThanPants.com c© Vileda
+ ( - ) +
c© Avsar Arasc© Stone Mens Wear98
99
Description Logics do not work
c© Vileda
+ ( - ) +
c© Avsar Arasc© Stone Mens Wear
BabyMop ≡Romper u ∃has.(Mop u ¬∃has.Stick) u ∃has.Baby≡ ⊥
Mop ≡ Tool u ∃has.Stick u ∃has.Strings
99
100
Language for Combinatorial Creativity
c© Vileda
+ ( - ) +
c© Avsar Arasc© Stone Mens Wear
Romper + ∃has.(Mop − ∃has.Stick) + ∃has.Baby≡ BabyMop
Mop ≡ Tool u ∃has.Stick u ∃has.Strings
Mop − ∃has.> ≡ Tool u ∃StringsMop + ∃has.> ≡ Mop
Mop → ∃r.> ≡ Stick
Mop ↑ ∃has.> ≡ ∃has.Stick
Subtraction:Addition:Succession:Selection*:
100
Language for Combinatorial CreativityMop ≡ Tool u ∃has.Stick u ∃has.Strings
[ISWC 2016paper & demo]
1) Descriptive experiments 2) Generative experiments1/3 nonsense, 1/3 exists,1/3 “imaginable”
Mop − ∃has.> ≡ Tool u ∃StringsMop + ∃has.> ≡ Mop
Mop → ∃r.> ≡ Stick
Mop ↑ ∃has.> ≡ ∃has.Stick
Subtraction:Addition:Succession:Selection*:
>LeMonde
101
102
Another creative idea...
1944 2013
102
103
Mining Le Monde
time place entity
1967 USA
103
red: many foreign companies mentionedblue: few foreign companies mentioned
104
Presence of foreign companies
104
105
[AKBC 2013]
[VLDB 2014 vision]
Average age of people mentioned
105
Miningknowledge
bases->
DeterminingIncompleteness->
A ∧B ⇒ C
Protectingknowledge bases->
Creativly extendingknowledge bases ->
Constructingknowledge bases
106
Singer
type
born
Knowledge Base Life Cycle
1935
106
107
Is Elvis dead?
1977died
Elvis
???
107
100m statements95% precision-> 5m wrong statements
108
Is Elvis dead?
1977died
Elvis
???
108
Miningknowledge
bases->
DeterminingIncompleteness->
A ∧B ⇒ C
Protectingknowledge bases->
Creativly extendingknowledge bases ->
Constructingknowledge bases
109
Singer
type
born
Knowledge Base Life Cycle
1935
Slides done with Powerline, my free SVG slide editor with LaTex support109