Generating Linked Data by Inferring the Semantics of Tables

Generating Linked Data by Inferring the Semantics of Tables, Varish Mulwad, Ph.D. 2015, http://ebiq.org/j/96

Transcript of Generating Linked Data by Inferring the Semantics of Tables

Page 1: Generating Linked Data by Inferring the Semantics of Tables

Generating Linked Data by Inferring the Semantics of Tables

Varish Mulwad, Ph.D. 2015, http://ebiq.org/j/96

Page 2: Generating Linked Data by Inferring the Semantics of Tables

Goal: Table => LOD*

Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11

Annotations shown on the table:

http://dbpedia.org/class/yago/NationalBasketballAssociationTeams

http://dbpedia.org/resource/Allen_Iverson

Player height in meters

dbprop:team

* DBpedia

Page 3: Generating Linked Data by Inferring the Semantics of Tables

Goal: Table => LOD*

Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11

RDF Linked Data:

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbo:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:MichaelJordan .
dbpedia:MichaelJordan a dbo:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:ChicagoBulls .
dbpedia:ChicagoBulls a yago:NationalBasketballAssociationTeams .

All this in a completely automated way

* DBpedia
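The mapping from annotated table to RDF can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in the talk: the function name `table_to_turtle`, the input dictionaries, and the resource names `dbpedia:Michael_Jordan` and `dbpedia:Chicago_Bulls` are chosen for the example, and the output uses plain Turtle subject-predicate-object statements rather than the "is ... of" form shown on the slide.

```python
# Minimal sketch: turn column-class and cell-entity annotations into Turtle text.
# All names below (function, arguments, resource CURIEs) are assumptions for illustration.

def table_to_turtle(header_classes, cell_entities):
    """header_classes: {column header: class CURIE}
    cell_entities: {cell string: (entity CURIE, class CURIE)}"""
    lines = [
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix dbpedia: <http://dbpedia.org/resource/> .",
        "@prefix dbo: <http://dbpedia.org/ontology/> .",
        "@prefix yago: <http://dbpedia.org/class/yago/> .",
    ]
    # One label statement per annotated column header.
    for header, cls in header_classes.items():
        lines.append(f'{cls} rdfs:label "{header}"@en .')
    # One label statement and one type statement per linked cell.
    for cell, (entity, cls) in cell_entities.items():
        lines.append(f'{entity} rdfs:label "{cell}"@en .')
        lines.append(f"{entity} a {cls} .")
    return "\n".join(lines)


turtle = table_to_turtle(
    {
        "Name": "dbo:BasketballPlayer",
        "Team": "yago:NationalBasketballAssociationTeams",
    },
    {
        "Michael Jordan": ("dbpedia:Michael_Jordan", "dbo:BasketballPlayer"),
        "Chicago Bulls": (
            "dbpedia:Chicago_Bulls",
            "yago:NationalBasketballAssociationTeams",
        ),
    },
)
print(turtle)
```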

Page 4: Generating Linked Data by Inferring the Semantics of Tables

Tables are everywhere!! … yet …

The web – 154 million high quality relational tables

Page 5: Generating Linked Data by Inferring the Semantics of Tables

Evidence-based medicine

Figure: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010

Evidence-based medicine judges the efficacy of treatments or tests by meta-analyses of clinical trials. Key information is often found in tables in articles.

However, the rate at which meta-analyses are published remains very low … hampers effective health care treatment …

Chart labels: # of clinical trials published in 2008; # of meta-analyses published in 2008

Page 6: Generating Linked Data by Inferring the Semantics of Tables

~400,000 datasets, ~ <1% in RDF

Page 7: Generating Linked Data by Inferring the Semantics of Tables

2010 Preliminary System

T2LD Framework:

• Predict class for columns
• Linking the table cells
• Identify and discover relations

Class prediction for column: 77%
Entity linking for table cells: 66%

Examples of class label prediction results:
Column – Nationality, Prediction – MilitaryConflict
Column – Birth Place, Prediction – PopulatedPlace

Page 8: Generating Linked Data by Inferring the Semantics of Tables

Sources of Errors

• The sequential approach let errors percolate from one phase to the next
• The system was biased toward predicting overly general classes over more appropriate specific ones
• Heuristics largely drive the system
• Although we consider multiple sources of evidence, we did not perform joint assignment

Page 9: Generating Linked Data by Inferring the Semantics of Tables

A Domain Independent Framework

Pre-processing modules: Sampling, Acronym detection

1. Query and generate initial mappings

2. Joint Inference/Assignment

Generate Linked RDF

Verify (optional)

Store in a knowledge base & publish as LOD

Page 10: Generating Linked Data by Inferring the Semantics of Tables

Query Mechanism

Example row: Michael Jordan | Chicago Bulls | Shooting Guard | 1.98

Column header: Team

Possible types: {dbo:Place, dbo:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams …}

Possible entities: Chicago Bulls, Chicago, Judy Chicago … 
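The query step can be illustrated with a toy lookup. The sketch below is an assumption-laden stand-in: `TOY_KB`, `query_candidates`, the token-overlap matching rule, and the assignment of types to individual entities are all invented for the example; the talk does not say how the knowledge base is actually queried.

```python
# Minimal sketch of candidate generation against a tiny in-memory knowledge base.
# TOY_KB and the token-overlap rule are assumptions, not the system's real query mechanism.

TOY_KB = {
    "Chicago Bulls": ["yago:NationalBasketballAssociationTeams"],
    "Chicago": ["dbo:Place", "dbo:City"],
    "Judy Chicago": ["yago:WomenArtist", "yago:LivingPeople"],
}


def query_candidates(cell_text, kb=TOY_KB):
    """Return (possible entities, possible types) for one table cell string."""
    tokens = set(cell_text.lower().split())
    # An entity is a candidate if its name shares at least one token with the cell.
    entities = [name for name in kb if tokens & set(name.lower().split())]
    # The possible types for the column are the union of the candidates' types.
    types = []
    for name in entities:
        for type_name in kb[name]:
            if type_name not in types:
                types.append(type_name)
    return entities, types


entities, types = query_candidates("Chicago")
print(entities)
print(types)
```

Even in this toy setting, one ambiguous cell string yields a city, a team, and a person, which is why the candidates need to be ranked and then resolved jointly.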

Page 11: Generating Linked Data by Inferring the Semantics of Tables

Ranking the candidates

String in column header vs. class from an ontology:

• String similarity metrics

Page 12: Generating Linked Data by Inferring the Semantics of Tables

Ranking the candidates

String in table cell vs. entity from the knowledge base (KB):

• String similarity metrics
• Popularity metrics
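A ranking that combines a string similarity metric with a popularity metric might look like the sketch below. The specific choices are assumptions made for illustration: `difflib.SequenceMatcher` stands in for whatever similarity metrics the system uses, the popularity counts are made-up numbers, and the equal weighting is arbitrary.

```python
# Minimal sketch: rank candidate entities for a cell by similarity and popularity.
# The metric, the weights, and the popularity counts are illustrative assumptions.
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_entities(cell_text, candidates, popularity, weight=0.5):
    """Sort candidates by a weighted sum of similarity and normalized popularity."""
    max_pop = max(popularity[name] for name in candidates)
    scored = []
    for name in candidates:
        score = (weight * similarity(cell_text, name)
                 + (1 - weight) * popularity[name] / max_pop)
        scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]


# Hypothetical popularity counts (for example, incoming link counts).
toy_popularity = {"Chicago": 1000, "Chicago Bulls": 400, "Judy Chicago": 50}
ranked = rank_entities(
    "Chicago", ["Chicago Bulls", "Chicago", "Judy Chicago"], toy_popularity
)
print(ranked)
```

With these toy numbers the city ranks first for the cell "Chicago" when the cell is scored in isolation, even though the column header is "Team"; a local ranking of this kind is only an initial mapping.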

Page 13: Generating Linked Data by Inferring the Semantics of Tables

Joint inference over evidence in a table

✓ Probabilistic Graphical Models

Page 14: Generating Linked Data by Inferring the Semantics of Tables

A graphical model for tables: joint inference over evidence in a table

Figure: variable nodes C1, C2, C3 (class level; example column header "Team") and R11, R12, R13, R21, R22, R23, R31, R32, R33 (instance level; example cell values Chicago, Philadelphia, Houston, San Antonio).

Page 15: Generating Linked Data by Inferring the Semantics of Tables

Parameterized graphical model

Figure: factor graph over the table.

Variable nodes: column headers C1, C2, C3; row values R11, R12, R13, R21, R22, R23, R31, R32, R33

Factor nodes: 𝝍𝟑 (three instances), 𝝍𝟒 (three instances), 𝝍𝟓 (one instance)

Factor descriptions in the legend:

• Function that captures the affinity between the column headers and row values
• Captures interaction between row values
• Captures interaction between column headers
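The structure of such a factor graph can be written down explicitly. The sketch below rests on assumptions the transcript does not confirm: that psi3 links each column header to the cells of its column, psi4 links the cells within one row, psi5 links all column headers, and that `Rij` denotes column i, row j. The function `build_factor_graph` is a hypothetical helper; it only builds the node and factor lists and performs no inference.

```python
# Minimal sketch: enumerate variable nodes and factor scopes for a table.
# The psi3/psi4/psi5 scopes and the Rij index convention are assumptions.

def build_factor_graph(n_cols, n_rows):
    columns = [f"C{i}" for i in range(1, n_cols + 1)]
    cells = {
        (i, j): f"R{i}{j}"
        for i in range(1, n_cols + 1)
        for j in range(1, n_rows + 1)
    }
    factors = []
    # psi3: one factor per column, tying the header to the cells in that column.
    for i in range(1, n_cols + 1):
        scope = [columns[i - 1]] + [cells[(i, j)] for j in range(1, n_rows + 1)]
        factors.append(("psi3", scope))
    # psi4: one factor per row, tying together the cells in that row.
    for j in range(1, n_rows + 1):
        factors.append(("psi4", [cells[(i, j)] for i in range(1, n_cols + 1)]))
    # psi5: a single factor over all column headers.
    factors.append(("psi5", list(columns)))
    return columns, list(cells.values()), factors


columns, cell_nodes, factors = build_factor_graph(n_cols=3, n_rows=3)
for name, scope in factors:
    print(name, scope)
```

For a 3 x 3 table this yields three psi3 factors, three psi4 factors, and one psi5 factor, matching the counts visible on the slide.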

Page 16: Generating Linked Data by Inferring the Semantics of Tables

Challenge: Interpreting Literals

Many columns have literals, e.g., numbers

Column "Population": 690,000; 345,000; 510,020; 120,000 (Population? Profit in $K?)

Column "Age": 75; 65; 50; 25 (Age in years? Percent?)

• Predict properties based on cell values
• Cyc had hand coded rules: humans don’t live past 120
• We extract value distributions from LOD resources
  • Differ for subclasses: age of people vs. political leaders vs. athletes
  • Represent as measurements: value + units
• Metric: possibility/probability of values given distribution
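Scoring a numeric column against value distributions can be sketched as follows. Everything specific here is an assumption for illustration: the sample values in `TOY_DISTRIBUTIONS` are made up rather than extracted from LOD resources, a normal distribution is used as a simple stand-in for the real distributions, and the score is the average log density of the column's values.

```python
# Minimal sketch: pick the property whose value distribution best explains a column.
# The sample values and the Gaussian model are illustrative assumptions.
import math
from statistics import mean, stdev

TOY_DISTRIBUTIONS = {
    "population": [120_000, 300_000, 500_000, 700_000, 900_000],
    "age in years": [20, 35, 50, 65, 80],
}


def log_density(x, mu, sigma):
    """Log of the normal density; computed in log space to avoid underflow."""
    return (-math.log(sigma * math.sqrt(2 * math.pi))
            - ((x - mu) ** 2) / (2 * sigma ** 2))


def score_property(values, samples):
    """Average log density of the column values under the fitted distribution."""
    mu, sigma = mean(samples), stdev(samples)
    return sum(log_density(v, mu, sigma) for v in values) / len(values)


def best_property(values, distributions=TOY_DISTRIBUTIONS):
    return max(distributions, key=lambda p: score_property(values, distributions[p]))


population_column = [690_000, 345_000, 510_020, 120_000]
age_column = [75, 65, 50, 25]
print(best_property(population_column))
print(best_property(age_column))
```

Distinguishing "age in years" from "percent" would need more than this, since both cover similar ranges; that is where per-subclass distributions and units become relevant.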

Page 17: Generating Linked Data by Inferring the Semantics of Tables

Other Challenges

• Using table captions and other text in associated documents to provide context
• Size of some data.gov tables (> 400K rows!) makes using full graphical model impractical
  – Sample table and run model on the subset
• Achieving acceptable accuracy may require human input
  – 100% accuracy unattainable automatically
  – How best to let humans offer advice and/or correct interpretations?
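Running the model on a subset of a large table can be sketched with a simple row sampler. The slide does not say how rows are chosen; uniform random sampling with a fixed seed, the function name `sample_rows`, and the sample size of 1,000 are assumptions made for this example.

```python
# Minimal sketch: draw a reproducible uniform sample of rows from a large table.
# Uniform sampling, the seed, and the sample size are illustrative assumptions.
import random


def sample_rows(rows, k, seed=0):
    """Return at most k rows; small tables are returned unchanged."""
    if len(rows) <= k:
        return list(rows)
    rng = random.Random(seed)  # local generator keeps the sample reproducible
    return rng.sample(rows, k)


# A stand-in for a 400K-row table: each row is a (row id, value) pair.
big_table = [(i, f"value-{i}") for i in range(400_000)]
subset = sample_rows(big_table, k=1000)
print(len(subset))
```

The model would then be run on `subset`, with the resulting column-level interpretations applied to the full table.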