BOA - Bootstrapping Linked Data

Daniel Gerber, Axel-Cyrille Ngonga Ngomo (AKSW, Universität Leipzig)

description

Most knowledge sources on the Data Web were extracted from structured or semi-structured data. Thus, they encompass only a small fraction of the information available on the document-oriented Web. In this paper, we present BOA, an iterative bootstrapping strategy for extracting RDF from unstructured data. The idea behind BOA is to use the Data Web as background knowledge for the extraction of natural language patterns that represent predicates found on the Data Web. These patterns are used to extract instance knowledge from natural language text. This knowledge is finally fed back into the Data Web, thereby closing the loop. We evaluate our approach on two data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with very high accuracy. Moreover, we provide the first repository of natural language representations of predicates found on the Data Web.

Transcript of BOA - Bootstrapping Linked Data

Page 1: BOA - Bootstrapping Linked Data

AKSW, Universität Leipzig

Daniel Gerber, Axel-Cyrille Ngonga Ngomo

Page 2: BOA - Bootstrapping Linked Data

Bootstrapping the Data Web

WeKEx@ISWC - http://boa.aksw.org - 17.01.2012

Motivation

๏ Most knowledge bases are extracted from (semi-)structured data

๏ Only 15-20% of information is available as structured data

๏ Semantic Web ⬌ Document Web

๏ How can we extract data from the document-oriented web?

Page 3: BOA - Bootstrapping Linked Data

Idea I

dbpedia:Barack_Obama dbpedia-owl:birthPlace dbpedia:Honolulu,_Hawaii

dbpedia:Barack_Obama dbpedia-owl:party dbpedia:Democratic_Party

dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama

Page 4: BOA - Bootstrapping Linked Data

Idea II

Barack Obama was born in Honolulu, Hawaii.

Barack Hussein Obama is a politician of the Democratic Party.

Obama married Michelle Robinson in 1992.

is a politician of the

married

was born in

Page 5: BOA - Bootstrapping Linked Data

Idea III

is a politician of the

married

was born in

Joseph Martin "Joschka" Fischer (born 1948-04-12) is a politician of the German Green Party.

Dietrich's only child, Maria Elisabeth Sieber, was born in Berlin on 13 December 1924.

Jackie Bouvier Kennedy Onassis who married John F. Kennedy was tied to the Auchinclosses via her sister's marriage into the Auchincloss family.

Page 6: BOA - Bootstrapping Linked Data

Related Work

๏ ReadTheWeb Project: N(ever) E(nding) L(anguage) L(earner)

๏ PROSPERA: Scalable Knowledge Harvesting with High Precision and High Recall

Page 7: BOA - Bootstrapping Linked Data

The BOA approach

[Figure: BOA architecture. Components shown: Web, Corpus Extraction (Crawler, Cleaner, Indexer), Corpora, Data Web, SPARQL, Background Knowledge, Knowledge Acquisition, Pattern Search, Filtering, Pattern Scoring, Patterns, RDF Generation, Use in next iteration.]

Page 8: BOA - Bootstrapping Linked Data

Knowledge acquisition

SELECT ?x ?xLabel ?prop ?y ?yLabel ?domain ?range
WHERE {
  ?x rdf:type dbpedia-owl:[Organisation|Person|Place] .
  ?x rdfs:label ?xLabel .
  ?y rdfs:label ?yLabel .
  [?y ?prop ?x | ?x ?prop ?y] .
  FILTER ( lang(?xLabel) = 'en' && lang(?yLabel) = 'en' ) .
  ?prop rdfs:range ?range .
  ?prop rdfs:domain ?domain .
}

Example result:
?x       http://dbpedia.org/resource/Google
?xLabel  “Google”
?prop    http://dbpedia.org/ontology/subsidiary
?y       http://dbpedia.org/resource/YouTube
?yLabel  “Youtube”
?domain  http://dbpedia.org/ontology/Company
?range   http://dbpedia.org/ontology/Company
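The bracketed parts of the query read like alternatives rather than SPARQL syntax. Under that assumption, the following Python sketch expands the template into one concrete query per class and triple direction. The names `TEMPLATE`, `CLASSES`, `DIRECTIONS` and `build_queries` are made up for this illustration, and the prefix declarations (`rdf:`, `rdfs:`, `dbpedia-owl:`) are left out, as on the slide.

```python
from itertools import product

# Query text taken from the slide. {cls} and {triple} stand for the two
# bracketed alternatives; reading them as alternatives is an assumption.
TEMPLATE = """\
SELECT ?x ?xLabel ?prop ?y ?yLabel ?domain ?range
WHERE {{
  ?x rdf:type dbpedia-owl:{cls} .
  ?x rdfs:label ?xLabel .
  ?y rdfs:label ?yLabel .
  {triple} .
  FILTER ( lang(?xLabel) = 'en' && lang(?yLabel) = 'en' ) .
  ?prop rdfs:range ?range .
  ?prop rdfs:domain ?domain .
}}"""

CLASSES = ("Organisation", "Person", "Place")
DIRECTIONS = ("?y ?prop ?x", "?x ?prop ?y")  # ?x as object, ?x as subject


def build_queries():
    """Return one concrete query string per (class, direction) combination."""
    return [
        TEMPLATE.format(cls=cls, triple=triple)
        for cls, triple in product(CLASSES, DIRECTIONS)
    ]


queries = build_queries()
print(len(queries))
print(queries[0])
```

Each resulting string could then be sent to a SPARQL endpoint; how BOA actually issues the queries is not stated on the slide.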

Page 9: BOA - Bootstrapping Linked Data

Pattern Search

(1) Set of entities s and o connected through p
(2) Find all sentences which contain s and o
(3) Replace labels with variables (?D?, ?R?)

BOA pattern:

dbpedia-owl:spouse: “?D? with his wife ?R?”

BOA pattern mapping:

dbpedia-owl:spouse: “?D? with his wife ?R?”

dbpedia-owl:spouse: “?D? and his wife ?R?”

dbpedia-owl:spouse: “?D? and her husband ?R?”
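The three pattern-search steps can be sketched in a few lines of Python. This is a minimal illustration, not BOA's implementation: it matches labels by plain substring search, keeps only the text between the two labels, and assumes that ?D? marks the subject label and ?R? the object label. The function name `extract_patterns` is hypothetical.

```python
def extract_patterns(sentences, subject_label, object_label):
    """Return one pattern per sentence that mentions both labels.

    The subject label becomes ?D?, the object label becomes ?R?, and only
    the text between the two labels is kept.
    """
    patterns = []
    for sentence in sentences:
        s_pos = sentence.find(subject_label)
        o_pos = sentence.find(object_label)
        if s_pos == -1 or o_pos == -1:
            continue  # step (2): both labels must occur in the sentence
        # step (3): replace the labels with variables
        if s_pos < o_pos:
            between = sentence[s_pos + len(subject_label):o_pos]
            patterns.append(f"?D?{between}?R?")
        else:
            between = sentence[o_pos + len(object_label):s_pos]
            patterns.append(f"?R?{between}?D?")
    return patterns


# Sentences from the "Idea II" slide; only the first mentions both labels.
sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Obama married Michelle Robinson in 1992.",
]
print(extract_patterns(sentences, "Barack Obama", "Honolulu, Hawaii"))
```

A real system would also need to handle label variants ("Obama" vs. "Barack Obama") and overlapping labels, which this sketch ignores.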

Page 10: BOA - Bootstrapping Linked Data

Pattern Scoring - Support

Support: a pattern should be used across several triples in the background knowledge

subsidiary ↣ “?R? was acquired by ?D?”

๏ [Google, DoubleClick] ↣ 2

๏ [General Motors, Opel] ↣ 1

๏ [Cablevision, Rainbow Media] ↣ 4
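To make the support intuition concrete, the sketch below turns the per-pair counts above into a single number. The slide states only the intuition, so the way the counts are combined here (logarithm of the highest count times logarithm of the number of distinct pairs, each shifted by one) is an assumption of this example and should not be read as BOA's scoring formula. The function name `support` is hypothetical.

```python
import math


def support(pair_counts):
    """Toy support score for one pattern of one property.

    pair_counts maps (subject, object) pairs from the background knowledge
    to the number of sentences in which the pattern connected them.
    The log-based combination is an illustrative assumption.
    """
    if not pair_counts:
        return 0.0
    distinct_pairs = len(pair_counts)
    max_count = max(pair_counts.values())
    return math.log(1 + max_count) * math.log(1 + distinct_pairs)


# Counts from the slide for: subsidiary -> "?R? was acquired by ?D?"
counts = {
    ("Google", "DoubleClick"): 2,
    ("General Motors", "Opel"): 1,
    ("Cablevision", "Rainbow Media"): 4,
}
print(support(counts))
```

With this choice, a pattern seen with only one pair scores lower than one seen with several pairs, which matches the stated intuition.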

Page 11: BOA - Bootstrapping Linked Data

Pattern Scoring - Specificity

Specificity: a pattern should not be used by many pattern mappings

๏ subsidiary: “?D? agreed to buy ?R?”

๏ subsidiary: “?R? is a part of ?D?”

๏ foundationOrganisation: “?R? is a part of ?D?”
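One way to quantify specificity is an inverse-document-frequency style score: the more pattern mappings a pattern appears in, the lower its value. The following sketch applies that idea to the three examples above. The slide gives only the intuition, so the exact formula and the function name `specificity` are assumptions of this illustration.

```python
import math


def specificity(pattern, pattern_mappings):
    """IDF-style score: log(number of mappings / mappings containing pattern)."""
    containing = sum(
        1 for patterns in pattern_mappings.values() if pattern in patterns
    )
    if containing == 0:
        raise ValueError("pattern does not occur in any mapping")
    return math.log(len(pattern_mappings) / containing)


# Pattern mappings from the slide (property -> set of patterns).
mappings = {
    "subsidiary": {"?D? agreed to buy ?R?", "?R? is a part of ?D?"},
    "foundationOrganisation": {"?R? is a part of ?D?"},
}
print(specificity("?D? agreed to buy ?R?", mappings))
print(specificity("?R? is a part of ?D?", mappings))
```

In this toy setting "?R? is a part of ?D?" occurs in every mapping and therefore receives the lowest possible score, while "?D? agreed to buy ?R?" is specific to subsidiary.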

Page 12: BOA - Bootstrapping Linked Data

Pattern Scoring - Typicity

Typicity: a pattern should connect entities of the correct type

๏ Hypercom was acquired by Verifone .

๏ Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O

๏ Maktoob was acquired by Yahoo!

๏ Maktoob_PER was_O acquired_O by_O Yahoo_ORG ._O
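The typicity check on the two tagged sentences above can be sketched as follows. The example assumes that the pattern under test is "?R? was acquired by ?D?", that both arguments are expected to carry the tag ORG, and that entities are single tokens; it scores a pattern as the fraction of matches whose argument tags are correct. All function names are hypothetical, and the actual BOA typicity measure may be defined differently.

```python
def parse_tagged(tagged_sentence):
    """Split 'Word_TAG Word_TAG ...' into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in tagged_sentence.split()]


def argument_tags(tokens, pattern):
    """Return the tags of the tokens next to the pattern text, or None."""
    parts = pattern.split()
    first, middle, last = parts[0], parts[1:-1], parts[-1]
    words = [word for word, _ in tokens]
    for i in range(1, len(words) - len(middle)):
        if words[i:i + len(middle)] == middle:
            return {first: tokens[i - 1][1], last: tokens[i + len(middle)][1]}
    return None


def typicity(tagged_sentences, pattern, domain_tag, range_tag):
    """Fraction of pattern matches whose ?D?/?R? tags have the expected type."""
    matches = correct = 0
    for sentence in tagged_sentences:
        tags = argument_tags(parse_tagged(sentence), pattern)
        if tags is None:
            continue
        matches += 1
        if tags["?D?"] == domain_tag and tags["?R?"] == range_tag:
            correct += 1
    return correct / matches if matches else 0.0


tagged = [
    "Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O",
    "Maktoob_PER was_O acquired_O by_O Yahoo_ORG ._O",
]
print(typicity(tagged, "?R? was acquired by ?D?", "ORG", "ORG"))
```

The second sentence lowers the score because the tagger labelled Maktoob as a person, so the ?R? argument does not have the expected type.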

Page 13: BOA - Bootstrapping Linked Data

RDF Generation

[Figure: RDF generation example. Several elements of the resulting graph are marked NEW in the figure.]

Sentence: Pacheco arrived with his wife Leyla Rodriguez Stahl and several...

NER-tagged: Pacheco_PER arrived_O with_O his_O wife_O Leyla_PER Rodriguez_PER Stahl_PER and_O

Pattern: ?D? with his wife ?R?

Property: dbpedia-owl:spouse

Resources: dbpedia:Abel_Pacheco, boa:Leyla_Rodriguez_Stahl

rdfs:label: "Abel Pacheco"@en, "Leyla Rodriguez Stahl"@en

rdf:type: dbpedia-owl:Person (both resources)
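The RDF generation step for this example can be sketched in Python as below. The sketch assumes that ?D? fills the subject position and ?R? the object position, that known labels are looked up in a simple dictionary, and that unknown entities receive a new name in the boa: namespace. Resolving the mention "Pacheco" to the label "Abel Pacheco" is outside its scope. The function names `mint_uri` and `generate_triples` are hypothetical.

```python
def mint_uri(label, known_resources):
    """Reuse a known resource for the label, otherwise mint a boa: name."""
    if label in known_resources:
        return known_resources[label]
    return "boa:" + label.replace(" ", "_")


def generate_triples(domain_label, range_label, prop, entity_class, known_resources):
    """Build the property triple plus label and type triples for both entities."""
    subject = mint_uri(domain_label, known_resources)
    obj = mint_uri(range_label, known_resources)
    triples = [(subject, prop, obj)]
    for uri, label in ((subject, domain_label), (obj, range_label)):
        triples.append((uri, "rdfs:label", f'"{label}"@en'))
        triples.append((uri, "rdf:type", entity_class))
    return triples


# Only Abel Pacheco is assumed to be known in the background knowledge.
known = {"Abel Pacheco": "dbpedia:Abel_Pacheco"}
for triple in generate_triples(
    "Abel Pacheco",
    "Leyla Rodriguez Stahl",
    "dbpedia-owl:spouse",
    "dbpedia-owl:Person",
    known,
):
    print(" ".join(triple), ".")
```

The output uses prefixed names only; the full namespace behind boa: is not given on the slide and is therefore not expanded here.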

Page 14: BOA - Bootstrapping Linked Data

Evaluation I

                en-wiki              en-news
Language        English              English
Topic           general knowledge    news
# of lines      44.7M                256.1M
# of words      1,032.1M             5,068.7M

[Figure: number of triples per class (Place, Person, Organisation), split by whether the entity is subject or object. Values shown: 137990, 158697, 327430, 64239, 551693, 72820. Properties listed: riverMouth, musicalArtist, musicalBand, award, writer, almaMater, occupation, formerTeam, deathPlace, birthPlace.]

Page 15: BOA - Bootstrapping Linked Data

Evaluation II

                          en-wiki                       en-news
                          LOC       PER       ORG       LOC       PER       ORG
Triples extracted         1465      8817      2567      488       903       916
Triples in DBpedia        138       183       48        52        44        7
Evaluated triples         100 (8)   100 (1)   100 (1)   100 (1)   100 (7)   100 (0)
Precision (%)             90.5      97        99        61.5      73.5      91
New true statements*      1200      8375      2494      268       631       827
Found pattern mappings    62        72        59        49        70        55
Found patterns            123k      136k      38k       569k      465k      92k
Scored patterns           1045      612       241       3832      7294      1077

* Number of extracted statements not found in DBpedia, multiplied by the precision of our approach
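The footnote can be checked against the table with a short calculation. The sketch below reads "not found in DBpedia" as "triples extracted" minus "triples in DBpedia", which is an assumption about how the row was derived. Under that reading the computed values agree with the reported row to within one statement, and the mean of the three en-wiki precision values is 95.5.

```python
# (triples extracted, triples in DBpedia, precision in %, reported new true statements)
ROWS = {
    "en-wiki LOC": (1465, 138, 90.5, 1200),
    "en-wiki PER": (8817, 183, 97.0, 8375),
    "en-wiki ORG": (2567, 48, 99.0, 2494),
    "en-news LOC": (488, 52, 61.5, 268),
    "en-news PER": (903, 44, 73.5, 631),
    "en-news ORG": (916, 7, 91.0, 827),
}


def new_true_statements(extracted, in_dbpedia, precision_percent):
    """Estimate: statements not already in DBpedia, scaled by precision."""
    return (extracted - in_dbpedia) * precision_percent / 100


for name, (extracted, in_dbpedia, precision, reported) in ROWS.items():
    estimate = new_true_statements(extracted, in_dbpedia, precision)
    print(f"{name}: {estimate:.1f} (reported: {reported})")

# Mean precision over the three en-wiki classes.
wiki_precisions = [row[2] for name, row in ROWS.items() if name.startswith("en-wiki")]
mean_wiki_precision = sum(wiki_precisions) / len(wiki_precisions)
print(f"mean en-wiki precision: {mean_wiki_precision:.1f}")
```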

Page 16: BOA - Bootstrapping Linked Data

Future Work

๏ Iteration 1+

๏ Human feedback

๏ Pattern generalization

๏ Datatype Properties

๏ Languages/Corpora

๏ Webservices

Page 17: BOA - Bootstrapping Linked Data

Conclusion

๏ No manually created seed patterns needed

๏ 95.5% Precision on DBpedia/Wikipedia

๏ Output easily integrable in LOD Cloud

๏ Library of natural-language representations of formal relations, Demo

๏ Quasi language independent (German/Korean)

Page 18: BOA - Bootstrapping Linked Data

Thank you! Questions?

Daniel Gerber
Johannisgasse 26, Room 5-21
04103 Leipzig, Germany
SIMBA@AKSW
http://bis.informatik.uni-leipzig.de/DanielGerber
http://boa.aksw.org
http://code.google.com/p/boa