Page 1: Extending DBpedia with Wikipedia List Pages

10/22/13 Heiko Paulheim, Simone Paolo Ponzetto 1

Extending DBpedia with Wikipedia List Pages

Heiko Paulheim, Simone Paolo Ponzetto

Page 2: Extending DBpedia with Wikipedia List Pages

Disclaimer

• This presentation shows an idea

– after all, it says “position paper”

– We don't know if it works!

– (but we are quite confident)

Page 3: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

• Wikipedia loves lists

• As of June 2013, there are almost 600,000 list pages

• Lists organize Wikipedia pages

– that correspond to DBpedia instances

• Example:

– List of African-American writers

Page 4: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

Page 5: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

• Different types of lists

– simple bullet point lists

– broken bullet point lists (i.e., different sections)

• sometimes, the sections are semantically meaningful

– tables

– ...

[Figure legend: Simple Bullet List, Broken Bullet List, Table, Other]
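A minimal sketch of how linked instances might be pulled from simple and broken bullet lists. It assumes the page is available as raw wikitext, that each bullet's first wiki link is the list member, and that section headings mark the breaks; the function name `extract_bullet_links` and the sample text are hypothetical, and tables are not handled.

```python
import re

# First [[Target]] or [[Target|label]] link on a line (target only).
LINK = re.compile(r"\[\[([^\]|#]+)")
# Wikitext section heading such as "== Poets ==".
HEADING = re.compile(r"^=+\s*(.*?)\s*=+\s*$")


def extract_bullet_links(wikitext):
    """Return (section, instance) pairs for every bullet that starts with '*'.

    For a simple bullet list the section is None for all entries; for a
    broken bullet list it is the heading the bullet appears under.
    """
    section = None
    result = []
    for line in wikitext.splitlines():
        heading = HEADING.match(line)
        if heading:
            section = heading.group(1)
            continue
        if line.startswith("*"):
            link = LINK.search(line)
            if link:
                # DBpedia-style identifier: spaces become underscores.
                result.append((section, link.group(1).strip().replace(" ", "_")))
    return result


sample = "\n".join([
    "== A ==",
    "* [[Maya Angelou]], poet",
    "* [[James Baldwin|Baldwin]] (1924-1987)",
    "== B ==",
    "* [[Octavia E. Butler]]",
    "Prose mentioning [[New York City]] is ignored.",
])
pairs = extract_bullet_links(sample)
```

Keeping the section next to each instance leaves room to use semantically meaningful sections later on.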

Page 6: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

• What information is in a list?

– the linked things have the same “type”

• The type can be a complex construct

– e.g., Writer ⊓ ∀nationality.{United States} ⊓ ∀ethnicity.{African American}

• Sometimes, there are additional pieces of information

– e.g., birth dates for persons

Page 7: Extending DBpedia with Wikipedia List Pages

Extracting Information from Lists

• Goal:

– find the common characteristics of all things in the list

• Example: African-American writers

– all instances are writers

– all instances have nationality=United_States

– all instances have ethnicity=African_American

• Information in DBpedia is far from complete

– makes extraction difficult

– but: big potential to add information to DBpedia

[Slide annotations: 25%, 12%, 3%]
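Once the common characteristics of a list are known, the "potential to add information" amounts to finding which list members lack which statements. The following is a rough sketch under that reading; `missing_statements` and the placeholder instance names are hypothetical and not taken from the slides.

```python
def missing_statements(members, characteristics):
    """Map each list member to the common characteristics it does not carry yet.

    members: dict of instance -> set of (property, value) pairs known so far.
    characteristics: iterable of (property, value) pairs shared by the list.
    """
    required = set(characteristics)
    gaps = {}
    for instance, known in members.items():
        missing = required - known
        if missing:
            gaps[instance] = missing
    return gaps


# Characteristics of the example list, as named on the slide.
common = {
    ("type", "Writer"),
    ("nationality", "United_States"),
    ("ethnicity", "African_American"),
}
# Placeholder members with incomplete data (illustrative only).
members = {
    "Member_1": {("type", "Writer"), ("nationality", "United_States")},
    "Member_2": set(),
    "Member_3": set(common),
}
gaps = missing_statements(members, common)
```

Each entry in `gaps` is a candidate statement to add, provided the characteristics were identified correctly.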

Page 8: Extending DBpedia with Wikipedia List Pages

Extracting Information from Lists

• Possible approach: finding characteristics with high TF-IDF

– TF: percentage of instances in the list that carry characteristic

– IDF: 1 / (percentage of all DBpedia instances that carry characteristic)

• Rationale: only going by frequency would rate owl:Thing the highest

• Example: African-American writers

– type=Writer: 0.608 (maximal across all possible classes)

– nationality=United_States: 0.277

– ethnicity=African_American: 0.127

• But:

– deathPlace=New_York_City: 0.157 :-(
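The scoring idea can be sketched as follows, using the TF and IDF definitions from this slide. The slides do not state how the two parts are combined or normalized, so the plain product used here is an assumption and will not reproduce the numbers above; `tf_idf_scores` and the toy data are hypothetical.

```python
from collections import Counter


def tf_idf_scores(list_members, all_instances):
    """Score each characteristic seen on the list.

    Both arguments map instance -> set of (property, value) pairs, and the
    list members are assumed to be a subset of all_instances.
    TF  = share of list members carrying the characteristic.
    IDF = 1 / (share of all instances carrying the characteristic).
    """
    list_counts = Counter(c for chars in list_members.values() for c in chars)
    global_counts = Counter(c for chars in all_instances.values() for c in chars)
    scores = {}
    for characteristic, count in list_counts.items():
        if global_counts[characteristic] == 0:
            continue  # only possible if the subset assumption is violated
        tf = count / len(list_members)
        idf = len(all_instances) / global_counts[characteristic]
        scores[characteristic] = tf * idf  # assumed combination: plain product
    return scores


THING = ("type", "owl:Thing")
WRITER = ("type", "Writer")
# Toy data: ten instances, all typed owl:Thing, three of them typed Writer.
all_instances = {name: {THING} for name in "abcdefghij"}
for name in "abe":
    all_instances[name].add(WRITER)
list_members = {name: all_instances[name] for name in "abcd"}
scores = tf_idf_scores(list_members, all_instances)
```

In the toy data, owl:Thing is carried by every list member but also by every instance overall, so it scores lower than Writer, which matches the rationale given on the slide.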

Page 9: Extending DBpedia with Wikipedia List Pages

Extracting Information from Lists

• Example: African-American writers

– ethnicity=African_American: 0.127

– deathPlace=New_York_City: 0.157

• Exploit further information from list page

– e.g., wiki:African_American is linked from page, New_York_City is not

– e.g., analyze list page title, e.g., using DBpedia Spotlight

• African_American is recognized as an entity
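One way the page evidence could be combined with the statistical score is a simple multiplicative boost, sketched below. The boost factor of 2.0 and the function name `rerank_with_page_evidence` are assumptions; the set of title entities is passed in as plain data rather than obtained from an actual DBpedia Spotlight call.

```python
def rerank_with_page_evidence(scores, linked_entities, title_entities, boost=2.0):
    """Boost characteristics whose value is linked from the list page or
    recognized in its title, then sort by the adjusted score (descending)."""
    adjusted = {}
    for (prop, value), score in scores.items():
        factor = 1.0
        if value in linked_entities:
            factor *= boost
        if value in title_entities:
            factor *= boost
        adjusted[(prop, value)] = score * factor
    return sorted(adjusted.items(), key=lambda item: item[1], reverse=True)


# Scores taken from the slide; the evidence sets mirror its two examples.
scores = {
    ("ethnicity", "African_American"): 0.127,
    ("deathPlace", "New_York_City"): 0.157,
}
ranking = rerank_with_page_evidence(
    scores,
    linked_entities={"African_American"},
    title_entities={"African_American"},
)
```

With these inputs the ethnicity characteristic moves ahead of the death place, which is the intended effect; how strongly to weight each kind of evidence remains an open choice.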

Page 10: Extending DBpedia with Wikipedia List Pages

Lists of Lists in Wikipedia

• Wikipedia also knows ~600 lists of lists

– organize lists

– form a hierarchy

• E.g.:

– Lists of Writers

– Lists of American writers

– List of African American writers

Page 11: Extending DBpedia with Wikipedia List Pages

From Lists of Lists to an Extended Ontology

• Idea:

– find corresponding lists of... pages for DBpedia classes

– extend hierarchy

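[Figure: class hierarchy with corresponding Wikipedia pages; other branches shown as "..."]

DBpedia Ontology: owl:Thing > Agent > Person > Artist > Writer

Extended Ontology: Writer > American Writer > African-American Writer

Corresponding Wikipedia pages:

– Writer: Lists of Writers

– American Writer: Lists of American Writers

– African-American Writer: List of African-American Writers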
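A rough sketch of how subclass relations might be derived from lists of lists. It assumes the containment between pages is already known and follows the example chain from the slides, and that a class name can be obtained by naively stripping the "List(s) of" prefix and a trailing "s"; both helper names are hypothetical.

```python
import re


def title_to_class(title):
    """'Lists of American writers' -> 'American writer' (naive heuristic)."""
    match = re.match(r"Lists? of (.+)$", title)
    if match is None:
        return None
    name = match.group(1)
    return name[:-1] if name.endswith("s") else name


def subclass_edges(children_of):
    """Turn page containment into (subclass, superclass) pairs.

    children_of: dict of lists-of-lists page title -> titles of pages it contains.
    """
    edges = []
    for parent_page, child_pages in children_of.items():
        parent = title_to_class(parent_page)
        for child_page in child_pages:
            child = title_to_class(child_page)
            if parent and child:
                edges.append((child, parent))
    return edges


# Containment assumed to follow the chain used as the example in the slides.
children_of = {
    "Lists of writers": ["Lists of American writers"],
    "Lists of American writers": ["List of African-American writers"],
}
edges = subclass_edges(children_of)
```

Attaching the top of such a chain to an existing DBpedia class (here, Writer) would still require a separate matching step that this sketch does not cover.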

Page 12: Extending DBpedia with Wikipedia List Pages

Potential of the Idea

• Given that we extract everything correctly from List of African American writers, we get

– 814 new type statements (only DBpedia ontology)

– 1409 new property assertions

– two entirely new instances

• ...and there are ~600,000 list pages

– extrapolation: we can roughly double the information in DBpedia

• many list pages contain extra information

– e.g., birth places and birth dates of persons

Page 13: Extending DBpedia with Wikipedia List Pages

Challenges

• Robust extraction of instances

– from different kinds of list pages

– e.g., picking the right column in a table

– tables and bullet point lists already account for 75% of list pages

• Picking good scoring functions

– TF-IDF seems not bad at first glance

• Combining statistical and textual evidence

• Scalable implementation

– Advantage: perfectly parallelizable

Page 14: Extending DBpedia with Wikipedia List Pages

Extending DBpedia with Wikipedia List Pages

Heiko Paulheim, Simone Paolo Ponzetto