DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight: Shedding Light on the Web of Documents. Pablo N. Mendes, Max Jakob, Andrés Garcia-Silva, Christian Bizer ([email protected]). I-SEMANTICS, Graz, Austria, September 9th 2011.

description

DBpedia Spotlight: a configurable annotation tool to support a variety of use cases. Given input text in English, we extract DBpedia Resources and generate annotations according to user-provided configuration parameters. These parameters can include score thresholds, entity types, and even arbitrary "type" definitions through SPARQL queries. This is the presentation at the best paper award session at I-SEMANTICS 2011.

Transcript of DBpedia Spotlight at I-SEMANTICS 2011

Page 1: DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight: Shedding Light on the Web of Documents

Pablo N. Mendes, Max Jakob, Andrés Garcia-Silva, Christian Bizer, [email protected]

I-SEMANTICS, Graz, Austria, September 9th 2011

Page 2: DBpedia Spotlight at I-SEMANTICS 2011

Mendes, Jakob, Garcia-Silva, Bizer. DBpedia Spotlight: Shedding Light on the Web of Documents

Web Based Systems Group, FREIE UNIVERSITÄT BERLIN

Agenda

• What is text annotation?
• What can you build with it?
• Why is it difficult?
• How did we approach the challenge?
• How well did it work?
• What are the next steps?

Page 3: DBpedia Spotlight at I-SEMANTICS 2011

WHAT IS IT?

Page 4: DBpedia Spotlight at I-SEMANTICS 2011

Text Annotation

• From:

(…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.

• To:

(…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.

Annotations:
http://dbpedia.org/resource/New_York_City
http://dbpedia.org/resource/Apple_Corps

Page 5: DBpedia Spotlight at I-SEMANTICS 2011

Challenge: Term Ambiguity

• ...this apple on the palm of my hand...
• ...Apple tried to acquire Palm Inc....
• ...eating an apple, sitting by a palm tree...

• What do “apple” and “palm” mean in each case?

• Our objective is to recognize entities and disambiguate their meaning, generating DBpedia annotations in text.

Page 6: DBpedia Spotlight at I-SEMANTICS 2011

What can you do with annotations?

• Links to complementary information
  – “More about this”

• Faceted browsing of blog posts
  – Show only posts with topics related to Sports

• Rich snippets on Google
  – Search engines start to display info from annotations

• More expressive filtering of information streams
  – Twarql (entry at I-SEMANTICS 2010 Challenge)

Page 7: DBpedia Spotlight at I-SEMANTICS 2011

Rich Snippets

• Search Engines already benefit from some kinds of annotations

http://www.google.com/webmasters/tools/richsnippets

Page 8: DBpedia Spotlight at I-SEMANTICS 2011

Twarql Example Use Case

• What competitors of my product are being mentioned with my product on Twitter?

SELECT ?competitor
WHERE {
  dbpedia:IPad skos:subject ?category .
  ?competitor skos:subject ?category .
  ?tweet moat:taggedWith ?competitor .
  ?tweet moat:taggedWith dbpedia:IPad .
}

- comparative opinion!

Page 9: DBpedia Spotlight at I-SEMANTICS 2011

Twarql Example Use Case (2)

[Diagram: an incoming micropost (?tweet, e.g. "@anonymized Lorem ipsum bla bla this is an example tweet") is connected by moat:taggedWith to dbpedia:IPad and to ?competitor. In the background knowledge (e.g. DBpedia), dbpedia:IPad and ?competitor are each connected by skos:subject to a shared ?category.]

Competition is modeled as two products in the same category in DBpedia

Page 10: DBpedia Spotlight at I-SEMANTICS 2011

Twarql Example Use Case (3)

[Diagram: same graph as the previous slide, with ?category now shown with example values category:Wi-Fi and category:Touchscreen. The incoming micropost (?tweet) is connected by moat:taggedWith to dbpedia:IPad and ?competitor, which are connected by skos:subject to the categories in the background knowledge (e.g. DBpedia).]

Background knowledge is dynamically “brought into” microposts.

Page 11: DBpedia Spotlight at I-SEMANTICS 2011

Twarql Example Use Case (4)

[Diagram: same graph as the previous slide. The micropost (?tweet, "@anonymized Lorem ipsum bla bla this is an example tweet") is connected by moat:taggedWith to dbpedia:IPad and ?competitor, which are connected by skos:subject to category:Wi-Fi and category:Touchscreen in the background knowledge (e.g. DBpedia).]

Trigger action if micropost matches constraints.

Page 12: DBpedia Spotlight at I-SEMANTICS 2011

DBpedia Spotlight

• DBpedia is a collection of entity descriptions extracted from Wikipedia & shared as linked data

• DBpedia Spotlight uses data from DBpedia and text from associated Wikipedia pages

• Learns how to recognize that a DBpedia resource was mentioned

• Given plain text as input, generates annotated text

Page 13: DBpedia Spotlight at I-SEMANTICS 2011

WHY IS IT DIFFICULT?

Page 14: DBpedia Spotlight at I-SEMANTICS 2011

Dataset overview

• Volume of Wikipedia
  – 56.9 GB in raw text data

• Occurrences of Ambiguous Terms in Wikipedia: 58.8%

• Sparsity: less data for some DBpedia resources

Page 15: DBpedia Spotlight at I-SEMANTICS 2011

Histogram: URI occurrences

[Histogram of URI occurrences; axis: log(n(uri))]

• Many “rare” URIs (few links on Wikipedia)

• Few “popular” URIs (lots of links on Wikipedia)

• Most previous work deals with these entities: People, Organization, Location

Page 16: DBpedia Spotlight at I-SEMANTICS 2011

Ambiguity

URI | # of S.F. | Example Surface Forms
Case_citation | 1220 | not yet reported; AC 429; AC 847; …
United_States | 283 | United Stated; U.S. national; American independence; …
Roman_numerals | 212 | Roman numeral system; MDCXLII; …
Gramophone_record | 203 | 7 inch single; 45 rpm record; …
Indigenous_peoples_of_the_Americas | 192 | Native American People; Indigenous cultures;
Music_recording_sales_certification | 190 | certified gold; silver status; PlatinuM record;
Billboard_Hot_100 | 190 | the Billboard Hot 100 chart; #1; U.S.;
World_War_II | 183 | 39-45 war; War years; 1939-45 war; war-torn;
Operation_Barbarossa | 174 | invasion of Soviet Russia; German attack on the Soviet Union in June 1941;
United_States_presidential_election%2C_2008 | 173 | Election Day; US election;
Atomic_bombings_of_Hiroshima_and_Nagasaki | 172 | bombs that were dropped; dropping; Atomic Bomb;

What are the most ambiguous surface forms?

Page 17: DBpedia Spotlight at I-SEMANTICS 2011

Name Variation

Surface Form | Number of URIs | Example URIs
2008 | 1199 | Boxing_at_the_2008_Summer_Olympics; Hockey_at_the_2008_Summer_Olympics; ...
2007 | 1197 | 2007_Cleveland_Indians_season; 2007_NBA_All-Star_Game; ...
2006 | 1151 |
2004 | 1033 |
2009 | 1002 |
2005 | 996 |
2003 | 872 | Windows_2003;
2002 | 848 |
John | 847 |
1999 | |
2001 | |
2000 | |
same name | 719 |
... | |
President | 560 | Franklin_Delano_Roosevelt...

What are the URIs with many surface forms?

Page 18: DBpedia Spotlight at I-SEMANTICS 2011

HOW DID WE APPROACH THE CHALLENGE?

Page 19: DBpedia Spotlight at I-SEMANTICS 2011

A 4-stage approach

• Spotting

• Candidate Mapping

• Disambiguation

• Linking

Page 20: DBpedia Spotlight at I-SEMANTICS 2011

Stage 1: Spotting

• Find substrings that seem worthy of annotation

• Naïve implementation (impractical)
  – all n-grams of length (1,|text|)

Input: (…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.

Output: “Lennon”, “McCartney”, “New York”, “Apple Corps”
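To see why the naïve approach is impractical, the following minimal sketch enumerates every contiguous token n-gram of a text. It assumes whitespace tokenization, which the slides do not specify; the function name `all_ngrams` is made up for illustration.

```python
def all_ngrams(text):
    """Enumerate every contiguous token n-gram, for n from 1 to the number of tokens."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    spans = []
    for n in range(1, len(tokens) + 1):
        for start in range(len(tokens) - n + 1):
            spans.append(" ".join(tokens[start:start + n]))
    return spans


text = "Lennon and McCartney went to New York"
candidates = all_ngrams(text)
# A text with k tokens yields k * (k + 1) / 2 candidate spans.
print(len(candidates))
```

The number of candidate spans grows quadratically with the text length, and almost none of them are worth annotating, which is the motivation for a lexicon-driven spotter.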

Page 21: DBpedia Spotlight at I-SEMANTICS 2011

Spotting in DBpedia Spotlight

• Detect that the label (surface form) of a DBpedia Resource was mentioned
  – Lexicalized, Aho-Corasick algorithm (LingPipe)
  – Name variations from redirects, disambiguation pages, anchor texts

• Advantages:
  – Simple implementation, well-studied problem
  – Produces a reduced set of spots
  – Relies on user-provided terms

• Drawback:
  – High memory requirements (~7 GB)
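The idea of lexicon-based spotting with Aho-Corasick can be sketched as follows. This is a from-scratch, character-level illustration, not the LingPipe API that the slides mention; the lexicon is a toy list, matching is case-sensitive, and word boundaries are not checked. All function names are hypothetical.

```python
from collections import deque


def build_automaton(surface_forms):
    """Build a trie with failure links over the given surface forms."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link per node
    out = [[]]    # surface forms that end at each node
    for form in surface_forms:
        node = 0
        for ch in form:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(form)
    # Breadth-first pass: nodes at depth 1 keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target (proper suffixes).
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def spot(text, automaton):
    """Return (start_offset, surface_form) for every lexicon match in text."""
    goto, fail, out = automaton
    node = 0
    spots = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for form in out[node]:
            spots.append((i - len(form) + 1, form))
    return spots


lexicon = ["New York", "York", "Apple Corps", "Apple"]  # toy lexicon
text = "went to New York to announce the formation of Apple Corps."
spots = spot(text, build_automaton(lexicon))
print(spots)
```

The text is scanned once regardless of lexicon size, but the whole lexicon has to live in the automaton, which is consistent with the memory drawback noted on the slide. Overlapping matches such as "York" inside "New York" are all reported; choosing among them is left to later stages in this sketch.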

Page 22: DBpedia Spotlight at I-SEMANTICS 2011

Stage 2: Candidate Mapping

• What are the possible senses of a given surface form (the candidate DBpedia resources)?

Input: “Lennon”, “McCartney”, “New York”, “Apple Corps”

Output:
“Lennon”: { Lennon_(album), Lennon,_Michigan, … }
“McCartney”: { McCartney(surname), Paul_McCartney, … }
“New York”: { New_York_State, New_York_City, … }
“Apple Corps”: { Apple_Corps }

Page 23: DBpedia Spotlight at I-SEMANTICS 2011

Candidate Mapping in DBpedia Spotlight

• Sources of mappings between surface forms and DBpedia Resources
  – Page titles offer “chosen names” for resources
  – Redirects offer alternative spellings, aliases, etc.
  – Disambiguation Pages: link a common term to many resources
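The three sources above can be merged into a single lookup from surface form to candidate resources. The sketch below is only an illustration: the function name, the input shapes, and the toy entries are assumptions, not the actual extraction format used by DBpedia Spotlight.

```python
from collections import defaultdict


def build_candidate_map(titles, redirects, disambiguations):
    """Merge titles, redirects and disambiguation pages into surface form -> set of URIs."""
    candidates = defaultdict(set)
    for uri, title in titles.items():          # page title: the "chosen name"
        candidates[title].add(uri)
    for alias, uri in redirects.items():       # redirect: alternative spelling or alias
        candidates[alias].add(uri)
    for term, uris in disambiguations.items(): # disambiguation page: one term, many resources
        candidates[term].update(uris)
    return candidates


# Toy data for illustration only (not taken from a real Wikipedia dump).
titles = {"Apple_Corps": "Apple Corps", "New_York_City": "New York City"}
redirects = {"NYC": "New_York_City"}
disambiguations = {"New York": ["New_York_State", "New_York_City"]}

candidate_map = build_candidate_map(titles, redirects, disambiguations)
print(sorted(candidate_map["New York"]))
```

A surface form that maps to a single resource, such as "Apple Corps" here, needs no disambiguation; the others are passed on to the next stage.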

Page 24: DBpedia Spotlight at I-SEMANTICS 2011

Stage 3: Disambiguation

• Select the correct candidate DBpedia Resource for a given surface form.

• Decision is made based on the context (1) in which the surface form was mentioned

(1) con·text, n. 1. The parts of a discourse that surround a word or passage and can throw light on its meaning. 2. The circumstances in which an event occurs; a setting. http://mw1.merriam-webster.com/dictionary/context

Page 25: DBpedia Spotlight at I-SEMANTICS 2011

Learning the Context for a resource

• Collect context for DBpedia Resources from Wikipedia

• Types of context
  – Wikipedia Pages
  – Definitions from disambiguation pages
  – Paragraphs that link to resources

(…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.

Page 26: DBpedia Spotlight at I-SEMANTICS 2011

Disambiguation in DBpedia Spotlight

• Model DBpedia Resources as vectors of terms found in Wikipedia text

• Define functions for term scoring and vector similarity (e.g. frequency and cosine)

• Rank candidate resource vectors based on their similarity with vector of input text

• Choose highest ranking candidate

Lennon = {Beatles, McCartney, rock, guitar, ...}

Lennon = {tf(Beatles)=320, tf(McCartney)=100, ...}

Cos(Input, Lennon) = 0.12
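The ranking step described above can be sketched with sparse term-frequency vectors and cosine similarity. The candidate vectors below are toy values (only the two term frequencies for Beatles and McCartney come from the slide), the URIs are used as illustrative keys, and lowercased whitespace tokenization is an assumption.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context_text, candidate_vectors):
    """Return the candidate whose term vector is most similar to the input context."""
    query = Counter(context_text.lower().split())  # raw term frequencies of the input
    ranked = sorted(
        candidate_vectors,
        key=lambda uri: cosine(query, candidate_vectors[uri]),
        reverse=True,
    )
    return ranked[0]


# Toy context vectors; keys and most values are hypothetical.
candidate_vectors = {
    "John_Lennon": {"beatles": 320, "mccartney": 100, "rock": 40, "guitar": 25},
    "Lennon,_Michigan": {"village": 30, "michigan": 60, "county": 20},
}
context = "Lennon and McCartney went to New York with the Beatles"
print(disambiguate(context, candidate_vectors))
```

Swapping the raw term frequencies for weighted scores only changes how the dict values are computed; the ranking code stays the same.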

Page 27: DBpedia Spotlight at I-SEMANTICS 2011

Scoring Strategies

• TF*IDF (Term Freq. * Inverse Doc. Freq.)
  – TF: insight into the relevance of the term in the context of a DBpedia Resource
  – IDF: insight into the rarity of the term. Co-occurrence of rare terms is more informative
• ICF: Inverse Candidate Frequency
  – IDF is the “rarity” in the entire Wikipedia
  – ICF is the rarity of a word with relation to the possible senses only
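The difference between IDF and ICF can be made concrete with a small sketch. The slides do not give a formula, so the code assumes an IDF-like definition restricted to the candidate set: the logarithm of the number of candidates divided by the number of candidates whose context contains the term. The function name and the toy vectors are hypothetical.

```python
import math


def icf(term, candidate_vectors):
    """Inverse Candidate Frequency (assumed form): log(|candidates| / |candidates containing term|)."""
    containing = sum(1 for vector in candidate_vectors.values() if term in vector)
    if containing == 0:
        return 0.0  # the term says nothing about these candidates
    return math.log(len(candidate_vectors) / containing)


# Toy context vectors for three candidate senses of one surface form.
candidate_vectors = {
    "sense_a": {"music": 10, "album": 7},
    "sense_b": {"music": 3, "village": 9},
    "sense_c": {"music": 5, "surname": 4},
}
print(icf("music", candidate_vectors))    # shared by all senses, so it does not discriminate
print(icf("village", candidate_vectors))  # found in one sense only, so it discriminates well
```

Under this assumption a word can be rare across all of Wikipedia (high IDF) and still be useless for choosing among the candidates of one surface form if every candidate mentions it.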

Page 28: DBpedia Spotlight at I-SEMANTICS 2011

Linking (Configuration)

• Decide which spots to annotate with links to the disambiguated resources

• Different use cases have different needs
  – Only annotate prominent resources?
  – Only if you’re sure disambiguation is correct?
  – Only people?
  – Only things related to Berlin?

Page 29: DBpedia Spotlight at I-SEMANTICS 2011

Linking in DBpedia Spotlight

• Can be configured based on:
  – Thresholds
    • Confidence
    • Prominence (support)
  – Whitelist or Blacklist of types
    • Hide all people, Show only organizations
  – Complex definition of a “type” through a SPARQL query.
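The threshold and type filters listed above could be applied to disambiguated spots along these lines. The class names, field names, and value ranges are hypothetical and chosen only for illustration; the SPARQL-defined type option is not covered by this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Annotation:
    surface_form: str
    uri: str
    confidence: float      # assumed scale: 0.0 (unsure) to 1.0 (sure)
    support: int           # prominence, e.g. number of Wikipedia links to the resource
    types: FrozenSet[str]


@dataclass(frozen=True)
class LinkingConfig:
    min_confidence: float = 0.0
    min_support: int = 0
    whitelist: FrozenSet[str] = frozenset()  # empty whitelist means "allow every type"
    blacklist: FrozenSet[str] = frozenset()


def link(annotations, config):
    """Keep only the annotations that satisfy the user-provided configuration."""
    kept = []
    for annotation in annotations:
        if annotation.confidence < config.min_confidence:
            continue
        if annotation.support < config.min_support:
            continue
        if config.whitelist and not (annotation.types & config.whitelist):
            continue
        if annotation.types & config.blacklist:
            continue
        kept.append(annotation)
    return kept


# Toy annotations with made-up scores.
annotations = [
    Annotation("McCartney", "Paul_McCartney", 0.9, 5000, frozenset({"Person"})),
    Annotation("Apple Corps", "Apple_Corps", 0.8, 300, frozenset({"Organisation"})),
    Annotation("New York", "New_York_City", 0.3, 90000, frozenset({"Place"})),
]
only_orgs = link(annotations, LinkingConfig(min_confidence=0.5, whitelist=frozenset({"Organisation"})))
print([a.uri for a in only_orgs])
```

Keeping the configuration separate from the disambiguation output means the same annotations can be re-filtered for different use cases without rerunning the earlier stages.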

Page 30: DBpedia Spotlight at I-SEMANTICS 2011

HOW WELL DID IT WORK?

Page 31: DBpedia Spotlight at I-SEMANTICS 2011

Evaluation: Disambiguation

• Used held out (unseen) Wikipedia occurrences as test data

• Evaluates accuracy of disambiguation stage
• Baselines
  – Random: performs well with low ambiguity
  – Default Sense: only prominence, without context
  – Default Similarity (TF*IDF): Lucene implementation
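The first two baselines are simple enough to sketch directly. The function names and the prominence counts are hypothetical; "prominence" is taken here to be a per-resource count such as the number of Wikipedia links, which is an assumption about how the default sense is chosen.

```python
import random


def random_baseline(candidates, rng=random):
    """Pick any candidate uniformly at random, ignoring context and prominence."""
    return rng.choice(sorted(candidates))


def default_sense_baseline(candidates, support):
    """Pick the most prominent candidate, ignoring the context of the mention."""
    return max(sorted(candidates), key=lambda uri: support.get(uri, 0))


candidates = {"New_York_State", "New_York_City"}
support = {"New_York_City": 900, "New_York_State": 400}  # made-up prominence counts
print(default_sense_baseline(candidates, support))
print(random_baseline(candidates, random.Random(0)))
```

With a single candidate both baselines are trivially right, which is why the random baseline looks good when ambiguity is low.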

Page 32: DBpedia Spotlight at I-SEMANTICS 2011

Disambiguation Evaluation Results

Page 33: DBpedia Spotlight at I-SEMANTICS 2011

Evaluation: Annotation

• News text, different topics
• Hand-annotated examples by 4 annotators
• Gold standard from agreement
• Evaluates precision and recall of annotations.
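Precision and recall against a gold standard can be computed as in the sketch below. It assumes each annotation is represented as a (surface form, URI) pair and that an annotation counts as correct only on an exact match; the slides do not state the matching criterion, and the sample data is made up.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted annotations against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example: three system annotations, four gold annotations, two in common.
gold = {
    ("Lennon", "John_Lennon"),
    ("McCartney", "Paul_McCartney"),
    ("New York", "New_York_City"),
    ("Apple Corps", "Apple_Corps"),
}
predicted = {
    ("New York", "New_York_City"),
    ("Apple Corps", "Apple_Corps"),
    ("Lennon", "Lennon,_Michigan"),
}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Raising the confidence or support thresholds in the linking stage typically trades recall for precision, which is why both are reported.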

Page 34: DBpedia Spotlight at I-SEMANTICS 2011

Annotation Evaluation Results (2)

Page 35: DBpedia Spotlight at I-SEMANTICS 2011

Annotation Evaluation Results

Page 36: DBpedia Spotlight at I-SEMANTICS 2011

Conclusions

• DBpedia Spotlight: a configurable annotation tool to support a variety of use cases

• Very simple methods work surprisingly well for disambiguation

• More work is needed to alleviate sparsity
• Most challenging step is linking
• More evaluation on larger annotation datasets is needed

Page 37: DBpedia Spotlight at I-SEMANTICS 2011

WHAT ARE THE NEXT STEPS?

Page 38: DBpedia Spotlight at I-SEMANTICS 2011

A preview of next release

• CORS-enabled + jQuery client
  – One line to annotate any web page: $("div").annotate()

• A new demo interface: based on the plugin
• Types: DBpedia 3.7, Freebase, Schema.org
• New configuration parameters
  – E.g. perform smarter spotting

• Easier install: maven2, jar, Debian package

Page 39: DBpedia Spotlight at I-SEMANTICS 2011

Preview:

Temporarily available for I-SEMANTICS 2011

http://spotlight.dbpedia.org/dev/demo

Page 40: DBpedia Spotlight at I-SEMANTICS 2011

Future work

• Internationalization (German, Spanish, ...)
• More sophisticated spotting
• New disambiguation strategies
  – Global disambiguation: one disambiguation decision helps the other decisions
• Sparsity problems: try smoothing, dimensionality reduction, etc.
• Store user feedback, learn from mistakes

Page 41: DBpedia Spotlight at I-SEMANTICS 2011

We are open

• Tell us about your use cases
• Hack something with us
  – Drupal/WordPress Plugin
  – Semantic MediaWiki integration

• Are you a good engineer?
  – Help us make it faster, smaller!

• Are you a good researcher?
  – Let’s collaborate on your/our ideas.

Licensed as Apache v2.0 (Business friendly)

Page 42: DBpedia Spotlight at I-SEMANTICS 2011

Thank you!

• On Twitter: @pablomendes
• E-mail: [email protected]
• Web: http://pablomendes.com

• Special thanks to Jo Daiber (working with us for the next release)

• Partially funded by LOD2.eu and Neofonie GmbH

http://spotlight.dbpedia.org