From Unstructured Information to Linked Data

FROM UNSTRUCTURED INFORMATION TO LINKED DATA
Axel Ngonga, Head of SIMBA@AKSW, University of Leipzig
IASLOD, August 15/16th 2012


Page 1: From Unstructured Information to Linked Data

FROM UNSTRUCTURED INFORMATION TO LINKED DATA

Axel Ngonga
Head of SIMBA@AKSW
University of Leipzig
IASLOD, August 15/16th 2012

Page 2: From Unstructured Information to Linked Data

Motivation

Page 3: From Unstructured Information to Linked Data

Motivation
• Where does the LOD Cloud come from?
  • Structured data → Triplify, D2R
  • Semi-structured data → DBpedia
  • Unstructured data → ???
• Unstructured data make up 80% of the Web
• How do we extract Linked Data from unstructured data sources?

Page 4: From Unstructured Information to Linked Data

Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion

NB: We will mainly be concerned with the newest developments.

Page 5: From Unstructured Information to Linked Data

Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion

Page 6: From Unstructured Information to Linked Data

Problem Definition
• Simple(?) problem: given a text fragment, retrieve
  • all entities and
  • relations between these entities automatically, plus
  • "ground them" in an ontology
• Also coined Knowledge Extraction

John Petrucci was born in New York.

:John_Petrucci dbo:birthPlace :New_York .
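The end-to-end output sketched on this slide can be made concrete. The following is a minimal sketch, assuming the entity and relation names from the slide's example; in a real pipeline they would be produced by NER, relation extraction, and disambiguation, and the serializer would be a proper RDF library.

```python
# Serialize one extracted fact as an N-Triples-style statement.
# Names mirror the slide's example; the helper itself is hypothetical.

def to_triple(subject, predicate, obj):
    """Build ':subject predicate :object .' from extracted parts."""
    return f":{subject} {predicate} :{obj} ."

triple = to_triple("John_Petrucci", "dbo:birthPlace", "New_York")
print(triple)  # :John_Petrucci dbo:birthPlace :New_York .
```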

Page 7: From Unstructured Information to Linked Data

Problems
1. Finding entities → Named Entity Recognition
2. Finding relation instances → Relation Extraction
3. Finding URIs → URI Disambiguation

Page 8: From Unstructured Information to Linked Data

Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion

Page 9: From Unstructured Information to Linked Data

Named Entity Recognition
• Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].

Page 10: From Unstructured Information to Linked Data

Named Entity Recognition
• Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment
• Common sets of classes
  • CoNLL03: Person, Location, Organization, Miscellaneous
  • ACE05: Facility, Geo-Political Entity, Location, Organisation, Person, Vehicle, Weapon
  • BioNLP2004: Protein, DNA, RNA, cell line, cell type
• Several approaches
  • Direct solutions (single algorithms)
  • Ensemble Learning

Page 11: From Unstructured Information to Linked Data

NER: Overview of approaches
• Dictionary-based
• Hand-crafted Rules
• Machine Learning
  • Hidden Markov Models (HMMs)
  • Conditional Random Fields (CRFs)
  • Neural Networks
  • k Nearest Neighbors (kNN)
  • Graph Clustering
• Ensemble Learning
  • Veto-Based (Bagging, Boosting)
  • Neural Networks

Page 12: From Unstructured Information to Linked Data

NER: Dictionary-based
• Simple Idea
  1. Define mappings between words and classes, e.g., Paris → Location
  2. Try to match each token from each sentence
  3. Return the matching entities
✓ Time-efficient at runtime
× Manual creation of gazetteers
× Low precision (Paris = Person, Location)
× Low recall (esp. on Persons and Organizations as the number of instances grows)
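The three steps above can be sketched in a few lines. This is a toy gazetteer-based tagger with an invented dictionary; it also makes the precision problem visible, since "Paris" is always tagged LOC regardless of context.

```python
# Hypothetical gazetteer mapping surface strings to NER classes.
GAZETTEER = {"Paris": "LOC", "John": "PER", "Leipzig": "LOC"}

def tag(sentence):
    """Return (token, class) pairs for every token found in the gazetteer."""
    return [(tok, GAZETTEER[tok])
            for tok in sentence.split()
            if tok in GAZETTEER]

print(tag("John moved to Paris"))  # [('John', 'PER'), ('Paris', 'LOC')]
```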

Page 13: From Unstructured Information to Linked Data

NER: Rule-based
• Simple Idea
  1. Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION].
  2. Try to match each sentence to one or several rules
  3. Return the matching entities
✓ High precision
× Manual creation of rules is very tedious
× Low recall (finite number of patterns)
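The slide's single rule can be approximated with a regular expression. A sketch, assuming names and locations are runs of capitalized words; real rule sets encode many more patterns and constraints.

```python
import re

# One hand-crafted rule: "[PERSON] was born in [LOCATION]."
# Capitalized multi-word spans approximate the PERSON/LOCATION slots.
RULE = re.compile(
    r"^([A-Z]\w+(?: [A-Z]\w+)*) was born in ([A-Z]\w+(?: [A-Z]\w+)*)\.$"
)

def apply_rule(sentence):
    """Return tagged entities if the sentence matches the rule."""
    m = RULE.match(sentence)
    if m:
        return [(m.group(1), "PER"), (m.group(2), "LOC")]
    return []

print(apply_rule("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```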

Page 14: From Unstructured Information to Linked Data

NER: Markov Models
• Stochastic process such that (Markov Property)
  P(X_{t+1} = S_j | X_t = S_i, X_{t-1}, …, X_0) = P(X_{t+1} = S_j | X_t = S_i)
• Equivalent to a finite-state machine
• Formally consists of
  • Set S of states S_1, …, S_n
  • Matrix M such that m_ij = P(X_{t+1} = S_j | X_t = S_i)
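The transition matrix M can be written down directly. A minimal sketch with invented states and probabilities; each row holds the outgoing transition probabilities of one state and must sum to 1.

```python
# Toy Markov model over NER tag states; O marks tokens outside any entity.
STATES = ["PER", "LOC", "O"]
M = [
    [0.5, 0.1, 0.4],  # transitions from PER
    [0.1, 0.5, 0.4],  # transitions from LOC
    [0.2, 0.2, 0.6],  # transitions from O
]

def next_state_prob(current, nxt):
    """m_ij = P(X_{t+1} = S_j | X_t = S_i) looked up from the matrix."""
    return M[STATES.index(current)][STATES.index(nxt)]

print(next_state_prob("O", "PER"))  # 0.2
```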

Page 15: From Unstructured Information to Linked Data

NER: Hidden Markov Models
• Extension of Markov Models
  • States are hidden and assigned an output function
  • Only the output is seen
  • Transitions are learned from training data
• How do they work?
  • Input: Discrete sequence of features (e.g., POS tags, word stems, etc.)
  • Goal: Find the best sequence of states that represents the input
  • Output: hopefully the right classification of each token

[Figure: hidden states S0, S1, …, Sn emitting the tag sequence PER, _, LOC]

Page 16: From Unstructured Information to Linked Data

NER: k Nearest Neighbors
• Idea
  • Describe each token q from a labelled training data set with a set of features (e.g., left and right neighbors)
  • Each new token t is described with the same features
  • Assign t the class of its k nearest neighbors
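The idea fits in a few lines. A minimal kNN sketch over hand-made token feature vectors; the features, training data, and distance are invented for illustration.

```python
from collections import Counter

# (feature vector, class) pairs standing in for labelled tokens.
TRAIN = [
    ((1, 0), "PER"), ((1, 1), "PER"), ((0, 1), "LOC"), ((0, 0), "LOC"),
]

def classify(features, k=3):
    """Assign the majority class of the k nearest training examples."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(TRAIN, key=lambda ex: dist(ex[0], features))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(classify((1, 0)))  # PER
```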

Page 17: From Unstructured Information to Linked Data

NER: So far …
• "Simple approaches"
  • Apply one algorithm to the NER problem
  • Bound to be limited by the assumptions of the model
• Implemented by a large number of tools
  • Alchemy
  • Stanford NER
  • Illinois Tagger
  • Ontos NER Tagger
  • LingPipe
  • …

Page 18: From Unstructured Information to Linked Data

NER: Ensemble Learning
• Intuition: Each algorithm has its strengths and weaknesses
• Idea: Use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy

[Figure: combining dictionary-based approaches, pattern-based approaches, Conditional Random Fields, and Support Vector Machines]

Page 19: From Unstructured Information to Linked Data

NER: Ensemble Learning
• Idea: Merge the results of several approaches to improve results
• Simplest approaches:
  • Voting
  • Weighted voting

[Figure: Input → System 1, System 2, …, System n → Merger → Output]
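The merger in the figure above, in its simplest (weighted) voting form, can be sketched as follows. Systems and weights are invented; each system contributes one class label per token and the label with the highest total weight wins.

```python
from collections import Counter

def weighted_vote(predictions, weights=None):
    """Merge per-token class labels from several systems by weighted vote."""
    weights = weights or [1.0] * len(predictions)  # plain voting by default
    scores = Counter()
    for label, w in zip(predictions, weights):
        scores[label] += w
    return scores.most_common(1)[0][0]

# Three systems disagree on one token; the more trusted ones win.
print(weighted_vote(["PER", "LOC", "PER"], weights=[0.9, 0.6, 0.7]))  # PER
```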

Page 20: From Unstructured Information to Linked Data

NER: Ensemble Learning
• When does it work?
• Accuracy
  • Need for existing solutions to be "good"
  • Merging random results leads to random results
  • Given: current approaches reach 80% F-Score
• Diversity
  • Need for the smallest possible amount of correlation between approaches
  • E.g., merging two HMM-based taggers won't help
  • Given: large number of approaches for NER

Page 21: From Unstructured Information to Linked Data

NER: FOX
• Federated Knowledge Extraction Framework
• Idea: Apply ensemble learning to NER
• Classical approach: Voting
  • Does not make use of systematic error
  • Partly difficult to train
• Use neural networks instead
  • Can make use of systematic errors
  • Easy to train
  • Converge fast
• http://fox.aksw.org

Page 22: From Unstructured Information to Linked Data

NER: FOX

Page 23: From Unstructured Information to Linked Data

NER: FOX on MUC7

Page 24: From Unstructured Information to Linked Data

NER: FOX on MUC7

Page 25: From Unstructured Information to Linked Data

NER: FOX on Website Data

Page 26: From Unstructured Information to Linked Data

NER: FOX on Website Data

Page 27: From Unstructured Information to Linked Data

NER: FOX on Companies and Countries

✓ No runtime issues (parallel implementation)
✓ NN overhead is small
× Overfitting

Page 28: From Unstructured Information to Linked Data

NER: Summary
• Large number of approaches
  • Dictionaries
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
  • …
✓ Combining approaches leads to better results than single algorithms

Page 29: From Unstructured Information to Linked Data

Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion

Page 30: From Unstructured Information to Linked Data

RE: Problem Definition
• Find the relations between NEs if such relations exist.
• NEs are not always given a priori (open vs. closed RE)

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn([John Petrucci, PER], [New York, LOC]).

Page 31: From Unstructured Information to Linked Data

RE: Approaches
• Hand-crafted rules
• Pattern Learning
• Coupled Learning

Page 32: From Unstructured Information to Linked Data

RE: Pattern-based
• Hearst patterns [Hearst: COLING'92]
  • POS-enhanced regular expression matching in natural-language text

NP0 {,} such as {NP1, NP2, …, (and|or)} NPn
NP0 {,} {NP1, NP2, …, NPn-1} {,} or other NPn

"The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." → isA("Bambara ndang", "bow lute")

✓ Time-efficient at runtime
× Very low recall
× Not adaptable to other relations
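The first Hearst pattern can be approximated over plain text. A sketch only: a real implementation matches POS-tagged noun phrases, while here NPs are crudely approximated as word-character runs with an optional leading article, which is enough for the slide's example.

```python
import re

# Rough plain-text stand-in for "NP0 such as NP1": the hypernym before
# "such as", the first hyponym after it, articles stripped.
PATTERN = re.compile(r"(?:[Tt]he )?(\w[\w ]*?),? such as (?:the )?([\w ]+?),")

def hearst_isa(text):
    """Return (isA, hyponym, hypernym) tuples found via the pattern."""
    return [("isA", hypo.strip(), hyper.strip())
            for hyper, hypo in PATTERN.findall(text)]

text = ("The bow lute, such as the Bambara ndang, is plucked and has "
        "an individual curved neck for each string.")
print(hearst_isa(text))  # [('isA', 'Bambara ndang', 'bow lute')]
```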

Page 33: From Unstructured Information to Linked Data

RE: DIPRE
• DIPRE = Dual Iterative Pattern Relation Extraction
• Semi-supervised, iterative gathering of facts and patterns
• Positive & negative examples as seeds for a given target relation
  • e.g. +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google)
• Various tuning parameters for pruning low-confidence patterns and facts
• Extended to Snowball / QXtract

[Figure: seed pairs (Hillary, Bill) and (Carla, Nicolas) yield patterns such as "X and her husband Y", "X and Y on their honeymoon", "X and Y and their children", "X has been dating with Y", "X loves Y", which in turn extract new pairs such as (Angelina, Brad) and (Victoria, David), plus the false positive (Larry, Google)]
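One DIPRE-style iteration can be sketched as a toy: seed facts induce textual patterns, and the patterns extract new facts from the corpus. The corpus, seeds, and helper names are invented for illustration; real DIPRE also scores and prunes low-confidence patterns.

```python
import re

CORPUS = [
    "Hillary and her husband Bill",
    "Carla and her husband Nicolas",
    "Angelina and her husband Brad",
]

def induce_patterns(corpus, seeds):
    """Turn sentences containing a seed pair into X/Y templates."""
    patterns = set()
    for x, y in seeds:
        for sent in corpus:
            if x in sent and y in sent:
                patterns.add(sent.replace(x, "X").replace(y, "Y"))
    return patterns

def extract(corpus, patterns):
    """Match each template against the corpus to harvest pairs."""
    facts = set()
    for pat in patterns:
        rx = re.escape(pat).replace("X", r"(\w+)", 1).replace("Y", r"(\w+)", 1)
        for sent in corpus:
            m = re.fullmatch(rx, sent)
            if m:
                facts.add((m.group(1), m.group(2)))
    return facts

seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}
patterns = induce_patterns(CORPUS, seeds)
print(extract(CORPUS, patterns) - seeds)  # {('Angelina', 'Brad')}
```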

Page 34: From Unstructured Information to Linked Data

RE: NELL
• Never-Ending Language Learner (http://rtw.ml.cmu.edu/)
• Open IE with an ontological backbone
  • Closed set of categories & typed relations
  • Seeds/counter-seeds (5-10)
  • Open set of predicate arguments (instances)
• Coupled iterative learners
• Constantly running over a large Web corpus since January 2010 (200 million pages)
• Periodic human supervision

athletePlaysForTeam(Athlete, SportsTeam)
athletePlaysForTeam(Alex Rodriguez, Yankees)
athletePlaysForTeam(Alexander_Ovechkin, Penguins)

Page 35: From Unstructured Information to Linked Data

RE: NELL
✓ Conservative strategy avoids semantic drift

Page 36: From Unstructured Information to Linked Data

RE: BOA
• Bootstrapping Linked Data (http://boa.aksw.org)
• Core idea: Use instance data in the Data Web to discover NL patterns and new instances

Page 37: From Unstructured Information to Linked Data

RE: BOA
• Follows a conservative strategy
  • Only top pattern
  • Frequency threshold
  • Score threshold
• Evaluation results

Page 38: From Unstructured Information to Linked Data

RE: Summary
• Several approaches
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
✓ Large number of instances available for many relations
✓ Runtime problem → parallel implementations
✓ Many new facts can be found
× Semantic Drift
× Long tail
× Entity Disambiguation

Page 39: From Unstructured Information to Linked Data

Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion

Page 40: From Unstructured Information to Linked Data

ED: Problem Definition
• Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. position), and a list of relations, find URIs for each of the NEs and relations
• Very difficult problem
  • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)?
  • Difficult even for humans, e.g., Paris' mayor died yesterday
• Several solutions
  • Indexing
  • Surface Forms
  • Graph-based

Page 41: From Unstructured Information to Linked Data

ED: Problem Definition

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn([John Petrucci, PER], [New York, LOC]).
:John_Petrucci dbo:birthPlace :New_York .

Page 42: From Unstructured Information to Linked Data

ED: Indexing
• More retrieval than disambiguation
• Similar to dictionary-based approaches
• Idea
  • Index all labels in the reference knowledge base
  • Given an input label, retrieve all entities with a similar label
× Poor recall (unknown surface form, e.g., "Mme Curie" for "Marie Curie")
× Low precision (Paris = Paris Hilton, Paris (France), …)
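An index-based lookup is essentially a label-to-URI map. The tiny index below is invented; the example makes both failure modes from the slide visible: one label maps to many URIs (low precision) and unknown surface forms return nothing (poor recall).

```python
# Hypothetical label index over a reference knowledge base.
LABEL_INDEX = {
    "Paris": [":Paris_France", ":Paris_Hilton", ":Paris_Ontario"],
    "New York": [":New_York"],
}

def lookup(label):
    """Retrieve all candidate URIs whose label matches the input."""
    return LABEL_INDEX.get(label, [])

print(lookup("Paris"))      # three candidates -> low precision
print(lookup("Mme Curie"))  # [] -> unknown surface form, poor recall
```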

Page 43: From Unstructured Information to Linked Data

ED: Type Disambiguation
• Extension of indexing
  • Index all labels
  • Infer type information
  • Retrieve labels from entities of the given type
• Same recall as the previous approach
• Higher precision
  • Paris[LOC] != Paris[PER]
  • Still, Paris (France) vs. Paris (Ontario)
• Need for context

Page 44: From Unstructured Information to Linked Data

ED: Spotlight
• Known surface forms (http://dbpedia.org/spotlight)
• Based on DBpedia + Wikipedia
• Uses supplementary knowledge including disambiguation pages, redirects, wikilinks
• Three main steps
  • Spotting: Finding possible mentions of DBpedia resources, e.g., John Petrucci was born in New York.
  • Candidate Selection: Find possible URIs, e.g.,
    John Petrucci → :John_Petrucci
    New York → :New_York, :New_York_County, …
  • Disambiguation: Map context to a vector for each resource, e.g., New York → :New_York
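The disambiguation step can be sketched as vector comparison: represent each candidate resource by a context vector and pick the candidate most similar (here, by cosine) to the mention's context. The vectors and candidate set below are toy values, not Spotlight's actual model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Invented context vectors for two candidate URIs of "New York".
CANDIDATES = {
    ":New_York":        (0.9, 0.1, 0.4),
    ":New_York_County": (0.2, 0.8, 0.1),
}
context = (0.8, 0.2, 0.5)  # vector built from the mention's context

best = max(CANDIDATES, key=lambda uri: cosine(CANDIDATES[uri], context))
print(best)  # :New_York
```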

Page 45: From Unstructured Information to Linked Data

ED: YAGO2
• Joint Disambiguation

Mississippi, one of Bob's later songs, was first recorded by Sheryl on her album.

Page 46: From Unstructured Information to Linked Data

ED: YAGO2

[Figure: a graph linking mentions of entities to entity candidates. The mentions Mississippi, Bob, and Sheryl link to candidates Mississippi (State), Mississippi (Song), Bob Dylan Songs, Sheryl Crow, Sheryl Cruz, and Sheryl Lee. Edges carry the scores sim(cxt(m_l), cxt(e_i)), prior(m_l, e_i), and coh(e_i, e_j).]

• Objective: Maximize an objective function (e.g., total weight)
• Constraint: Keep at least one entity per mention
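The scoring behind such joint disambiguation can be sketched as a weighted combination of the three edge scores above, picking the best-scoring candidate per mention. The weights and per-candidate scores below are invented toy values; the real objective is optimized jointly over all mentions, not per mention.

```python
def score(prior, sim, coh, alpha=0.3, beta=0.4, gamma=0.3):
    """Weighted sum of popularity prior, context similarity, and coherence."""
    return alpha * prior + beta * sim + gamma * coh

# Toy (prior, sim, coh) scores for the candidates of the mention "Sheryl".
candidates = {
    ":Sheryl_Crow": (0.6, 0.8, 0.9),
    ":Sheryl_Lee":  (0.3, 0.2, 0.1),
    ":Sheryl_Cruz": (0.1, 0.1, 0.1),
}

best = max(candidates, key=lambda e: score(*candidates[e]))
print(best)  # :Sheryl_Crow
```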

Page 47: From Unstructured Information to Linked Data

ED: FOX
• Generic approach
  • A-priori score (a): Popularity of URIs
  • Similarity score (s): Similarity of resource labels and text
  • Coherence score (z): Correlation between URIs

Page 48: From Unstructured Information to Linked Data

ED: FOX
• Allows the use of several algorithms
  • HITS
  • PageRank
  • Apriori
  • Propagation algorithms
  • …

Page 49: From Unstructured Information to Linked Data

ED: Summary
• Difficult problem even for humans
• Several approaches
  • Simple search
  • Search with restrictions
  • Known surface forms
  • Graph-based
✓ Improved F-Score for DBpedia (70-80%)
× Low F-Score for generic knowledge bases
× Intrinsically difficult
× Still a lot to do

Page 50: From Unstructured Information to Linked Data

Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion

Page 51: From Unstructured Information to Linked Data

Conclusion
• Discussed basics of …
  • the Knowledge Extraction problem
  • Named Entity Recognition
  • Relation Extraction
  • Entity Disambiguation
• Still a lot of research necessary
  • Ensemble and active learning
  • Entity Disambiguation
  • Question Answering …

Page 52: From Unstructured Information to Linked Data

Thank You! Questions?