Post on 12-Jan-2016

FROM UNSTRUCTURED INFORMATION TO LINKED DATA

Axel Ngonga

Head of SIMBA@AKSW

University of Leipzig

IASLOD, August 15/16th 2012

Motivation
• Where does the LOD Cloud come from?
  • Structured data: Triplify, D2R
  • Semi-structured data: DBpedia
  • Unstructured data: ???
• Unstructured data make up 80% of the Web
• How do we extract Linked Data from unstructured data sources?

Overview

1. Problem Definition

2. Named Entity Recognition
   • Algorithms
   • Ensemble Learning

3. Relation Extraction
   • General approaches
   • OpenIE approaches

4. Entity Disambiguation
   • URI Lookup
   • Disambiguation

5. Conclusion

NB: Will be mainly concerned with the newest developments.

Problem Definition
• Simple(?) problem: given a text fragment, automatically
  • retrieve all entities,
  • retrieve the relations between these entities, and
  • "ground" them in an ontology
• Also known as Knowledge Extraction

Example: John Petrucci was born in New York.
  Entities: :John_Petrucci, :New_York
  Relation: dbo:birthPlace
  Triple: :John_Petrucci dbo:birthPlace :New_York .

Problems
1. Finding entities → Named Entity Recognition
2. Finding relation instances → Relation Extraction
3. Finding URIs → URI Disambiguation

Named Entity Recognition
• Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].

• Common sets of classes
  • CoNLL03: Person, Location, Organization, Miscellaneous
  • ACE05: Facility, Geo-Political Entity, Location, Organisation, Person, Vehicle, Weapon
  • BioNLP2004: Protein, DNA, RNA, cell line, cell type
• Several approaches
  • Direct solutions (single algorithms)
  • Ensemble Learning

NER: Overview of approaches
• Dictionary-based
• Hand-crafted Rules
• Machine Learning
  • Hidden Markov Models (HMMs)
  • Conditional Random Fields (CRFs)
  • Neural Networks
  • k Nearest Neighbors (kNN)
  • Graph Clustering
• Ensemble Learning
  • Vote-Based (Bagging, Boosting)
  • Neural Networks

NER: Dictionary-based
• Simple Idea
  1. Define mappings between words and classes, e.g., Paris → Location
  2. Try to match each token from each sentence
  3. Return the matching entities
✓ Time-efficient at runtime
× Manual creation of gazetteers
× Low precision (Paris = Person, Location)
× Low recall (esp. on Persons and Organizations as the number of instances grows)
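The three steps above can be sketched as a longest-match gazetteer lookup. This is only a minimal illustration: the gazetteer entries, the function name `dictionary_ner`, and the whitespace tokenisation are assumptions, not part of any tool named in these slides.

```python
# Hypothetical toy gazetteer: lower-cased surface form -> candidate classes.
GAZETTEER = {
    "john petrucci": ["PER"],
    "new york": ["LOC"],
    "paris": ["LOC", "PER"],  # ambiguous entry: the source of low precision
}

def dictionary_ner(sentence, gazetteer=GAZETTEER, max_len=3):
    """Return (surface form, classes) for the longest gazetteer matches, left to right."""
    tokens = sentence.rstrip(".").split()
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest window first so "New York" wins over "New".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in gazetteer:
                matches.append((candidate, gazetteer[candidate.lower()]))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return matches
```

Calling `dictionary_ner("John Petrucci was born in New York.")` finds both entities, while any name missing from the gazetteer is silently skipped, which is the recall problem listed above.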

NER: Rule-based
• Simple Idea
  1. Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION].
  2. Try to match each sentence to one or several rules
  3. Return the matching entities
✓ High precision
× Manual creation of rules is very tedious
× Low recall (finite number of patterns)
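As a rough sketch of the rule-based idea, the single rule from the slide can be written as a regular expression. Treating capitalised token sequences as entity candidates is an assumption made here for brevity; the names `NAME`, `BORN_IN`, and `rule_based_ner` are hypothetical.

```python
import re

# Capitalised token sequences stand in for entity candidates (an assumption).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
# The one hand-written rule: [PERSON] was born in [LOCATION].
BORN_IN = re.compile(rf"(?P<person>{NAME}) was born in (?P<location>{NAME})")

def rule_based_ner(sentence):
    """Return (surface form, class) pairs for every match of the single rule."""
    entities = []
    for match in BORN_IN.finditer(sentence):
        entities.append((match.group("person"), "PER"))
        entities.append((match.group("location"), "LOC"))
    return entities
```

A paraphrase such as "New York is the birthplace of John Petrucci." matches no rule and yields nothing, which illustrates why a finite pattern set gives low recall.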

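NER: Markov Models
• Stochastic process such that (Markov Property)
  $P(X_{t+1} = S_j \mid X_t = S_i, X_{t-1}, \dots, X_1) = P(X_{t+1} = S_j \mid X_t = S_i)$
• Equivalent to finite-state machine
• Formally consists of
  • Set $S$ of states $S_1, \dots, S_n$
  • Matrix $M$ such that $m_{ij} = P(X_{t+1} = S_j \mid X_t = S_i)$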
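To make the transition matrix concrete, the entries m_ij can be estimated by counting state bigrams in labelled sequences. The sketch below assumes plain maximum-likelihood counts without smoothing; `estimate_transitions` is a hypothetical helper name.

```python
from collections import Counter

def estimate_transitions(sequences, states):
    """Estimate m_ij = P(X_{t+1} = S_j | X_t = S_i) from observed state sequences."""
    counts = Counter()
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[(current, nxt)] += 1
    matrix = {}
    for si in states:
        total = sum(counts[(si, sj)] for sj in states)
        # Rows without any observed transition stay at zero.
        matrix[si] = {sj: (counts[(si, sj)] / total if total else 0.0) for sj in states}
    return matrix
```

Each row with at least one observation sums to 1, as required for a transition matrix.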

NER: Hidden Markov Models
• Extension of Markov Models
  • States are hidden and assigned an output function
  • Only output is seen
  • Transitions are learned from training data
• How do they work?
  • Input: Discrete sequence of features (e.g., POS tags, word stems, etc.)
  • Goal: Find the best sequence of states that represents the input
  • Output: Hopefully the right classification of each token

[Figure: HMM with hidden states S0, S1, …, Sn and the outputs PER, _, LOC]
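Finding the best state sequence is commonly done with the Viterbi algorithm. The sketch below uses hand-set, purely illustrative probabilities and a single shape feature per token ("Cap" for capitalised, "low" for lower-case); in practice these parameters would be learned from training data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy parameters, chosen by hand for illustration only.
STATES = ["PER", "_", "LOC"]
START_P = {"PER": 0.6, "_": 0.3, "LOC": 0.1}
TRANS_P = {
    "PER": {"PER": 0.5, "_": 0.5, "LOC": 0.0},
    "_": {"PER": 0.1, "_": 0.7, "LOC": 0.2},
    "LOC": {"PER": 0.0, "_": 0.5, "LOC": 0.5},
}
EMIT_P = {
    "PER": {"Cap": 1.0},
    "_": {"low": 0.9, "Cap": 0.1},
    "LOC": {"Cap": 1.0},
}
# Shape features for "John Petrucci was born in New York".
OBSERVATIONS = ["Cap", "Cap", "low", "low", "low", "Cap", "Cap"]
```

With these toy values, `viterbi(OBSERVATIONS, STATES, START_P, TRANS_P, EMIT_P)` labels the first two tokens PER and the last two LOC, because the start and transition probabilities break the tie that the shape feature alone leaves open.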

NER: k Nearest Neighbors
• Idea
  • Describe each token q from a labelled training data set with a set of features (e.g., left and right neighbors)
  • Each new token t is described with the same features
  • Assign t the majority class of its k nearest neighbors
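A minimal sketch of this idea follows, assuming three features per token (left neighbor, right neighbor, capitalisation) and feature overlap as the similarity measure. The feature choice, the two training sentences, and all function names are assumptions for illustration.

```python
from collections import Counter

def features(tokens, i):
    """Describe token i by its left neighbor, right neighbor and capitalisation."""
    left = tokens[i - 1].lower() if i > 0 else "<s>"
    right = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return {("left", left), ("right", right), ("cap", tokens[i][0].isupper())}

def build_training(sentences):
    """sentences: list of (tokens, labels). Returns a list of (feature set, label)."""
    train = []
    for tokens, labels in sentences:
        train.extend((features(tokens, i), labels[i]) for i in range(len(tokens)))
    return train

def knn_classify(train, query, k=3):
    """Majority label among the k training tokens sharing the most features with query."""
    ranked = sorted(train, key=lambda example: len(example[0] & query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled training data.
SENTENCES = [
    (["Marie", "Curie", "was", "born", "in", "Warsaw"], ["PER", "PER", "_", "_", "_", "LOC"]),
    (["Albert", "Einstein", "was", "born", "in", "Ulm"], ["PER", "PER", "_", "_", "_", "LOC"]),
]
TRAIN = build_training(SENTENCES)
```

A new token is then classified with, for example, `knn_classify(TRAIN, features(tokens, i))`.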

NER: So far …
• "Simple approaches"
  • Apply one algorithm to the NER problem
  • Bound to be limited by assumptions of model
• Implemented by a large number of tools
  • Alchemy
  • Stanford NER
  • Illinois Tagger
  • Ontos NER Tagger
  • LingPipe
  • …

NER: Ensemble Learning
• Intuition: Each algorithm has its strengths and weaknesses
• Idea: Use ensemble learning to merge results of different algorithms so as to create a meta-classifier of higher accuracy

[Figure: Dictionary-based approaches, pattern-based approaches, Conditional Random Fields, Support Vector Machines]

NER: Ensemble Learning
• Idea: Merge the results of several approaches to improve results
• Simplest approaches:
  • Voting
  • Weighted voting

[Figure: Input → System 1, System 2, …, System n → Merger → Output]
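The merger for voting and weighted voting can be sketched in a few lines. The function name `weighted_vote` and the example weights are assumptions; with equal weights the function reduces to plain majority voting.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """predictions: one label sequence per system. Returns the merged label sequence."""
    weights = weights or [1.0] * len(predictions)
    merged = []
    for labels in zip(*predictions):  # labels proposed for one token by all systems
        scores = defaultdict(float)
        for label, weight in zip(labels, weights):
            scores[label] += weight
        merged.append(max(scores, key=scores.get))
    return merged
```

Weights would typically reflect how much each system is trusted, for instance its accuracy on held-out data.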

NER: Ensemble Learning
• When does it work?
• Accuracy
  • Need for existing solutions to be "good"
  • Merging random results leads to random results
  • Given: current approaches reach 80% F-Score
• Diversity
  • Need for smallest possible amount of correlation between approaches
  • E.g., merging two HMM-based taggers won't help
  • Given: large number of approaches for NER

NER: FOX
• Federated Knowledge Extraction Framework
• Idea: Apply ensemble learning to NER
• Classical approach: Voting
  • Does not make use of systematic error
  • Partly difficult to train
• Use neural networks instead
  • Can make use of systematic error
  • Easy to train
  • Converge fast
• http://fox.aksw.org
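To illustrate what "making use of systematic error" can mean, the sketch below trains a single-layer multi-class perceptron on the outputs of two hypothetical taggers. This is not the FOX implementation, whose network architecture is not described in these slides; the tool behaviour, training examples, and function names are all assumptions.

```python
from collections import defaultdict

def merge(weights, outputs, classes):
    """Pick the class with the highest summed weight given all tool outputs."""
    def score(cls):
        return sum(weights[(tool, out, cls)] for tool, out in enumerate(outputs))
    return max(classes, key=score)

def train_merger(examples, classes, epochs=10):
    """examples: list of (tuple of tool outputs, gold label). Returns perceptron weights."""
    weights = defaultdict(float)  # key: (tool index, tool output, class)
    for _ in range(epochs):
        for outputs, gold in examples:
            predicted = merge(weights, outputs, classes)
            if predicted != gold:  # standard multi-class perceptron update
                for tool, out in enumerate(outputs):
                    weights[(tool, out, gold)] += 1.0
                    weights[(tool, out, predicted)] -= 1.0
    return weights

# Hypothetical scenario: tool 0 always mislabels locations as ORG,
# tool 1 always misses them. Neither ever outputs LOC.
CLASSES = ["_", "PER", "LOC"]
EXAMPLES = [
    (("PER", "PER"), "PER"),
    (("ORG", "_"), "LOC"),
    (("_", "_"), "_"),
]
WEIGHTS = train_merger(EXAMPLES, CLASSES)
```

In this toy setup no vote over the two tools can ever return LOC, whereas the trained merger learns that the combination (ORG, _) signals a location.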

NER: FOX
[Figure slides: FOX; FOX on MUC7; FOX on Website Data; FOX on Companies and Countries]
✓ No runtime issues (parallel implementation)
✓ NN overhead is small
× Overfitting

NER: Summary
• Large number of approaches
  • Dictionaries
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
  • …
• Combining approaches leads to better results than single algorithms

RE: Problem Definition
• Find the relations between NEs if such relations exist.
• NEs are not always given a priori (open vs. closed RE)

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn ([John Petrucci, PER], [New York, LOC]).

RE: Approaches
• Hand-crafted rules
• Pattern Learning
• Coupled Learning

RE: Pattern-based
• Hearst patterns [Hearst: COLING'92]
  • POS-enhanced regular expression matching in natural-language text

NP0 {,} such as {NP1, NP2, … (and|or) }{,} NPn
NP0 {,}{NP1, NP2, … NPn-1}{,} or other NPn

"The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string."
isA("Bambara ndang", "bow lute")

✓ Time-efficient at runtime
× Very low recall
× Not adaptable to other relations
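A rough approximation of the first pattern can be written without a POS tagger. The sketch assumes that the hypernym noun phrase opens the sentence and that only the first hyponym is needed; real Hearst-pattern matchers rely on noun-phrase chunking instead of the crude character classes used here. `HEARST_SUCH_AS` and `extract_isa` are hypothetical names.

```python
import re

# "NP0, such as NP1": an optional article is skipped before each noun phrase.
HEARST_SUCH_AS = re.compile(
    r"(?:\b(?:[Tt]he|[Aa]n?) )?(?P<hypernym>[\w ]+?),? such as "
    r"(?:(?:the|an?) )?(?P<hyponym>[^,.]+)"
)

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    return [
        (match.group("hyponym").strip(), match.group("hypernym").strip())
        for match in HEARST_SUCH_AS.finditer(sentence)
    ]
```

Applied to the bow lute sentence above, the function yields the pair behind isA("Bambara ndang", "bow lute"); sentences that express the same fact without "such as" produce nothing.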

RE: DIPRE
• DIPRE = Dual Iterative Pattern Relation Extraction
• Semi-supervised, iterative gathering of facts and patterns
• Positive & negative examples as seeds for a given target relation
  • e.g. +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google)
• Various tuning parameters for pruning low-confidence patterns and facts
• Extended to SnowBall / QXtract

[Figure: bootstrapping loop between facts and patterns]
Seeds: (Hillary, Bill), (Carla, Nicolas)
Patterns: X and her husband Y; X and Y on their honeymoon; X and Y and their children; X has been dating with Y; X loves Y
Facts: (Angelina, Brad), (Hillary, Bill), (Victoria, David), (Carla, Nicolas), (Larry, Google), …
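The bootstrapping loop can be sketched as follows. This is a strong simplification under stated assumptions: a pattern is only the text between X and Y, arguments are single words, and a pattern is pruned as soon as it reproduces a negative seed. The corpus and all function names are hypothetical.

```python
import re

def learn_patterns(corpus, pairs):
    """Collect the text between X and Y for every sentence mentioning a known pair."""
    patterns = set()
    for sentence in corpus:
        for x, y in pairs:
            match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Return all (X, Y) pairs that any pattern extracts from the corpus."""
    pairs = set()
    for middle in patterns:
        regex = re.compile(r"(\w+)" + re.escape(middle) + r"(\w+)")
        for sentence in corpus:
            pairs.update(regex.findall(sentence))
    return pairs

def dipre(corpus, seeds, negatives=frozenset(), iterations=2):
    """Alternate between learning patterns from pairs and pairs from patterns."""
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = learn_patterns(corpus, pairs)
        # Prune every pattern that also extracts a negative seed.
        patterns = {p for p in patterns if not apply_patterns(corpus, {p}) & set(negatives)}
        pairs |= apply_patterns(corpus, patterns)
    return pairs

# Hypothetical toy corpus.
CORPUS = [
    "Hillary and her husband Bill gave a speech.",
    "Carla loves Nicolas.",
    "Angelina and her husband Brad arrived.",
    "Victoria loves David.",
    "Carla met Nicolas in Paris.",
    "Larry met Google investors.",
    "Tom met Jerry.",
]
SEEDS = {("Hillary", "Bill"), ("Carla", "Nicolas")}
NEGATIVES = {("Larry", "Google")}
```

Without the negative seed, the overly general pattern " met " survives and pulls in unrelated pairs, a small-scale version of semantic drift.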

RE: NELL
• Never-Ending Language Learner (http://rtw.ml.cmu.edu/)
• Open IE with ontological backbone
  • Closed set of categories & typed relations
  • Seeds/counter seeds (5-10)
  • Open set of predicate arguments (instances)
• Coupled iterative learners
• Constantly running over a large Web corpus since January 2010 (200 million pages)
• Periodic human supervision

athletePlaysForTeam(Athlete, SportsTeam)
athletePlaysForTeam(Alex Rodriguez, Yankees)
athletePlaysForTeam(Alexander_Ovechkin, Penguins)

RE: NELL
• Conservative strategy → Avoid semantic drift

RE: BOA
• Bootstrapping Linked Data (http://boa.aksw.org)
• Core idea: Use instance data in the Data Web to discover NL patterns and new instances

RE: BOA
• Follows conservative strategy
  • Only top pattern
  • Frequency threshold
  • Score threshold
• Evaluation results

RE: Summary
• Several approaches
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
✓ Large number of instances available for many relations
✓ Runtime problem → parallel implementations
✓ Many new facts can be found
× Semantic drift
× Long tail
× Entity disambiguation

ED: Problem Definition
• Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. position), and a list of relations, find URIs for each of the NEs and relations
• Very difficult problem
  • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)?
  • Difficult even for humans, e.g., "Paris' mayor died yesterday"
• Several solutions
  • Indexing
  • Surface Form
  • Graph-based

ED: Problem Definition

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn ([John Petrucci, PER], [New York, LOC]).
:John_Petrucci dbo:birthPlace :New_York .

ED: Indexing
• More retrieval than disambiguation
• Similar to dictionary-based approaches
• Idea
  • Index all labels in reference knowledge base
  • Given an input label, retrieve all entities with a similar label
× Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie")
× Low precision (Paris = Paris Hilton, Paris (France), …)

ED: Type Disambiguation
• Extension of indexing
  • Index all labels
  • Infer type information
  • Retrieve labels from entities of the given type
• Same recall as previous approach
• Higher precision
  • Paris[LOC] != Paris[PER]
  • Still, Paris (France) vs. Paris (Ontario)
• Need for context
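Label indexing and the type restriction can be sketched together with a small inverted index. The knowledge base below is a hypothetical stand-in with made-up URIs, and "similar label" is simplified to "label contains every token of the surface form".

```python
from collections import defaultdict

# Hypothetical mini knowledge base: URI -> (label, type).
KB = {
    ":Paris": ("Paris", "LOC"),
    ":Paris_Hilton": ("Paris Hilton", "PER"),
    ":Paris,_Ontario": ("Paris", "LOC"),
    ":Marie_Curie": ("Marie Curie", "PER"),
}

def build_index(kb):
    """Map every lower-cased label token to the URIs whose label contains it."""
    index = defaultdict(set)
    for uri, (label, _type) in kb.items():
        for token in label.lower().split():
            index[token].add(uri)
    return index

def lookup(index, kb, surface_form, entity_type=None):
    """Return URIs whose label contains all tokens, optionally restricted by type."""
    token_sets = [index.get(token, set()) for token in surface_form.lower().split()]
    candidates = set.intersection(*token_sets) if token_sets else set()
    if entity_type is not None:
        candidates = {uri for uri in candidates if kb[uri][1] == entity_type}
    return candidates

INDEX = build_index(KB)
```

The type filter removes Paris Hilton for a LOC mention, but both cities named Paris remain, which is where context becomes necessary. An unknown surface form such as "Mme Curie" returns nothing.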

ED: Spotlight
• Known surface forms (http://dbpedia.org/spotlight)
  • Based on DBpedia + Wikipedia
  • Uses supplementary knowledge including disambiguation pages, redirects, wikilinks
• Three main steps
  • Spotting: Finding possible mentions of DBpedia resources, e.g.,
    John Petrucci was born in New York.
  • Candidate Selection: Find possible URIs, e.g.,
    John Petrucci → :John_Petrucci
    New York → :New_York, :New_York_County, …
  • Disambiguation: Map context to vector for each resource
    New York → :New_York
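The disambiguation step can be illustrated with bag-of-words vectors and cosine similarity. This is only a sketch of the general vector-space idea: the candidate descriptions are invented, and plain term counts are an assumption rather than the weighting Spotlight actually uses.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector with raw term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(context, candidates):
    """candidates: URI -> descriptive text. Return the URI closest to the context."""
    ctx = vectorize(context)
    return max(candidates, key=lambda uri: cosine(ctx, vectorize(candidates[uri])))

# Hypothetical context descriptions for the two candidate resources.
CANDIDATES = {
    ":New_York": "city in the united states where many musicians were born",
    ":New_York_County": "county of the state coextensive with manhattan borough",
}
```

The mention's surrounding sentence serves as the query vector, and the candidate whose description shares the most vocabulary with it is selected.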

ED: YAGO2
• Joint Disambiguation

Mississippi, one of Bob's later songs, was first recorded by Sheryl on her album.

[Figure: graph linking mentions of entities to entity candidates]
Entity candidates: Mississippi (State), Mississippi (Song), Bob Dylan Songs, Sheryl Cruz, Sheryl Lee, Sheryl Crow
Edge weights:
  • Mention to entity: prior(ml, ei), sim(cxt(ml), cxt(ei))
  • Entity to entity: coh(ei, ej)
Objective: Maximize objective function (e.g., total weight)
Constraint: Keep at least one entity per mention
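For a handful of mentions, the objective can be illustrated by brute force: try every assignment of one candidate per mention and keep the one with the highest total weight. All scores below are invented, the prior and similarity terms are folded into one local score, and "Bob Dylan" as the only candidate for "Bob" is an assumption. Exhaustive search grows exponentially with the number of mentions, so it only serves as an illustration.

```python
from itertools import product

def joint_disambiguation(candidates, local_score, coherence):
    """Pick one entity per mention maximising local scores plus pairwise coherence."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(local_score[(m, e)] for m, e in zip(mentions, assignment))
        score += sum(
            coherence.get(frozenset((a, b)), 0.0)
            for i, a in enumerate(assignment)
            for b in assignment[i + 1:]
        )
        if score > best_score:
            best, best_score = dict(zip(mentions, assignment)), score
    return best

# Hypothetical candidates and weights for the example sentence.
CANDIDATES = {
    "Mississippi": ["Mississippi (State)", "Mississippi (Song)"],
    "Bob": ["Bob Dylan"],
    "Sheryl": ["Sheryl Cruz", "Sheryl Lee", "Sheryl Crow"],
}
LOCAL_SCORE = {
    ("Mississippi", "Mississippi (State)"): 0.6,
    ("Mississippi", "Mississippi (Song)"): 0.3,
    ("Bob", "Bob Dylan"): 0.9,
    ("Sheryl", "Sheryl Cruz"): 0.2,
    ("Sheryl", "Sheryl Lee"): 0.3,
    ("Sheryl", "Sheryl Crow"): 0.4,
}
COHERENCE = {
    frozenset(("Mississippi (Song)", "Bob Dylan")): 0.8,
    frozenset(("Mississippi (Song)", "Sheryl Crow")): 0.5,
    frozenset(("Bob Dylan", "Sheryl Crow")): 0.4,
}
```

Although the state has the higher local score for "Mississippi", the coherence between the song, Bob Dylan, and Sheryl Crow makes the joint assignment prefer the song.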

ED: FOX
• Generic Approach
  • A-priori score (a): Popularity of URIs
  • Similarity score (s): Similarity of resource labels and text
  • Coherence score (z): Correlation between URIs

[Formulas combining the scores a, s, and z; not recoverable from the transcript]

ED: FOX
• Allows the use of several algorithms
  • HITS
  • PageRank
  • Apriori
  • Propagation Algorithms
  • …

ED: Summary
• Difficult problem even for humans
• Several approaches
  • Simple search
  • Search with restrictions
  • Known surface forms
  • Graph-based
✓ Improved F-Score for DBpedia (70-80%)
× Low F-Score for generic knowledge bases
× Intrinsically difficult
× Still a lot to do

Conclusion
• Discussed basics of …
  • Knowledge Extraction problem
  • Named Entity Recognition
  • Relation Extraction
  • Entity Disambiguation
• Still a lot of research necessary
  • Ensemble and active learning
  • Entity Disambiguation
  • Question Answering …

Thank You!

Questions?