NLP in Web Data Extraction (Omer Gunes)


Transcript of NLP in Web Data Extraction (Omer Gunes)

Page 1: NLP in Web Data Extraction (Omer Gunes)

Reading Course

Omer Gunes, Somerville College
Supervisor: Professor Georg Gottlob

Article 1: “Ontology Based Information Extraction”
Article 2: “YAGO: Yet Another Great Ontology”
Article 3: “SOFIE: A Self-Organizing Framework for Information Extraction”

D.Phil. in Computer Science
Computing Laboratory, University of Oxford

Page 2: NLP in Web Data Extraction (Omer Gunes)

Definitions – Information Extraction

Russell and Norvig:

● Information Extraction

● is the automatic retrieval of certain types of information from natural language (NL) texts

● aims

– processing natural language texts

– retrieving occurrences of

● a particular class of objects

● relationships between these objects

● lies between

– Information retrieval systems

– Text understanding systems

Riloff:

● Information Extraction is a form of NLP in which certain types of information must be

● recognized

● extracted from text

Page 3: NLP in Web Data Extraction (Omer Gunes)

Definitions – OBIE: Ontology Based Information Extraction

● OBIE has recently emerged as a subfield of IE.

● An OBIE system

● Processes

– Unstructured

– Semi-structured

NL text through a mechanism guided by ontologies

● Extracts certain types of information

● Presents the output using ontologies

● An OBIE system is a set of information extractors each extracting

● Individuals for a class

● Property values for a property

Page 4: NLP in Web Data Extraction (Omer Gunes)

OBIE – Key Characteristics of OBIE Systems

● Process unstructured natural language text

e.g. text files

● Process semi-structured natural language text

e.g. web pages using a particular template such as Wikipedia pages

● Present the output using ontologies

● The use of a formal ontology as one of the system inputs and the target output

● Constructing the ontology to be used through the IE process

● Use an IE process guided by an ontology

● The IE process is guided by the ontology to extract items such as classes, properties and instances

Page 5: NLP in Web Data Extraction (Omer Gunes)

OBIE – Potential of OBIE Systems

● Automatically processing the information contained in natural language text

● A large fraction of the information contained in the WWW takes the form of natural language text

● Around 80% of the data of a typical corporation is in natural language

● Creating semantic content for the Semantic Web

● The success of the Semantic Web relies heavily on the existence of semantic content that can be processed by software agents

The creation of such content is quite slow.

● Automatic metadata creation would be the snowball to unleash an avalanche of metadata through the web.

Page 6: NLP in Web Data Extraction (Omer Gunes)

OBIE – Common Architecture of an OBIE System

[Architecture diagram: Text Input → Preprocessor → Information Extraction Module → Extracted Information → Knowledge Base / Database → Query Answering System → User. The Information Extraction Module is guided by an Ontology (maintained through an Ontology Editor or produced by an Ontology Generator) and a Semantic Lexicon.]

Page 7: NLP in Web Data Extraction (Omer Gunes)

Classification of OBIE Systems Information Extraction Methods

● Linguistic Rules

● They are represented by regular expressions.

e.g. (watched|seen) <NP>

● Regular expressions are combined with the elements of ontologies such as classes and properties.
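The rule above can be made concrete as a regular expression over chunked text. A minimal sketch, assuming a chunker has already marked noun phrases inline as `[NP ...]` (the marking scheme and function name are illustrative, not any particular system's API):

```python
import re

# Linguistic extraction rule: (watched|seen) followed by a noun phrase.
# We assume the chunker has bracketed noun phrases as [NP ...].
RULE = re.compile(r"\b(?:watched|seen)\s+\[NP ([^\]]+)\]")

def extract_watched(text):
    """Return the noun phrases governed by 'watched' or 'seen'."""
    return RULE.findall(text)

print(extract_watched("I have seen [NP the new Bond film] twice."))
# → ['the new Bond film']
```

In an OBIE setting, the extracted phrase would then be linked to an ontology class or property (e.g. instances of a Film class).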

● Gazetteer Lists

● Relies on finite-state automata

● Recognizes individual words or phrases instead of patterns

● Consists of a list of words or phrases that identify individual entities of a particular category

e.g. the states of the US, the countries of the world

Two conditions to construct a gazetteer list:

● Specify exactly what is being identified by the gazetteer

● Specify where the information for the gazetteer list was obtained from
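A gazetteer lookup can be sketched in a few lines. The state list and label below are an illustrative subset; real systems compile the list into a finite-state automaton rather than scanning spans as done here:

```python
# A toy gazetteer: a hand-specified (partial) list of US states.
US_STATES = {"Texas", "Ohio", "New York"}

def tag_gazetteer(tokens, gazetteer, label):
    """Greedily match multi-word gazetteer entries against a token list."""
    matches = []
    i = 0
    while i < len(tokens):
        # try the longest candidate span first
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in gazetteer:
                matches.append((phrase, label))
                i = j
                break
        else:
            i += 1
    return matches

print(tag_gazetteer("He moved from New York to Texas".split(),
                    US_STATES, "US_STATE"))
# → [('New York', 'US_STATE'), ('Texas', 'US_STATE')]
```

Note how the longest-match loop lets multi-word entries like "New York" win over any shorter overlapping entry.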

Page 8: NLP in Web Data Extraction (Omer Gunes)

Classification of OBIE Systems Information Extraction Methods

● Supervised classification techniques for Information Extraction

● Support Vector Machines

● Maximum Entropy Models

● Decision Trees

● Hidden Markov Models

● Conditional Random Fields

● Classifiers are trained to identify different components of an ontology such as instances and property values

e.g. Kylin, a tool developed at the University of Washington, uses

● a maximum-entropy model to predict which attributes are present in a sentence

● a CRF model to extract the attribute values within a sentence

Page 9: NLP in Web Data Extraction (Omer Gunes)

Classification of OBIE Systems Analyzing HTML/XML tags

● Some OBIE systems use HTML or XML pages as input

They can extract certain types of information from tables present in html pages.

e.g. The SOBA system, developed at the Karlsruhe Institute of Technology, extracts information from HTML tables into a knowledge base that uses F-Logic.
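Table harvesting of this kind can be sketched with the standard-library HTML parser; this is only an illustration of the idea, not SOBA's actual implementation:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell texts of every <tr> as one row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

p = TableExtractor()
p.feed("<table><tr><th>Team</th><th>Goals</th></tr>"
       "<tr><td>Germany</td><td>2</td></tr></table>")
print(p.rows)  # → [['Team', 'Goals'], ['Germany', '2']]
```

The harvested rows would then be mapped to ontology properties (here, a Team entity with a goals-scored value) before being asserted into the knowledge base.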

Page 10: NLP in Web Data Extraction (Omer Gunes)

Classification of OBIE Systems Ontology Construction and Update

Two approaches:

● Considering the ontology as an input to the system.

Ontology can be constructed manually.

An “off-the-shelf” ontology can also be used.

Systems adopting this approach:

● SOBA

● KIM

● PANKOW

● The paradigm of open information extraction, in which an ontology is constructed as part of the OBIE process.

Systems adopting this approach:

● Text-to-Onto

● Kylin

Page 11: NLP in Web Data Extraction (Omer Gunes)

Classification of OBIE Systems Summary of the classification of OBIE Systems

Page 12: NLP in Web Data Extraction (Omer Gunes)

Classification of OBIE Systems Tools used

Shallow NLP tools

● GATE

● SproUT

● Stanford NLP Group

● Center For Intelligent Information Retrieval (CIIR) at the Univ. of Massachusetts

● Saarbrücken Message Extracting System

Semantic lexicons

● WordNet

● GermaNet for German

● Hindi WordNet for Hindi

Ontology Editors

● Protege

● OntoEdit

Page 13: NLP in Web Data Extraction (Omer Gunes)

YAGO Introduction

● A lightweight & extensible ontology with high coverage & quality.

● 1 million entities

● 5 million facts, automatically extracted from Wikipedia and unified with WordNet

● A carefully designed combination of heuristic & rule-based methods

● Contributions:

A major step beyond WordNet both in quality and in quantity

Empirical evaluation of fact correctness is 95%.

● It is based on a clean model which is decidable, extensible and compatible with RDFS.

Page 14: NLP in Web Data Extraction (Omer Gunes)

YAGO Background & Motivation

The need for

a huge ontology with knowledge from several sources

an ontology of high quality with accuracy close to 100 percent

an ontology that comprises not only concepts in the style of WordNet but also named entities such as

● People

● Organizations

● Books

● Songs

● Products

An ontology has to be

● Extensible

● Easily re-usable

● Application independent

Page 15: NLP in Web Data Extraction (Omer Gunes)

YAGO Yet Another Great Ontology

● YAGO is based on Wikipedia's category pages.

Drawback: Wikipedia's category system provides hardly any taxonomic hierarchy.

In contrast: WordNet provides a clean & carefully assembled hierarchy of thousands of concepts.

Problem: there is no mapping between Wikipedia and WordNet concepts.

Proposal: new techniques to link them with near-perfect accuracy.

● Contribution : Accomplishing the unification between WordNet and facts from Wikipedia with an accuracy of 97%.

A data model based on entities & binary relations.

● General properties of relations and relations between relations can also be expressed.

● It is designed to be extendable by other resources

● Other high quality resources

● Domain specific extensions

● Data gathered through Information Extraction

Page 16: NLP in Web Data Extraction (Omer Gunes)

YAGO Model

The state-of-the art formalism in knowledge representation

● OWL Full: it can express properties of relations, but it is undecidable.

● OWL Lite & OWL DL: decidable, but they cannot express relations between facts.

● RDFS: it can express relations between facts, but provides only very primitive semantics (e.g. it does not know transitivity).

● YAGO Model : It is an extension of RDFS

All objects are represented as entities.

e.g. AlbertEinstein

Two entities can stand in a relation.

e.g. AlbertEinstein HASWONPRIZE NobelPrize

Numbers, dates, strings and other literals are represented as entities as well.

e.g. AlbertEinstein BORNINYEAR 1879

Page 17: NLP in Web Data Extraction (Omer Gunes)

YAGO Model

● Words are entities as well.

e.g. “Einstein” MEANS AlbertEinstein

“Einstein” MEANS AlfredEinstein

● Similar entities are grouped into classes.

Each entity is an instance of at least one class. This is expressed by the TYPE relation.

e.g. AlbertEinstein TYPE physicist

● Classes are also entities

Each class is an instance of a class called class

Classes are arranged in a taxonomic hierarchy, expressed by the subClassOf relation.

e.g. physicist subClassOf scientist
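Because subClassOf is transitive, instance checks can walk up the hierarchy. A toy sketch (the taxonomy and dictionaries below are invented for illustration; YAGO itself stores millions of such edges):

```python
# Toy TYPE and subClassOf tables in the spirit of the YAGO examples.
SUBCLASS_OF = {"physicist": "scientist", "scientist": "person"}
TYPE = {"AlbertEinstein": "physicist"}

def is_instance_of(entity, cls):
    """Walk the subClassOf chain upward from the entity's direct class."""
    c = TYPE.get(entity)
    while c is not None:
        if c == cls:
            return True
        c = SUBCLASS_OF.get(c)
    return False

print(is_instance_of("AlbertEinstein", "person"))  # → True
```

This is exactly the semantics that plain RDFS lacks: without knowing that subClassOf is transitive, a system could not infer AlbertEinstein TYPE person from the two stored facts.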

Page 18: NLP in Web Data Extraction (Omer Gunes)

YAGO Model

● Relations are entities as well.

The properties of relations can be represented within the model.

e.g. subClassOf TYPE transitiveRelation

● A fact is a triple of

● an entity

● a relation

● an entity

● The two entities are called the arguments of the fact.

Each fact is given a fact identifier which is also an entity.

Assumption : #1 is the identifier of the fact (AlbertEinstein, BORNINYEAR, 1879)

then #1 FOUNDIN http://wikipedia.org/Einstein

● Common entities are entities which are neither facts nor relations.

Common entities that are not classes are called individuals.
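Fact identifiers (reification) can be sketched as follows: each fact is stored under a fresh identifier, and that identifier may itself appear as an argument of another fact. The helper and identifier scheme are illustrative:

```python
# A minimal reified fact store: fact id → (subject, relation, object).
facts = {}

def add_fact(subj, rel, obj):
    """Store a fact and return its fresh identifier."""
    fid = "#%d" % (len(facts) + 1)
    facts[fid] = (subj, rel, obj)
    return fid

f1 = add_fact("AlbertEinstein", "BORNINYEAR", "1879")
# The identifier f1 is an entity too, so it can be an argument:
add_fact(f1, "FOUNDIN", "http://wikipedia.org/Einstein")
print(facts)  # fact '#2' records where fact '#1' was found
```

This is how provenance (FOUNDIN) or other meta-facts about facts are expressed without leaving the triple model.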

Page 19: NLP in Web Data Extraction (Omer Gunes)

YAGO Ontology

● C : a finite set of common entities

R : a finite set of relation names

I : a finite set of fact identifiers

A YAGO ontology y is an injective & total function:

● The set of relation names R for any YAGO ontology must contain at least :

● Type

● subClassOf

● Domain

● Range

● subRelationOf

y : I → (I ∪ C ∪ R) × R × (I ∪ C ∪ R)

Page 20: NLP in Web Data Extraction (Omer Gunes)

YAGO Ontology

The set of class names C for any YAGO ontology must contain at least :

● Entity

● Class

● Relation

● acyclicTransitiveRelation

● Classes for all literals

Page 21: NLP in Web Data Extraction (Omer Gunes)

YAGO Knowledge Extraction – The TYPE Relation

● Each Wiki page title is a candidate to become an individual in YAGO

● The Wikipedia Category System :

Different types of categories

● Conceptual categories

e.g. Naturalized citizens of the United States

● Administrative categories

e.g. Articles with unsourced statements

● Relational categories

e.g. 1879 births

● Thematic vicinity categories

e.g. Physics

Page 22: NLP in Web Data Extraction (Omer Gunes)

YAGO Knowledge Extraction – The MEANS Relation

● WordNet reveals the meaning of words by its synsets.

“urban center” & “metropolis” belong to the “city” synset

(“metropolis”, MEANS, city)

● Wikipedia redirect system gives alternative names for the entities.

(“Einstein, Albert”, MEANS, Albert Einstein)

● The relations GivenNameOf & FamilyNameOf are sub-relations of MEANS

A name parser is used to identify and to decompose person names.

(“Einstein”, FamilyNameOf, AlbertEinstein)
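A naive version of such a name parser simply splits on whitespace; real name parsing must handle titles, middle names, and non-Western name orders, so this is only a sketch (the function name is illustrative):

```python
def decompose_name(full_name, entity):
    """Emit GivenNameOf / FamilyNameOf facts for a two-or-more-part name."""
    parts = full_name.split()
    if len(parts) < 2:
        return []  # mononyms cannot be decomposed
    return [(parts[0], "GivenNameOf", entity),
            (parts[-1], "FamilyNameOf", entity)]

print(decompose_name("Albert Einstein", "AlbertEinstein"))
# → [('Albert', 'GivenNameOf', 'AlbertEinstein'),
#    ('Einstein', 'FamilyNameOf', 'AlbertEinstein')]
```

Since GivenNameOf and FamilyNameOf are sub-relations of MEANS, each emitted fact also implies a MEANS fact for the same word.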

Page 23: NLP in Web Data Extraction (Omer Gunes)

YAGO Evaluation – Accuracy of YAGO

Page 24: NLP in Web Data Extraction (Omer Gunes)

YAGO Evaluation – Size of YAGO (facts)

Page 25: NLP in Web Data Extraction (Omer Gunes)

YAGO Evaluation – Size of YAGO (entities) and Size of Other Ontologies

Page 26: NLP in Web Data Extraction (Omer Gunes)

YAGO Evaluation – Sample facts on YAGO

Page 27: NLP in Web Data Extraction (Omer Gunes)

YAGO Evaluation – Sample queries on YAGO

Page 28: NLP in Web Data Extraction (Omer Gunes)

SOFIE Background & Motivation

● Three systems (YAGO, Kylin/KOG, DBpedia):

● used IE methods for constructing large ontologies.

● leveraged high-quality hand crafted sources with latent knowledge for collecting individual entities & facts.

● combined their results with a taxonomic hierarchy such as WordNet & SUMO.

● Empirical assessment has shown that these approaches have an accuracy > 95%.

● They are close to the best hand-crafted rules.

● The resulting ontologies

● contain

– millions of entities

– tens of millions of facts

● organized in a consistent manner by a transitive & acyclic subclass relation.

Page 29: NLP in Web Data Extraction (Omer Gunes)

SOFIE Background & Motivation

● Next stage: expanding & maintaining automatically compiled ontologies as knowledge keeps evolving.

● Wikipedia's semi-structured knowledge is huge but limited.

● NL text sources such as news articles, biographies, scientific publications, and full-text Wiki articles must be brought into scope.

● The existing 80% accuracy is unacceptable for an ontological knowledge base.

● Key idea: leverage the existing ontology for its own growth:

● Use trusted facts as a basis for generating good patterns.

● Scrutinize the resulting hypotheses with regard to their consistency with the already known facts.


Page 31: NLP in Web Data Extraction (Omer Gunes)

SOFIE Contribution

● 3 problems;

● Pattern selection

● Entity disambiguation

● Consistency checking

● Proposed Approach :

● A unified model for ontology-oriented IE solving all three issues simultaneously.

● They cast known facts, hypotheses for new facts, word-to-entity mappings, gathered sets of patterns, and a configurable set of semantic constraints into a unified framework of logical clauses.

● All three problems can be seen as a Weighted MAX-SAT problem, the task of identifying a maximal-weight set of consistent clauses.
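To make the optimisation concrete, here is a brute-force Weighted MAX-SAT sketch over a toy instance: pattern evidence supports two competing birthplace facts, while a functional constraint penalises accepting both. The clause weights and variable names are invented for illustration, and SOFIE itself uses an approximation algorithm rather than full enumeration:

```python
from itertools import product

# A clause is (weight, [(variable, polarity), ...]); it is satisfied
# if at least one literal holds under the assignment.
clauses = [
    (5, [("bornIn_Ulm", True)]),                              # pattern evidence
    (3, [("bornIn_Berlin", True)]),                           # weaker evidence
    (9, [("bornIn_Ulm", False), ("bornIn_Berlin", False)]),   # functional constraint
]
variables = sorted({v for _, lits in clauses for v, _ in lits})

def weight(assignment):
    """Total weight of the clauses satisfied by this assignment."""
    return sum(w for w, lits in clauses
               if any(assignment[v] == pol for v, pol in lits))

# Enumerate all truth assignments and keep the heaviest one.
best = max((dict(zip(variables, bits)) for bits in
            product([False, True], repeat=len(variables))), key=weight)
print(best)  # the best assignment keeps bornIn_Ulm and drops bornIn_Berlin
```

The constraint clause outweighs the weaker evidence, so the solver accepts the better-supported fact and rejects its competitor, which is exactly the consistency-checking behaviour described above.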

Page 32: NLP in Web Data Extraction (Omer Gunes)

SOFIE Contribution

Implementation :

● A new model for consistent growth of a large ontology.

● A unified model for

● Pattern selection

● Entity disambiguation

● Consistency checking

● Identification of the best hypotheses for new facts

● An efficient algorithm for the resulting Weighted MAX-SAT problem.

● Experiments with a variety of real-life textual & semi-structured sources demonstrating the scalability & high accuracy of the approach.


Page 34: NLP in Web Data Extraction (Omer Gunes)

SOFIE Model

Assumptions:

● An existing ontology

● A given corpus of documents

Statements :

● A word in context (wic) is a pair of word & context.

● The context of a word is simply the document in which the word appears. Thus, a wic is a pair of word & document. The notation being used is word@document.

● Wics are entities as well.

● Each statement can have an associated truth value of 1 or 0.

Page 35: NLP in Web Data Extraction (Omer Gunes)

SOFIE Model

Statements :

● A prefix notation is used for statements, as SOFIE deals with relations of arbitrary arity.

e.g. bornIn(Albert Einstein, Ulm)[1]

A fact is a statement with truth value 1.

A statement with an unknown truth value is called a hypothesis.

● The ontology is considered as a set of facts.

e.g. bornOnDate(AlbertEinstein, 1879-03-14)[1]

● SOFIE will extract textual information from the corpus.

Textual information takes the form of facts.

One type of fact asserts the number of times that a pattern occurred with two wics.

e.g. patternOcc(“X went to school in Y”, Einstein@29, Germany@29)[1]
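A patternOcc fact can be produced by matching a textual pattern with X/Y slots against a document. A simplified sketch (the function name, slot convention, and document id are illustrative):

```python
import re

def pattern_occurrences(pattern, doc_id, text):
    """Find matches of a pattern with X/Y slots and emit patternOcc facts."""
    # Escape the literal text of the pattern, then open the X and Y slots.
    regex = re.escape(pattern).replace("X", r"(\w+)").replace("Y", r"(\w+)")
    return [("patternOcc", pattern, f"{x}@{doc_id}", f"{y}@{doc_id}")
            for x, y in re.findall(regex, text)]

print(pattern_occurrences("X went to school in Y", 29,
                          "Einstein went to school in Germany"))
# → [('patternOcc', 'X went to school in Y', 'Einstein@29', 'Germany@29')]
```

Note that the arguments are wics (word@document), not entities: which entity each wic denotes is decided later, jointly with the other hypotheses.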

Page 36: NLP in Web Data Extraction (Omer Gunes)

SOFIE Model

Statements :

● Another type of fact expresses how likely it is, from a linguistic point of view, that a wic refers to a certain entity. This is called a “disambiguation prior”.

e.g. disambiguation prior of the wic Elvis@29

disambPrior(Elvis@29, Elvis Presley, 0.8)[1]

disambPrior(Elvis@29, Elvis Costello, 0.2)[1]
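Resolving a wic by its disambiguation prior alone can be sketched as follows. SOFIE actually weighs the prior against the semantic constraints when solving the MAX-SAT problem; the table and function here are only illustrative:

```python
# disambPrior facts from the example above, keyed by (wic, entity).
disamb_prior = {
    ("Elvis@29", "ElvisPresley"): 0.8,
    ("Elvis@29", "ElvisCostello"): 0.2,
}

def disambiguate(wic):
    """Pick the candidate entity with the highest disambiguation prior."""
    candidates = [(e, p) for (w, e), p in disamb_prior.items() if w == wic]
    return max(candidates, key=lambda ep: ep[1])[0]

print(disambiguate("Elvis@29"))  # → ElvisPresley
```

In the full system this prior-only choice can be overridden: if accepting ElvisPresley would violate a semantic constraint, the lower-prior candidate may still win.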

● Hypotheses that are formed by SOFIE

● Can concern the disambiguation of wics,

e.g. disambiguateAs(Java@DS, JavaProgrammingLanguage)[?]

● Can hypothesize about whether a certain pattern expresses a certain relation, e.g. expresses(“X was born in Y”, bornInLocation)[?]

● Can hypothesize about potential new facts

e.g. developed(Microsoft, JavaProgrammingLanguage)[?]

● Unifying Framework :

● SOFIE unifies the domains of ontology & information extraction.

● For SOFIE, there exist only statements.

● SOFIE will try to figure out which hypotheses are likely to be true.