Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.


Page 1: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Srihari-CSE635-Fall 2002

CSE 635 Multimedia Information Retrieval

Information Extraction

Page 2: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Overview

Introduction to IE

Named Entity tagger HMM approach

Relationship/Event detection

Text Mining intelligence applications

Page 3: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Information Extraction

What is IE?

• The identification of instances of a particular class of events or relationships in a natural language text, and the extraction of the relevant arguments of the event or relationship. (MUC, de facto)

• Information Extraction involves the creation of a structured representation (such as a database) of selected information drawn from the text. (Grishman 1997)

• The identification of key entities, the relationships between them, and significant activity involving these entities. (Srihari)

Goals of IE

• Transform unstructured text into structured/semi-structured text.

• Automatic template-filling.

• Automatically populate databases.

• Facilitate information discovery: sometimes, what you don’t know is most important. If you know what you are looking for, use a search engine! IE permits information discovery.

Page 4: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Information to Intelligence

[Figure: pipeline from Unstructured Data to Information to Intelligence. Labels: People, Company, Product; Entities, relationships, events; Text mining, analytics.]

Sample unstructured data (news headlines):

• INTC drops X%

• Microsoft, Lockheed eye federal deals

• C-bridge, eXcelon to merge

• RF Micro Devices Introduces Cellular CDMA LNA and PA Driver Amplifier with Bypass Switch

• Transmeta Scores Latest Crusoe Win with Sharp

• Ronald Brumback Named Pres. & COO of Top Layer Networks

• Top INTC executive, John Doe, leaves to join Transmeta as VP Engineering

• FedEx to Cut 130 Jobs in Texas

Sample questions:

• What’s new from RFMD?

• What caused INTC shares to drop?

Page 5: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Levels of Information Extraction

MUC identifies the following levels of extraction:

• Named Entity Tagging: Bill Gates is the chairman of Microsoft

• Relationship Detection (leads to entity profiles): chairman-of(Bill Gates, Microsoft)

• Event Detection: executive change, with slots person_in, person_out, company_involved, date

• Scenario Extraction: bombing incident, with slots where, # of casualties, reason, and the follow-up events involved, ordered sequentially

Page 6: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Named Entity Tagging

Bridgestone Sports Co. said Friday it has set up a joint venture in Hong Kong with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.

The joint venture, Bridgestone Sports Hong Kong Co., capitalized at 20 million Hong Kong dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month. The monthly output will be later raised to 50,000 units, Bridgestone Sports spokesman Tom White said.

The new company, based in Kaohsiung, southern Hong Kong, is owned 75 pct by Bridgestone Sports, 15 pct by Union Precision Casting Co. of Hong Kong and the remainder by Taga Co., a company active in trading with Hong Kong, the officials said.

Page 7: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Output of Named Entity Tagger

<company> Bridgestone Sports Co. </company> said <date> Friday </date> it has set up a joint venture in <city> Hong Kong </city> with a local concern and a <ethnic> Japanese </ethnic> trading house to produce golf clubs to be shipped to <country> Japan </country>.

The joint venture, <company> Bridgestone Sports Hong Kong Co. </company>, capitalized at <money> 20 million Hong Kong dollars </money>, will start production in <date> January 1990 </date> with production of 20,000 iron and "metal wood" clubs a month. The monthly output will be later raised to 50,000 units, <company> Bridgestone Sports </company> spokesman <man> Tom White </man> said.
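
Downstream modules usually need the tagged spans as data rather than as inline markup. The following is a minimal Python sketch, assuming the tagger emits well-formed `<type> text </type>` spans like those above; the names `TAGGED_SPAN` and `extract_entities` are illustrative and do not come from any tagger described in these slides.

```python
import re

# Matches <type> text </type>, where the closing tag repeats the opening tag name.
TAGGED_SPAN = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")


def extract_entities(tagged_text):
    """Return a list of (name_class, text) pairs found in the tagged text."""
    return [(m.group(1), m.group(2)) for m in TAGGED_SPAN.finditer(tagged_text)]


sample = (
    "<company> Bridgestone Sports Co. </company> said <date> Friday </date> "
    "it has set up a joint venture in <city> Hong Kong </city>"
)
entities = extract_entities(sample)
for name_class, text in entities:
    print(name_class, "->", text)
```

A span whose closing tag does not match its opening tag is skipped by this pattern rather than repaired.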

Page 8: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Named-Entity Definition

• A named entity is a word or phrase that denotes a proper name (person, organization, location, product) or a temporal or numerical expression.

• Name classes are associated with individual words.

• A named entity corresponds to a contiguous sequence of words that share the same name class.

Page 9: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Entity Profiles

<Person Profile id=1>:

name: Waleed Alshehri
aliases: Waleed
position: a Saudi commercial pilot
age: mid-20s
gender: MALE
education: Embry-Riddle Aeronautical University; FlightSafety Academy
associations: Satam Al Suqami; Wail Alshehri; Homing Inn; American Flight 11
Events-involved: <graduated>; <hijacking>; <suicide attack>
descriptors: quiet and private; Middle Eastern backgrounds; another of the eventual hijackers

Page 10: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Event Detection

Event: <MOVEMENT>
who: 23 foreign fighters
whereto: into Pakistan
Location: Pakistan, Afghanistan
When: normalStr=020622 Monday
Snippet: Pakistan said Monday its troops arrested 23 foreign fighters trying to cross from Afghanistan into Pakistan over the weekend.

Event: <CONTRACT>
Money_involved: £5.9 million ($8.9 million)
Who: CVF Team, Thomson-CSF, Lockheed Martin, Raytheon, BMT Defense Services, Defense Procurement Agency
When: normalStr=021100 last November
Snippet: The BAE Systems-led CVF Team and a rival Thomson-CSF group, including Lockheed Martin, Raytheon and BMT Defense Services, were awarded parallel £5.9 million ($8.9 million) contracts by the Defense Procurement Agency last November to undertake first-stage assessment phase work for CVF.

Page 11: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

3 Major Approaches to IE

• Layout-based: wrapper induction; application focused, e.g. jobs database, processing resumes, etc.

• IR-based: “concept” extraction; uses techniques such as pattern matching, proximity, and co-occurrence; often seen in Knowledge Management applications (e.g. hardware)

• NLP-based: statistical techniques (POS tagging, NE tagging) and grammatical techniques; more sophisticated levels of IE are possible

Page 12: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Convergence of NLP-driven and IR-driven Approaches to IE

[Figure: Information Extraction split into Layout-based, IR-based, and NLP-based approaches. Recoverable labels follow, in reading order.]

Entities; Relationships; Events

* Grammars
* Statistical Language Models

Tag key phrases in context; Associate key phrases with entities

* Lexical Lookups
* Word Co-occurrence
* Heuristics

Concept Tagging; Domain-specific Event Detection

* Expert Lexicons
* Lexicon Grammars

Generic; Domain-Specific

Focus on Precision; Focus on Recall

Page 13: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Challenges in IE

• Normalization: temporal references (today, last year, during the Olympics …); spatial references (Buffalo)

• Alias resolution: George Bush, President Bush; IBM, “the company”

• Verb concepts: kill, murder, assassinate, etc.

• Diversity of sources: web documents, e-mail, PowerPoint, speech/OCR transcripts; sophisticated pre-processing required

• Cross-document information consolidation

• Rapid domain porting

• Intuitive user interface: should support decision-making workflow, visualization, etc.

Page 14: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Homeland Defense: Track Key Entities Based on Watch Lists

[Screenshot: Information Discovery Portal tracking al-Qaeda. Recoverable panel contents follow.]

Track...: Organizations, People, Targets

Associations: Who/what is being associated with al-Qaeda?

• Organizations: Religious; Political; Terrorist
  - al-Jihad (34)
  - HAMAS (16)
  - Hizballah (5)
  - …more
• People
• Incidents
  - Attacks (125)
  - Bombing (64)
  - Threats (45)
  - …more
• Locations
• Weapons
• Governments

Overall Coverage: Events, Info. Sources, Documents

[Chart: Overall Coverage of al-Qaeda Over Time; y-axis: # Reports, 0 to 50]

Alerts for Week of August 6, 2001

• (3) new reports of al-Qaeda terrorist activity
• (1) new report of bin Laden sighting
• (4) new quotes by bin Laden
• (1) new target identified

Reports

Discover Other Related Information

Page 15: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Name-Class Definition

OR: organization; CO: company (“Bridgestone Sports Co.”, “Bridgestone Sports Hong Kong Co.”, “Bridgestone Sports”)

LO: location; CI: city (“Hong Kong”); CT: country (“Japan”)

PE: person; MAN: man (“Tom White”)

TI: time; DA: date (“Friday”)

NN: not a name (“said”, “it has set up a joint venture”, “with a local concern and a”, “trading house to produce golf”)

Page 16: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Name-Class Tree

There are 6 top-level name-classes, and 35 sub-type name-classes.

Time -- Hour, Part Day, Duration, Frequency, Age, Day, Month, Season, Year, Decade, Century

Location -- City, Province, Country, Continent, Ocean, Lake, River, Mountain, Road, Region, District, Airport

Organization -- Company, Government, Association, School, Army, Mass Media

Person -- Man, Woman

Product -- Vehicle, Software

Event -- Conference, Exhibition
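
The tree above can be written down as a small lookup table. This is a sketch only: the class names are transcribed from the slide, while the dictionary layout and the helper `top_level_of` are assumptions made for illustration.

```python
# Name-class tree transcribed from the slide; the dict layout is an illustrative choice.
NAME_CLASS_TREE = {
    "Time": ["Hour", "Part Day", "Duration", "Frequency", "Age", "Day",
             "Month", "Season", "Year", "Decade", "Century"],
    "Location": ["City", "Province", "Country", "Continent", "Ocean", "Lake",
                 "River", "Mountain", "Road", "Region", "District", "Airport"],
    "Organization": ["Company", "Government", "Association", "School",
                     "Army", "Mass Media"],
    "Person": ["Man", "Woman"],
    "Product": ["Vehicle", "Software"],
    "Event": ["Conference", "Exhibition"],
}


def top_level_of(sub_type):
    """Return the top-level name class that owns a sub-type, or None if unknown."""
    for top_level, sub_types in NAME_CLASS_TREE.items():
        if sub_type in sub_types:
            return top_level
    return None


# The totals agree with the slide: 6 top-level classes and 35 sub-types.
print(len(NAME_CLASS_TREE), sum(len(v) for v in NAME_CLASS_TREE.values()))
```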

Page 17: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Application of Named Entity Tagging

• Question-Answering System

Q: Where did Bridgestone Sports Co. set up a joint venture?

A: Hong Kong

Q: When did Bridgestone Sports Hong Kong Co. start production?

A: January 1990

Q: Who is the spokesman for Bridgestone Sports?

A: Tom White

Page 18: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Question Asking Points and Named Entities

Where → Location

Q: Where did Bridgestone Sports Co. set up a joint venture?

A: Hong Kong

When → Time

Q: When did Bridgestone Sports Hong Kong Co. start production?

A: January 1990

Who → Person

Q: Who is the spokesman for Bridgestone Sports?

A: Tom White
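
The link between asking points and name classes can be sketched as a lookup followed by a filter over the tagger output. This is a minimal illustration, not a question-answering system: the mapping table, the function `candidate_answers`, and the hand-written entity list are all assumptions, and the filter only narrows the candidates (a real system would still have to pick among them).

```python
# Asking point (first word of the question) -> top-level name class code.
ASKING_POINT_TO_CLASS = {"where": "LO", "when": "TI", "who": "PE"}


def candidate_answers(question, tagged_entities):
    """Return entity strings whose name class matches the question's asking point."""
    words = question.split()
    if not words:
        return []
    wanted = ASKING_POINT_TO_CLASS.get(words[0].lower())
    return [text for text, name_class in tagged_entities if name_class == wanted]


# Hand-tagged entities from the Bridgestone example (top-level classes only).
entities = [
    ("Bridgestone Sports Co.", "OR"),
    ("Hong Kong", "LO"),
    ("January 1990", "TI"),
    ("Tom White", "PE"),
]

print(candidate_answers("Where did Bridgestone Sports Co. set up a joint venture?", entities))
print(candidate_answers("Who is the spokesman for Bridgestone Sports?", entities))
```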

Page 19: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Application of Named Entity Tagging (contd.)

Support other Information Extraction tasks

Extract Correlated Entities (relationship):

entity 1: Tom White (man)
relation: employed by
entity 2: Bridgestone Sports (company)

Extract events:

predicate: start
argument 1: Bridgestone Sports Hong Kong Co. (company)
argument 2: production
time: January 1990 (date)

Page 20: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Other Applications of NE

Search engines

text categorization/filtering

data mining

Page 21: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Statistical Model for Named Entity Tagging

Given a sequence of words (W), our goal is to find the sequence of name classes (NC) with maximum Pr(NC|W).

For example:

word sequence: it has set up a joint venture in Hong Kong

Possible name-class sequences:

it   has  set  up   a    joint  venture  in   Hong  Kong
NN   NN   NN   NN   NN   NN     NN       NN   LO    LO
LO   NN   NN   NN   NN   NN     NN       NN   OR    LO

$$\arg\max_{\text{NC Sequence}} \Pr(\text{NC Sequence} \mid \text{W Sequence})$$

Page 22: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Statistical Model for Named Entity Tagging (contd.)

• Construct a manually tagged training corpus.

• Extract the necessary statistics from the corpus to build a statistical model that can automatically compute Pr(NC Sequence | W Sequence) for unseen data.

• Search for the NC sequence that maximizes the probability Pr(NC Sequence | W Sequence).

Corpus → Statistical Model → tagging of unseen data

Page 23: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Statistical Model for Named Entity Tagging (contd.)

• The training corpus is large enough to provide fairly good unigram and bigram information.

unigram example: Pr(Organization | “US”)

bigram example: Pr(Organization | “US”, “the”)

• The training corpus is too small to support any direct estimation beyond bigrams.

• Question: how do we evaluate Pr(NC Sequence | Sentence) from the unigram and bigram information above?

• One solution: transform the conditional probability into the joint probability of (NC, Sentence) (Bayes’ rule), then decompose the sentence into bigram sequences (Markov assumption).

Page 24: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

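Bayes’ Rule

Using Bayes’ rule, we have

$$
\begin{aligned}
\arg\max_{\text{NC Sequence}} \Pr(\text{NC Sequence} \mid \text{W Sequence})
&= \arg\max_{\text{NC Sequence}} \frac{\Pr(\text{W Sequence}, \text{NC Sequence})}{\Pr(\text{W Sequence})} \\
&= \arg\max_{\text{NC Sequence}} \Pr(\text{W Sequence}, \text{NC Sequence})
\end{aligned}
$$

The denominator Pr(W Sequence) does not depend on the NC sequence, so it can be dropped from the maximization.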

Page 25: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

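Markov Assumption

$$
\begin{aligned}
\Pr(\text{NC Sequence}, \text{W Sequence})
&= \Pr(w_0, nc_0, \ldots, w_{n-1}, nc_{n-1}, w_n, nc_n) \\
&= \Pr(w_0, nc_0)\,\Pr(w_1, nc_1 \mid w_0, nc_0)\,\Pr(w_2, nc_2 \mid w_0, nc_0, w_1, nc_1) \cdots \\
&\qquad \Pr(w_n, nc_n \mid w_0, nc_0, \ldots, w_{n-1}, nc_{n-1})
\end{aligned}
$$

By Markov assumption, we have

$$
\begin{aligned}
\Pr(w_2, nc_2 \mid w_0, nc_0, w_1, nc_1) &= \Pr(w_2, nc_2 \mid w_1, nc_1) \\
&\;\;\vdots \\
\Pr(w_n, nc_n \mid w_0, nc_0, \ldots, w_{n-1}, nc_{n-1}) &= \Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1})
\end{aligned}
$$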

Page 26: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Markov Assumption (contd.)

So the final formula is

$$
\Pr(\text{NC Sequence}, \text{W Sequence}) = \Pr(w_0, nc_0)\,\Pr(w_1, nc_1 \mid w_0, nc_0)\,\Pr(w_2, nc_2 \mid w_1, nc_1) \cdots \Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1})
$$
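
The bigram factorization can be evaluated directly once the individual factors are available. The sketch below works in log space to avoid underflow on long sentences; the function name `joint_log_prob` and the two lookup tables are assumptions, and the probabilities in the example are made-up numbers used only to show the arithmetic.

```python
import math


def joint_log_prob(pairs, initial, transition):
    """Log of Pr(w_0,nc_0) * prod_n Pr(w_n,nc_n | w_{n-1},nc_{n-1}).

    pairs:      list of (word, name_class) tuples for one sentence
    initial:    dict mapping (word, name_class) -> Pr(w_0, nc_0)
    transition: dict mapping (previous_pair, current_pair) -> conditional probability
    """
    if not pairs:
        return 0.0
    log_p = math.log(initial[pairs[0]])
    for previous, current in zip(pairs, pairs[1:]):
        log_p += math.log(transition[(previous, current)])
    return log_p


# Toy numbers, invented for illustration only.
initial = {("Hong", "LO"): 0.5}
transition = {(("Hong", "LO"), ("Kong", "LO")): 0.25}
sentence = [("Hong", "LO"), ("Kong", "LO")]

print(math.exp(joint_log_prob(sentence, initial, transition)))
```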

Page 27: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Hidden Markov Model

Define a Hidden Markov Model as follows:

1. An output alphabet Ή = {0, 1, …, V-1};
2. A state space ф = {1, 2, …, c};
3. A transition probability distribution between states and associated output symbols, p(symbol_n, state_n | symbol_{n-1}, state_{n-1}).

In the case of named entity tagging, regard the words as the output symbols and the tags as the states. The statistical NE model above is then a Hidden Markov Model.

        W1   W2   W3   W4   …
<SS>    PE   PE   PE   PE
        LO   LO   LO   LO
        OR   OR   OR   OR

Page 28: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

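Statistics Estimation

The generation of words and name classes proceeds in three steps:

$$
\Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1}) =
\begin{cases}
\Pr(nc_n \mid w_{n-1}, nc_{n-1})\,\Pr(w_n \mid nc_n, nc_{n-1}) & \text{if } nc_n \neq nc_{n-1} \\
\Pr(w_n \mid w_{n-1}, nc_{n-1}) & \text{if } nc_n = nc_{n-1}
\end{cases}
$$

The maximum likelihood estimates (MLE) of the above probabilities are as follows, where C(·) is the number of times the event occurs in the training corpus:

$$
\begin{aligned}
\Pr(nc_n \mid w_{n-1}, nc_{n-1}) &= \frac{C(nc_n, w_{n-1}, nc_{n-1})}{C(w_{n-1}, nc_{n-1})} \\
\Pr(w_n \mid nc_n, nc_{n-1}) &= \frac{C(w_n, nc_n, nc_{n-1})}{C(nc_n, nc_{n-1})} \\
\Pr(w_n \mid w_{n-1}, nc_{n-1}) &= \frac{C(w_n, w_{n-1}, nc_{n-1})}{C(w_{n-1}, nc_{n-1})}
\end{aligned}
$$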
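
The count ratios can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the estimation procedure of any particular tagger: counts are taken over all adjacent (word, name class) pairs, there is no smoothing or back-off (so unseen events get probability 0), and the function names and the three-sentence corpus are invented for the example.

```python
from collections import Counter


def collect_counts(tagged_sentences):
    """Count the events that appear in the three count ratios."""
    names = ("prev", "class_pair", "class_after_prev", "word_with_classes", "word_after_prev")
    counts = {name: Counter() for name in names}
    for sentence in tagged_sentences:
        for (w_prev, nc_prev), (w, nc) in zip(sentence, sentence[1:]):
            counts["prev"][(w_prev, nc_prev)] += 1                   # C(w_{n-1}, nc_{n-1})
            counts["class_pair"][(nc, nc_prev)] += 1                 # C(nc_n, nc_{n-1})
            counts["class_after_prev"][(nc, w_prev, nc_prev)] += 1   # C(nc_n, w_{n-1}, nc_{n-1})
            counts["word_with_classes"][(w, nc, nc_prev)] += 1       # C(w_n, nc_n, nc_{n-1})
            counts["word_after_prev"][(w, w_prev, nc_prev)] += 1     # C(w_n, w_{n-1}, nc_{n-1})
    return counts


def ratio(numerator, denominator):
    """Count ratio, defined as 0.0 when the conditioning event was never seen."""
    return numerator / denominator if denominator else 0.0


def pr_class(counts, nc, w_prev, nc_prev):
    """Estimate Pr(nc_n | w_{n-1}, nc_{n-1})."""
    return ratio(counts["class_after_prev"][(nc, w_prev, nc_prev)], counts["prev"][(w_prev, nc_prev)])


def pr_first_word(counts, w, nc, nc_prev):
    """Estimate Pr(w_n | nc_n, nc_{n-1})."""
    return ratio(counts["word_with_classes"][(w, nc, nc_prev)], counts["class_pair"][(nc, nc_prev)])


def pr_next_word(counts, w, w_prev, nc_prev):
    """Estimate Pr(w_n | w_{n-1}, nc_{n-1})."""
    return ratio(counts["word_after_prev"][(w, w_prev, nc_prev)], counts["prev"][(w_prev, nc_prev)])


# Tiny invented corpus, for illustration only.
corpus = [
    [("in", "NN"), ("Hong", "LO"), ("Kong", "LO")],
    [("in", "NN"), ("Japan", "LO")],
    [("in", "NN"), ("January", "TI")],
]
counts = collect_counts(corpus)
print(pr_class(counts, "LO", "in", "NN"))
print(pr_first_word(counts, "Hong", "LO", "NN"))
print(pr_next_word(counts, "Kong", "Hong", "LO"))
```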

Page 29: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Easy and Difficult Cases

Some cases are easy:

• Matsushita Electric Industrial Co. has reached agreement …

• Victor Co. of Japan (JVC) and Sony Corp. ...

Some cases are particularly difficult:

• In a factory of Blaupunkt Werke, a Robert Bosch subsidiary, …

• Touch Panel Systems, capitalized at 50 million Yen is owned ...

Page 30: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Machine learning vs. handcrafted rules

Handcrafted finite state patterns can be very effective:

<proper-noun>+ <corporate designator> --> <corporation>    e.g. Sony Corp.

Problems with the handcrafted approach:

• each new source requires tweaking, i.e. domain porting can be tedious

• speech recognition transcripts and OCR output require modification of the rules

• rules for different languages are radically different

The machine learning approach is more scalable.

• exception: numerical expressions and other patterns which are very regular, e.g. contact information (telephone numbers, URLs, postal addresses, etc.)
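
The handcrafted corporation pattern can be approximated with a regular expression. This is a rough sketch under assumptions: capitalization stands in for a real proper-noun test, and the designator list is invented apart from "Corp." and "Co.", which appear in the slides. It also shows why such rules are brittle, since a company name without a designator is missed.

```python
import re

# Assumed list of corporate designators; extend per source and language.
DESIGNATORS = r"(?:Corp\.|Co\.|Inc\.|Ltd\.)"

# One or more capitalized words (a crude proxy for <proper-noun>+), then a designator.
CORPORATION = re.compile(r"(?:[A-Z][A-Za-z&-]*\s+)+" + DESIGNATORS)


def find_corporations(text):
    """Return all substrings that match the corporation pattern."""
    return [m.group(0) for m in CORPORATION.finditer(text)]


print(find_corporations(
    "Matsushita Electric Industrial Co. has reached agreement with Sony Corp. today"
))
print(find_corporations("In a factory of Blaupunkt Werke, a Robert Bosch subsidiary"))
```

A capitalized sentence-initial word directly before a company name would be absorbed into the match, which is one of the tweaks a handcrafted rule set ends up accumulating.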

Page 31: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

NE tagger: Bikel et al.

PDF file

Page 32: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Viterbi Search

The Viterbi search algorithm is used to find the NC sequence that maximizes the probability Pr(NC Sequence, W Sequence).

        W1   W2   W3   W4   …
<SS>    PE   PE   PE   PE
        LO   LO   LO   LO
        OR   OR   OR   OR

[Figure: trellis annotated with the scores -0.2, -1.2, -0.9, -0.8, -0.3, -0.05]

• The best paths reaching the nodes associated with W1 are self-evident.

• Compute the best paths reaching the nodes associated with W2. Three paths reach the node (W2, PE): (PE, PE, -1.0), (LO, PE, -1.5), and (OR, PE, -0.95). The best path reaching (W2, PE) is (OR, PE, -0.95).

• Keep only the best path reaching each node and continue the same computation with the next word.
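
The search can be sketched as follows, using additive log scores. The function `viterbi` and its callback signatures are illustrative choices. The toy scores rest on an assumption: the six numbers from the figure are read as scores for reaching the W1 nodes (PE -0.2, LO -1.2, OR -0.9) and for moving from each of them into (W2, PE) (-0.8, -0.3, -0.05), which reproduces the three path scores quoted on the slide. All other transitions are given an invented score of -5.0.

```python
def viterbi(words, states, start_score, step_score):
    """Return (best_score, best_state_path) for additive log scores.

    start_score(state, word) scores the first word in a state;
    step_score(prev_state, prev_word, state, word) scores one transition.
    """
    if not words:
        return 0.0, []
    # Best (score, path) reaching each state at the first word.
    best = {s: (start_score(s, words[0]), [s]) for s in states}
    for prev_word, word in zip(words, words[1:]):
        new_best = {}
        for s in states:
            # Keep only the best path reaching node (word, s).
            score, path = max(
                ((best[p][0] + step_score(p, prev_word, s, word), best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (score, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy scores (see the assumptions stated above).
START = {"PE": -0.2, "LO": -1.2, "OR": -0.9}
INTO_PE = {"PE": -0.8, "LO": -0.3, "OR": -0.05}


def start_score(state, word):
    return START[state]


def step_score(prev_state, prev_word, state, word):
    return INTO_PE[prev_state] if state == "PE" else -5.0


score, path = viterbi(["W1", "W2"], ["PE", "LO", "OR"], start_score, step_score)
print(path, round(score, 2))
```

Because each node keeps a single best incoming path, the work per word is proportional to the square of the number of states rather than growing exponentially with sentence length.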

Page 33: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

What next?

We know how to tag NEs locally. What next?

• Alias resolution: George W. Bush, President Bush, Bush

• Relationship extraction: affiliation, spouse, address

• Event Detection

• Entity Profiles

Page 34: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Extracting relationships and events

Two major approaches: grammatical and statistical.

Grammatical approaches

• require SVO parsing and semantic parsing as a first step

• follow up with specialized relationship and event extraction grammars

• two approaches here also: one behemoth grammar (CFG), or cascaded finite state grammars

Statistical approaches

• supervised learning approach

• unsupervised approach using extraction patterns

Page 35: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Architecture of InfoXtract Engine/Platform

[Figure: architecture diagram of the InfoXtract engine. Recoverable labels follow.]

Component labels: Document Processor; Knowledge Resources; Lexicon Resources; Grammars; Output Manager; Process Manager; Web Server

Linguistic Modules: Tokenizer; Lexicon Lookup; Pragmatic Filtering; POS Tagging; Named Entity Detection; Shallow Parsing; Semantic Parsing; Relationship Detection; Number Normalization; Alias/Coreference Linking; Time/Location Normalization; Profile/Event Linking; Profile/Event Merge

Data and interface labels: Source Document; Zoned Text; Document; Token List; XML Formatted Extracted Document; Document & Error Log; HTTP Post; HTTP Response; HTTP; CORBA

Output labels: NE; PE; CE; SVO; CO; Profile; GE

Legend (module types): FST Module; Procedure or Statistical Model; Hybrid Module; Natural Language Processing; Hybrid Model

Abbreviations:

NE: Named Entity
CE: Correlated Entity
SVO: Subject-Verb-Object
CO: Co-reference
GE: General Event
PE: Pre-defined Event
POS: Part Of Speech
FST: Finite State Transducer

Page 36: Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Information Extraction.

Adapting FSTs for NLP engines

• Traditionally, FSTs have operated on character streams, for both input and output; they are primarily used in lexical transducers.

• The InfoXtract tokenizer converts the input stream into a tokenlist; all subsequent modules operate on the tokenlist.

• The tokenlist contains the following information:
  - linguistic features (POS, semantic class from WordNet, etc.)
  - linguistic structures derived from NLP (e.g., SVO)
  - information extraction output: NE, relationships, events
  - pointers to tokens (text offsets)
  - real objects (text strings) as well as virtual objects

• FST grammars operate on tokenlists and can utilize features at several levels: character/string level, structure level; equivalent to tree-walking automata.
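
The idea of a token list that carries offsets and features can be sketched as follows. This is not the InfoXtract data structure: the `Token` class, the whitespace tokenizer, and the linear matcher `match_feature_sequence` are assumptions, and the matcher is far simpler than an FST grammar. It only illustrates rules that test token-level features instead of characters.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str    # real object: the surface string
    start: int   # text offset of the first character
    end: int     # text offset one past the last character
    features: dict = field(default_factory=dict)  # e.g. {"pos": "NNP", "ne": "CO"}


def tokenize(text):
    """Whitespace tokenizer that records offsets (a stand-in for a real tokenizer)."""
    tokens, position = [], 0
    for piece in text.split():
        start = text.index(piece, position)
        end = start + len(piece)
        tokens.append(Token(piece, start, end))
        position = end
    return tokens


def match_feature_sequence(tokens, feature, values):
    """Return start indices where consecutive tokens carry the given feature values."""
    n = len(values)
    return [
        i for i in range(len(tokens) - n + 1)
        if all(tokens[i + k].features.get(feature) == values[k] for k in range(n))
    ]


text = "Sony Corp. said"
tokens = tokenize(text)
# Hand-assigned POS tags; in a pipeline these would come from an earlier module.
for token, pos in zip(tokens, ["NNP", "NNP", "VBD"]):
    token.features["pos"] = pos

print(match_feature_sequence(tokens, "pos", ["NNP", "NNP"]))
```

Later modules can add keys such as a name class to `features` without changing the tokens themselves, which is what allows one rule formalism to test features from several levels.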