WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14.


Today’s Topic: Information Extraction

Information extraction: Intro
Hidden Markov models
Evaluation and outlook

Text classification vs. information extraction

[Figure: text classification (TC) contrasted with information extraction (IE)]

Information Extraction: Definition

Given:
  Unstructured text, or slightly structured text such as HTML
  A template with “slots”
    Common slots: author, date, location, company
Information extraction task:
  Analyze the document
  Fill the template slots with values extracted from the document
    Author: Smith, date: 30. Aug, location: Rome, company: IBM, etc.

Classified Advertisements (Real Estate)

Background:
  Advertisements are plain text
  Text is the lowest common denominator: the only thing that 70+ newspapers with 20+ publishing systems can all handle

<ADNUM>2067206v1</ADNUM><DATE>March 02, 1998</DATE><ADTITLE>MADDINGTON $89,000</ADTITLE><ADTEXT>OPEN 1.00 - 1.45 U 11 / 10 BERTRAM ST NEW TO MARKET Beautiful 3 brm freestanding villa, close to shops & bus Owner moved to Melbourne ideally suit 1st home buyer, investor & 55 and over. Brian Hazelden 0418 958 996 R WHITE LEEMING 9332 3477</ADTEXT>

Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

‘Change of Address’ email

For email messages that communicate a change of email address:
  Automatically extract the new email address
  Modify address book, etc.

Product information

Product info

This is valuable information that is sold by some companies

How do they get most of it?

Phone calls
Typing

Other applications of IE Systems

Job resumes: BurningGlass, Mohomine
Seminar announcements
Molecular biology information from MEDLINE, e.g., extracting gene-drug interactions from biomedical texts
Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, test results
Gathering earnings, profits, board members, etc. [corporate information] from the web and company reports
Verification of construction industry specifications documents (are the quantities correct/reasonable?)
Extraction of political/economic/business changes from newspaper articles

Why doesn’t text search (IR) work?

What you search for in real estate advertisements:

Location: which suburb. You might think this is easy, but:
  Suburb not mentioned
  Phrases: Only 45 minutes from Parramatta
  Multiple properties in different suburbs in one ad
Money: you want a range, not a textual match
  Multiple amounts: was $155K, now $145K
  Variations: offers in the high 700s [but not rents for $270]
Bedrooms: similar issues (br, bdr, beds, B/R)

Why doesn’t text search (IR) work?

Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor

Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million

Image sensor Total Pixels: Approx. 2.11 million-pixel

Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)

CCD Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V])

Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V])

Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])

These all came off the same manufacturer’s website!!

And this is a very technical domain. Try sofa beds.

Task: Information Extraction

Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources

Identify specific pieces of information in an un-structured or semi-structured textual document.

Transform this unstructured information into structured relations in a database/ontology.

Suppositions:
  A lot of information that could be represented in a structured, semantically clear format isn’t.
  It may be costly, not desired, or not in one’s control (screen scraping) to change this.

Amazon Book Description

….</td></tr></table>The Age of Spiritual Machines : When Computers Exceed Human Intelligence by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a> <a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a>List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

Extracted Book Template
Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
…

Web Extraction

Many web pages are generated automatically from an underlying database.

Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).

However, output is intended for human consumption, not machine interpretation.

An IE system for such generated pages allows the web site to be viewed as a structured database.

An extractor for a semi-structured web site is sometimes referred to as a wrapper.

Process of extracting from such pages is sometimes referred to as screen scraping.

Simple Extraction Patterns

Specify an item to extract for a slot using a regular expression pattern.

Price pattern: “\$\d+(\.\d{2})?\b”
May require a preceding (pre-filler) pattern to identify the proper context.
  Amazon list price:
    Pre-filler pattern: “List Price: ”
    Filler pattern: “\$\d+(\.\d{2})?\b”
May require a succeeding (post-filler) pattern to identify the end of the filler.
  Amazon list price:
    Pre-filler pattern: “List Price: ”
    Filler pattern: “.+”
    Post-filler pattern: “”
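As a rough illustration of the pre-filler / filler / post-filler idea, the sketch below applies Python's `re` module to a plain-text version of the Amazon price line. The helper name `extract_slot` and the post-filler string " Our Price" are assumptions made for this example only.

```python
import re

def extract_slot(text, prefiller, filler, postfiller=""):
    """Return the first filler found between prefiller and postfiller, else None.

    prefiller and postfiller are literal strings; filler is a regex.
    """
    pattern = (re.escape(prefiller)
               + "(?P<filler>" + filler + ")"
               + re.escape(postfiller))
    match = re.search(pattern, text)
    return match.group("filler") if match else None

text = "List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)"

# A specific filler pattern needs only the pre-filler for context.
# (No \b before \$: "$" is not a word character, so a leading \b
# would not match when the "$" follows a space.)
list_price = extract_slot(text, "List Price: ", r"\$\d+(?:\.\d{2})?\b")

# A generic filler needs a post-filler to mark where it ends; ".+?" is
# non-greedy so it stops at the first occurrence of the post-filler.
generic = extract_slot(text, "List Price: ", r".+?", " Our Price")

print(list_price, generic)
```

Both calls should pick out the list price; the second shows why a generic filler such as “.+” is only usable together with a post-filler.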

Simple Template Extraction

Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended. Assumes slots always occur in a fixed order:
  Title
  Author
  List price
  …
Alternatively, make patterns specific enough to identify each filler when always starting from the beginning of the document.
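The in-order strategy can be sketched as follows. This is a minimal illustration, assuming a simplified plain-text book description; `fill_template` and the individual slot patterns are hypothetical names and patterns chosen for the example.

```python
import re

def fill_template(text, slot_patterns):
    """Fill slots in order; each search starts where the previous match ended."""
    filled, pos = {}, 0
    for slot, pattern in slot_patterns:
        match = re.compile(pattern).search(text, pos)
        if match is None:
            break  # fixed slot order assumed: later slots cannot be located
        filled[slot] = match.group("filler")
        pos = match.end()
    return filled

PRICE = r"\$\d+(?:\.\d{2})?"
SLOT_PATTERNS = [
    ("Title", r"(?P<filler>.+?) by "),
    ("Author", r"(?P<filler>[A-Z]\w+ [A-Z]\w+)"),
    ("List-Price", r"List Price: (?P<filler>" + PRICE + ")"),
    ("Price", r"Our Price: (?P<filler>" + PRICE + ")"),
]

text = ("The Age of Spiritual Machines by Ray Kurzweil "
        "List Price: $14.95 Our Price: $11.96")
template = fill_template(text, SLOT_PATTERNS)
print(template)
```

Note that the Author pattern is very loose; it only works here because the search begins right after the title's context, which is the point of extracting slots in order.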

Pre-Specified Filler Extraction

If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
  Job category
  Company type

Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.

Natural Language Processing

If extracting from automatically generated web pages, simple regex patterns usually work.

If extracting from more natural, unstructured, human-written text, some NLP may help.

Part-of-speech (POS) tagging
  Mark each word as a noun, verb, preposition, etc.
Syntactic parsing
  Identify phrases: NP, VP, PP
Semantic word categories (e.g. from WordNet)
  KILL: kill, murder, assassinate, strangle, suffocate
Extraction patterns can use POS or phrase tags.
  Crime victim:
    Prefiller: [POS: V, Hypernym: KILL]
    Filler: [Phrase: NP]

Learning for IE

Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.

Alternative is to use machine learning:
  Build a training set of documents paired with human-produced filled extraction templates.
  Learn extraction patterns for each slot using an appropriate machine learning algorithm.
  The Rapier system learns three regex-style patterns for each slot:
    Pre-filler pattern
    Filler pattern
    Post-filler pattern

Evaluating IE Accuracy

Always evaluate performance on independent, manually-annotated test data not used during system development.
Measure for each test document:
  Total number of correct extractions in the solution template: N
  Total number of slot/value pairs extracted by the system: E
  Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
Compute average value of metrics adapted from IR:
  Recall = C/N
  Precision = C/E
  F-Measure = Harmonic mean of recall and precision
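The per-document computation can be sketched in a few lines. The function name `evaluate` and the sample templates are assumptions for illustration; the harmonic mean of precision P and recall R is 2PR / (P + R).

```python
def evaluate(solution, extracted):
    """Precision, recall and F-measure for one document.

    solution, extracted: sets of (slot, value) pairs.
    """
    n = len(solution)              # N: pairs in the solution template
    e = len(extracted)             # E: pairs extracted by the system
    c = len(solution & extracted)  # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if c else 0.0
    return precision, recall, f_measure

solution = {("author", "Smith"), ("date", "30. Aug"),
            ("location", "Rome"), ("company", "IBM")}
extracted = {("author", "Smith"), ("location", "Rome"), ("company", "Intel")}

precision, recall, f_measure = evaluate(solution, extracted)
print(precision, recall, f_measure)
```

Here N = 4, E = 3 and C = 2, so recall is 2/4 and precision is 2/3. Averaging these values over all test documents gives the reported metrics.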

XML and IE

If relevant documents were all available in standardized XML format, IE would be unnecessary.

But…
  Difficult to develop a universally adopted DTD format for the relevant domain.
  Difficult to manually annotate documents with appropriate XML tags.
  Commercial industry may be reluctant to provide data in easily accessible XML format.
IE provides a way of automatically transforming semi-structured or unstructured data into an XML compatible format.

Web Extraction using DOM Trees

Web extraction may be aided by first parsing web pages into DOM trees.

Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.

May still need regex patterns to identify proper portion of the final CharacterData node.

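Sample DOM Tree Extraction

[Figure: DOM tree rooted at HTML, with element nodes HEADER, BODY, B, FONT and A, and character-data nodes “Age of Spiritual Machines”, “by” and “Ray Kurzweil”. Legend: Element, Character-Data.]

Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData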
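Path-based extraction over a DOM tree can be sketched with Python's standard `xml.etree.ElementTree`. The page string below is a hypothetical, well-formed stand-in for the sample tree; real HTML is rarely valid XML, so an HTML-tolerant parser would be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page mirroring the Title and Author paths.
page = ("<HTML><HEADER/><BODY>"
        "<B>Age of Spiritual Machines</B>"
        "<FONT>by <A>Ray Kurzweil</A></FONT>"
        "</BODY></HTML>")

root = ET.fromstring(page)  # root is the HTML element

# Each extraction pattern is a path from the root to the element whose
# character data holds the filler.
PATHS = {"Title": "BODY/B", "Author": "BODY/FONT/A"}

record = {slot: root.find(path).text for slot, path in PATHS.items()}
print(record)
```

If the target text shared a character-data node with other content (for example the “by ” prefix inside FONT), a regex would still be needed to isolate the filler within that node.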

Shop Bots

One application of web extraction is automated comparison shopping systems.

System must be able to extract information on items (product specs and prices) from multiple web stores.

User queries a single site, which integrates information extracted from multiple web stores and presents overall results to user in a uniform format, e.g. ordered by price.

Several commercial systems: MySimon, Cnet, BookFinder

Shop Bots (cont.)

Construct wrapper for each source web store.
Accept shopping query from user.
For each source web store:
  Submit query to web store.
  Obtain resulting HTML page.
  Extract information from page and store in local DB.
Sort items in resulting DB by price.
Format results into HTML and return result.

Alternative is to extract information from all web stores in advance and store in a uniform global DB for subsequent query processing.
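The query-time shop bot loop can be sketched as below. Everything concrete here is an assumption for illustration: the two store names, their canned result pages (standing in for live HTTP responses), and the regex wrappers. The final HTML formatting step is omitted.

```python
import re

# Hypothetical canned result pages, standing in for live store responses.
STORE_PAGES = {
    "store-a": "<li><b>Age of Spiritual Machines</b> Our Price: $11.96</li>",
    "store-b": "<tr><td>Age of Spiritual Machines</td><td>USD 10.50</td></tr>",
}

# One wrapper per source store; here each wrapper is a regex with named groups.
WRAPPERS = {
    "store-a": re.compile(
        r"<b>(?P<item>[^<]+)</b> Our Price: \$(?P<price>\d+\.\d{2})"),
    "store-b": re.compile(
        r"<td>(?P<item>[^<]+)</td><td>USD (?P<price>\d+\.\d{2})</td>"),
}

def submit_query(store, query):
    """Stand-in for submitting the query to a store and fetching its page."""
    return STORE_PAGES[store]

def shop(query):
    local_db = []
    for store, wrapper in WRAPPERS.items():
        page = submit_query(store, query)
        for match in wrapper.finditer(page):
            local_db.append(
                (float(match.group("price")), match.group("item"), store))
    local_db.sort()  # tuples sort by price first
    return local_db

results = shop("Age of Spiritual Machines")
print(results)
```

The wrappers normalize two different page layouts into the same (price, item, store) record, which is what makes the uniform, price-ordered presentation possible.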

Information Integration

Answering certain questions using the web requires integrating information from multiple web sites.

Information integration concerns methods for automating this integration.

Requires wrappers to accurately extract specific information from web pages from specific sites.

Treat each wrapped site as a database table and answer complex queries using a database query language (e.g. SQL).

Information Integration Example

Question: What is the closest theater to my home where I can see both Monsters Inc. and Harry Potter?

From austin360.com, extract theaters and their addresses where Harry Potter and Monsters Inc. are playing.
Intersect the two to find the theaters playing both.
Query mapquest.com for driving directions from your home address to the address of each of these theaters.
Extract distance and driving instructions for each.
Sort results by driving distance.
Present driving instructions for closest theater.

Knowledge Extraction Vision: Multi-dimensional Meta-data Extraction

[Figure: news sources (NY Times, Andhra Bhoomi, Dinamani, Dainik Jagran) on the example topic “India Bombing” are processed into Meta-Data. Component labels: Topic Discovery, Concept Indexing, Thread Creation, Term Translation, Document Translation, Story Segmentation, Entity Extraction, Fact Extraction.]

EMPLOYEE / EMPLOYER Relationships:
  Jan Clesius works for Clesius Enterprises
  Bill Young works for InterMedia Inc.
COMPANY / LOCATION Relationships:
  Clesius Enterprises is in New York, NY
  InterMedia Inc. is in Boston, MA

Knowledge Extraction Vision

The vision is to have relational metadata associated with each document / web page.
A search engine could then combine the power of information retrieval with the power of an RDBMS.
Some search engine queries that this would enable:
  Find web pages about John Russ
  Find web pages about books authored by Queen Elizabeth
  Find web pages about the person who assassinated Lee Harvey Oswald
Exercise: Why are these hard queries for current search engine technology?

What is an HMM?

Graphical Model Representation: Variables by time
Circles indicate states
Arrows indicate probabilistic dependencies between states

What is an HMM?

Green circles are hidden states
Dependent only on the previous state: Order-1 Markov process
“The past is independent of the future given the present.”

What is an HMM?

Purple nodes are observed states
Dependent only on their corresponding hidden state

HMM Formalism

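$\{S, K, \Pi, A, B\}$
$\Pi = \{\pi_i\}$ are the initial state probabilities
$A = \{a_{ij}\}$ are the state transition probabilities
$B = \{b_{ik}\}$ are the observation state probabilities

[Figure: HMM graph with hidden states S linked by transitions A; each hidden state emits an observation K with probability B.]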

Applying HMMs to IE

Document: generated by a stochastic process modelled by an HMM
Token: word
State: “reason/explanation” for a given token
  ‘Background’ state emits tokens like ‘the’, ‘said’, …
  ‘Money’ state emits tokens like ‘million’, ‘euro’, …
  ‘Organization’ state emits tokens like ‘university’, ‘company’, …
Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.
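A minimal sketch of Viterbi decoding for this kind of tagger follows. The three states come from the slide, but all probabilities are toy numbers invented for the example, and the small `floor` probability for unseen tokens is an assumed stand-in for proper smoothing.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely hidden state sequence for tokens (log-space dynamic program)."""
    def emit(state, token):
        return math.log(emit_p[state].get(token, floor))

    # delta[s]: log-probability of the best path that ends in state s.
    delta = {s: math.log(start_p[s]) + emit(s, tokens[0]) for s in states}
    backpointers = []
    for token in tokens[1:]:
        new_delta, pointer = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda r: delta[r] + math.log(trans_p[r][s]))
            new_delta[s] = (delta[best_prev] + math.log(trans_p[best_prev][s])
                            + emit(s, token))
            pointer[s] = best_prev
        backpointers.append(pointer)
        delta = new_delta

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path

STATES = ["Background", "Money", "Organization"]
START = {"Background": 0.8, "Money": 0.1, "Organization": 0.1}
TRANS = {
    "Background": {"Background": 0.8, "Money": 0.1, "Organization": 0.1},
    "Money": {"Background": 0.5, "Money": 0.4, "Organization": 0.1},
    "Organization": {"Background": 0.5, "Money": 0.1, "Organization": 0.4},
}
EMIT = {
    "Background": {"the": 0.5, "said": 0.3, "paid": 0.2},
    "Money": {"million": 0.6, "euro": 0.4},
    "Organization": {"university": 0.5, "company": 0.5},
}

tokens = ["the", "company", "paid", "million", "euro"]
tags = viterbi(tokens, STATES, START, TRANS, EMIT)
print(list(zip(tokens, tags)))
```

Extraction then amounts to reading off the tokens whose decoded state is a target state, e.g. the run of ‘Money’ tokens.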

Create Bibliographic Entry

Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, pages 237-285, May 1996.

Bibliographic Entry Headers

HMM for research papers: emissions [Seymore et al., 99]

States shown: author, title, institution, note, ...

Trained on 2 million words of BibTeX data from the Web

Example emissions shown in the figure:
  ICML 1997... submission to… to appear in…
  stochastic optimization... reinforcement learning… model building mobile robot...
  carnegie mellon university… university of california… dartmouth college
  supported in part… copyright...

HMM for research papers: transitions [Seymore et al., 99]

Inference for an HMM

Analyze a sequence: Compute the probability of a given observation sequence

Applying the model: Given an observation sequence, compute the most likely hidden state sequence

Learning the model: Given an observation sequence and set of possible models, which model most closely fits the data?

Sequence Probability

Given an observation sequence and a model, compute the probability of the observation sequence:

$O = (o_1, \ldots, o_T), \quad \mu = (A, B, \Pi)$

Compute $P(O \mid \mu)$

[Figure: observation sequence o_1 … o_{t-1}, o_t, o_{t+1} … o_T]
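One standard way to compute this probability is the forward algorithm, sketched below. The slides do not show the procedure at this point, so this is an illustrative choice; the two-state model and its numbers are invented for the example.

```python
from itertools import product

def sequence_probability(observations, states, pi, A, B):
    """P(O | model) by the forward algorithm.

    pi[s]: initial probability, A[r][s]: transition r -> s,
    B[s][o]: probability that state s emits observation o.
    """
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = {s: pi[s] * B[s].get(observations[0], 0.0) for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[r] * A[r][s] for r in states) * B[s].get(obs, 0.0)
            for s in states
        }
    return sum(alpha.values())

STATES = ["X", "Y"]
PI = {"X": 0.6, "Y": 0.4}
A = {"X": {"X": 0.7, "Y": 0.3}, "Y": {"X": 0.4, "Y": 0.6}}
B = {"X": {"a": 0.5, "b": 0.5}, "Y": {"a": 0.1, "b": 0.9}}

p_ab = sequence_probability(["a", "b"], STATES, PI, A, B)

# Sanity check: probabilities of all length-2 sequences should sum to 1.
total = sum(sequence_probability(list(seq), STATES, PI, A, B)
            for seq in product("ab", repeat=2))
print(p_ab, total)
```

The recursion costs on the order of T·N² operations for T observations and N states, instead of summing over all N^T state sequences.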

Learning HMMs

Good news: If training data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of transition/emission probabilities. Easy. (Use smoothing to avoid zero probs for emissions/transitions absent in the training data.)
Great news: Baum-Welch algorithm trains an HMM using partially labeled or unlabelled training data.
Bad news: How many states should the HMM contain? How are transitions constrained?
  Only semi-good answers to finding the answer automatically
  Insufficiently expressive: unable to model important distinctions (long distance correlations, other features)
  Overly expressive: sparse training data, overfitting

Learning: Supervised vs Unsupervised

Supervised
  If you have a training set
  Computation of parameters is simple (but need to use smoothing)
Unsupervised
  EM / Forward-Backward
  Usually need to start from a model trained in a supervised manner
  Unlabeled data can further improve a good model
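The supervised case can be sketched as frequency counting with smoothing. Add-k (Laplace) smoothing is an assumption here, since the slides do not say which smoothing method is meant; the function name and the tiny tagged document are likewise illustrative.

```python
from collections import Counter

def train_supervised(tagged_docs, states, vocab, k=1.0):
    """Estimate transition (A) and emission (B) probabilities from tagged tokens.

    tagged_docs: list of documents, each a list of (token, state) pairs.
    k: add-k smoothing constant, so unseen events keep non-zero probability.
    """
    trans, emit = Counter(), Counter()
    from_count, state_count = Counter(), Counter()
    for doc in tagged_docs:
        for i, (token, state) in enumerate(doc):
            emit[(state, token)] += 1
            state_count[state] += 1
            if i + 1 < len(doc):
                trans[(state, doc[i + 1][1])] += 1
                from_count[state] += 1
    A = {r: {s: (trans[(r, s)] + k) / (from_count[r] + k * len(states))
             for s in states} for r in states}
    B = {s: {w: (emit[(s, w)] + k) / (state_count[s] + k * len(vocab))
             for w in vocab} for s in states}
    return A, B

states = ["bg", "org", "money"]
vocab = ["the", "company", "paid", "million"]
docs = [[("the", "bg"), ("company", "org"), ("paid", "bg"), ("million", "money")]]

A, B = train_supervised(docs, states, vocab)
print(A["bg"], B["money"])
```

With k = 0 this reduces to the plain frequency ratios; the smoothed estimates still sum to 1 in every row.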

Statistical generative models

Rapier uses explicit extraction patterns/rules
Hidden Markov Models are a powerful alternative based on statistical token sequence generation models rather than explicit extraction patterns.
Pros:
  Well-understood underlying statistical model makes it easy to use a wide range of tools from statistical decision theory
  Portable, broad coverage, robust, good recall
Cons:
  Range of features and patterns usable is limited
    Memory of 1 for customarily used HMMs
  Not necessarily as good for complex multi-slot patterns

More About Information Extraction

Three generations of IE systems

Hand-Built Systems – Knowledge Engineering [1980s– ]
  Rules written by hand
  Require experts who understand both the systems and the domain
  Iterative guess-test-tweak-repeat cycle
Automatic, Trainable Rule-Extraction Systems [1990s– ]
  Rules discovered automatically using predefined templates and methods like ILP
  Require huge, labeled corpora (effort is just moved!)
Statistical Generative Models [1997 – ]
  One decodes the statistical model to find which bits of the text were relevant, using HMMs or statistical parsers
  Learning usually supervised; may be partially unsupervised

Trainable IE systems

Pros
  Annotating text is simpler & faster than writing rules.
  Domain independent
  Domain experts don’t need to be linguists or programmers.
  Learning algorithms ensure full coverage of examples.
Cons
  Hand-crafted systems perform better, especially at hard tasks.
  Training data might be expensive to acquire
  May need huge amount of training data
  Hand-writing rules isn’t that hard!!

MUC: the genesis of IE

DARPA funded significant efforts in IE in the early to mid 1990’s.

Message Understanding Conference (MUC) was an annual event/competition where results were presented.

Focused on extracting information from news articles:

Terrorist events
Industrial joint ventures
Company management changes

Information extraction of particular interest to the intelligence community (CIA, NSA).

Natural Language Processing

If extracting from automatically generated web pages, simple regex patterns usually work.
If extracting from more natural, unstructured, human-written text, some NLP helps.
  Part-of-speech (POS) tagging
    Mark each word as a noun, verb, preposition, etc.
  Syntactic parsing
    Identify phrases: NP, VP, PP
  Semantic word categories (e.g. from WordNet)
    PRICE: price, amount, cost, …
Extraction patterns can use POS or phrase tags.
  Company location
    Prefiller: [POS: IN, Word: “in”]
    Postfiller: [POS:, Semantic Category: US State]

What about XML?

Don’t XML, RDF, OIL, SHOE, DAML, XSchema, the Semantic Web … obviate the need for information extraction?!??!
Yes:
  IE is sometimes used to “reverse engineer” HTML database interfaces; extraction would be much simpler if XML were exported instead of HTML.
  Ontology-aware editors will make it easier to enrich content with metadata.
No:
  Terabytes of legacy HTML.
  Data consumers forced to accept ontological decisions of data providers (e.g., <NAME>John Smith</NAME> vs. <NAME first="John" last="Smith"/>).
  Will you annotate every email you send? Every memo you write? Every photograph you scan?

Evaluating IE Accuracy

Always evaluate performance on independent, manually-annotated test data not used during system development.

Example: extract job ads from web
Measure for each test document:
  Total number of job titles occurring in the test data: N
  Total number of job title candidates extracted by the system: E
  Number of extracted job titles that are correct (i.e. in the solution template): C
Compute average value of metrics adapted from IR:
  Recall = C/N
  Precision = C/E
  F-Measure = Harmonic mean of recall and precision

MUC Information Extraction: State of the Art c. 1997

NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production

Take Away

Information extraction (IE) vs. text classification
  Text classification assigns a class to the entire doc
  Information extraction
    Extracts phrases from a document
    Classifies the function of the phrase (author, title, …)
We’ve looked at the “fragment extraction” task. Future?
  Better ways of using domain knowledge
  More NLP, e.g. syntactic parsing
  Information extraction beyond fragment extraction: anaphora resolution, discourse processing, ...
Fragment extraction is good enough for many Web information services!

Take Away (2)

Learning IE extractors with HMMs
  HMMs are generative models
  Training: Forward/Backward; Application: Viterbi
  HMMs can be trained on unlabeled text
    In practice, labeled text is usually needed
    Indirect / partial labels are key
  HMM topology can also be learned
Applications: What exactly is IE good for?
  Is there a use for today’s “60%” results?
    67% in recent KDD Cup
  90% accurate IE could be a revolutionizing technology

Good Basic IE References

Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/.

Kushmerick, Weld, Doorenbos: Wrapper Induction for Information Extraction, IJCAI 1997. http://www.cs.ucd.ie/staff/nick/.

Stephen Soderland: Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272 (1999)

More References

Mary Elaine Califf and Raymond J. Mooney: Relational Learning of Pattern-Match Rules for Information Extraction. In AAAI 1999: 328-334.

Leek, T. R. 1997, Information Extraction using Hidden Markov Models, Master’s thesis, UCSD

Bikel, D. M.; Miller, S; Schwartz, R.; and Weischedel, R. 1997, Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, 194-201. [Also in MLJ 1999]

Kristie Seymore, Andrew McCallum, Ronald Rosenfeld, 1999, Learning Hidden Markov Model Structure for Information Extraction, In Proceedings of the AAAI-99 Workshop on ML for IE.

Dayne Freitag and Andrew McCallum, 2000, Information Extraction with HMM Structures Learned by Stochastic Optimization. AAAI-2000.

Rapier: A Rule-Based System

Machine Learning Approach

Motivation: Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.

Alternative is to use machine learning:
  Build a training set of documents paired with human-produced filled extraction templates.
  Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Automatic Pattern-Learning Systems

Pros:
  Portable across domains
  Tend to have broad coverage
  Automatically finds appropriate patterns
  System knowledge not needed by those who supply the domain knowledge.
Cons:
  We need annotated training data, and lots of it
  Isn’t necessarily better or cheaper than hand-built sol’n
Examples: Riloff et al., AutoSlog (UMass); Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas): learn lexico-syntactic patterns from templates

[Figure: the Trainer takes Language Input paired with Answers and produces a Model; the Decoder applies the Model to new Language Input to produce Answers.]

Rapier [Califf & Mooney, AAAI-99]

Rapier learns three regex-style patterns for each slot:
  Pre-filler pattern
  Filler pattern
  Post-filler pattern

One of several recent trainable IE systems that incorporate linguistic constraints. (See also: SIFT [Miller et al, MUC-7]; SRV [Freitag, AAAI-98]; Whisk [Soderland, MLJ-99].)

RAPIER rules for extracting “transaction price”

“…paid $11M for the company…”
“…sold to the bank for an undisclosed amount…”
“…paid Honeywell an undisclosed price…”

Part-of-speech tags & Semantic classes

Part of speech: syntactic role of a specific word
  noun (nn), proper noun (nnp), adjective (jj), adverb (rb), determiner (dt), verb (vb), “.” (“.”), …
  NLP: Well-known algorithms for automatically assigning POS tags to English, French, Japanese, … (>95% accuracy)
Semantic Classes: Synonyms or other related words
  “Price” class: price, cost, amount, …
  “Month” class: January, February, March, …, December
  “US State” class: Alaska, Alabama, …, Washington, Wyoming
  WordNet: large on-line thesaurus containing (among other things) semantic classes

Rapier rule matching example

“…sold to the bank for an undisclosed amount…”
POS:    vb pr det nn pr det jj nn
SClass: price (on “amount”)

“…paid Honeywell an undisclosed price…”
POS:    vb nnp det jj nn
SClass: price (on “price”)

Rapier Rules: Details

Rapier rule :=
  pre-filler pattern
  filler pattern
  post-filler pattern
pattern := subpattern +
subpattern := constraint +
constraint :=
  Word - exact word that must be present
  Tag - matched word must have given POS tag
  Class - semantic class of matched word
  Can specify disjunction with “{…}”
  List length N - between 0 and N words satisfying other constraints
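To make the constraint types concrete, here is a minimal matcher for patterns built from word, tag, and semantic-class constraints, with disjunctions and bounded-length lists. It is a sketch and not Rapier's implementation: the token representation, the function names, and the example pattern (a determiner, up to two adjectives, then a word of class “price”) are all assumptions.

```python
def satisfies(token, constraint):
    """token: dict with 'word', 'tag', 'sclass'. constraint: dict mapping some
    of those keys to a set of allowed values (several values = disjunction)."""
    return all(token.get(key) in allowed for key, allowed in constraint.items())

def match_from(tokens, start, pattern):
    """Return the end index if pattern matches tokens at start, else None.

    pattern: list of (constraint, max_len). max_len None means exactly one
    token; an integer N means a list of 0..N tokens satisfying the constraint.
    """
    if not pattern:
        return start
    (constraint, max_len), rest = pattern[0], pattern[1:]
    if max_len is None:
        if start < len(tokens) and satisfies(tokens[start], constraint):
            return match_from(tokens, start + 1, rest)
        return None
    for used in range(max_len + 1):
        end = match_from(tokens, start + used, rest)
        if end is not None:
            return end
        position = start + used
        if position >= len(tokens) or not satisfies(tokens[position], constraint):
            return None
    return None

def find(tokens, pattern):
    """Text of the first span that matches pattern, or None."""
    for start in range(len(tokens)):
        end = match_from(tokens, start, pattern)
        if end is not None:
            return " ".join(t["word"] for t in tokens[start:end])
    return None

def annotate(words, tags, sclasses):
    return [{"word": w, "tag": t, "sclass": s}
            for w, t, s in zip(words, tags, sclasses)]

s1 = annotate("sold to the bank for an undisclosed amount".split(),
              "vb pr det nn pr det jj nn".split(), [None] * 7 + ["price"])
s2 = annotate("paid Honeywell an undisclosed price".split(),
              "vb nnp det jj nn".split(), [None] * 4 + ["price"])

PATTERN = [({"tag": {"det"}}, None),
           ({"tag": {"jj"}}, 2),
           ({"sclass": {"price"}}, None)]

print(find(s1, PATTERN), "|", find(s2, PATTERN))
```

Because the last element constrains the semantic class rather than the word, the same pattern covers both “amount” and “price”.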

Rapier’s Learning Algorithm

Input: set of training examples (list of documents annotated with “extract this substring”)

Output: set of rules

Init: Rules = a rule that exactly matches each training example

Repeat several times:
  Seed: Select M examples randomly and generate the K most-accurate maximally-general filler-only rules (prefiller = postfiller = “true”).
  Grow: Repeat for N = 1, 2, 3, …: try to improve the K best rules by adding N words of prefiller or postfiller context.
  Keep: Rules = Rules ∪ the best of the K rules – subsumed rules

Learning example (one iteration)

2 examples:
  ‘… located in Atlanta, Georgia…’
  ‘… offices in Kansas City, Missouri…’

Init: maximally specific rules (high precision, low recall)
Seed: maximally general rules (low precision, high recall)
Grow: appropriately general rule (high precision, high recall)

Rapier

Conceptually simple
But powerful: it is easy to incorporate linguistic and other constraints
Advantages of rules
  Easy to understand (but can be deceptive)
  One can “trace” the extraction of a piece of information step by step
  Rules can be edited manually.
Disadvantages
  Difficult to incorporate quantitative / probabilistic constraints
    “with offices in San Francisco in Northern California”
