Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14.
-
Upload
miles-lester -
Category
Documents
-
view
216 -
download
2
Transcript of WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14.
WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 14
Today’s Topic: Information Extraction
Information extraction: Intro Hidden markov models Evaluation and outlook
Text classification vs. information extraction
?TC
IE
Information Extraction: Definition
Given Unstructured text
or slightly structured such as html A template with “slots”
Common slots: author, date, location, company
Information extraction task Analyze document Fill template slots with values extracted
from document Author: Smith, date: 30. Aug, location: Rome,
company: IBM etc.
Classified Advertisements (Real Estate)
Background: Advertisements
are plain text Text is lowest
common denominator: only thing that 70+ newspapers with 20+ publishing systems can all handle
<ADNUM>2067206v1</ADNUM><DATE>March 02, 1998</DATE><ADTITLE>MADDINGTON
$89,000</ADTITLE><ADTEXT>OPEN 1.00 - 1.45<BR>U 11 / 10 BERTRAM ST<BR> NEW TO MARKET Beautiful<BR>3 brm freestanding<BR>villa, close to shops & bus<BR>Owner moved to Melbourne<BR> ideally suit 1st home buyer,<BR> investor & 55 and over.<BR>Brian Hazelden 0418 958 996<BR> R WHITE LEEMING 9332 3477</ADTEXT>
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
‘Change of Address’ email
Modify addressbook, etc.
• For email messages that communicate a change of email address:• Automatically extract the new email
Product information
Product info
This is valuable information that is sold by some companies
How do they get most of it?
Phone calls Typing
Other applications of IE Systems
Job resumes: BurningGlass, Mohomine Seminar announcements Molecular biology information from MEDLINE, e.g,
Extracting gene drug interactions from biomed texts Summarizing medical patient records by extracting
diagnoses, symptoms, physical findings, test results. Gathering earnings, profits, board members, etc.
[corporate information] from web, company reports Verification of construction industry specifications
documents (are the quantities correct/reasonable?) Extraction of political/economic/business changes
from newspaper articles
Why doesn’t text search (IR) work?
What you search for in real estate advertisements:
Location: which Suburb. You might think easy, but: Suburb not mentioned Phrases: Only 45 minutes from Parramatta Multiple properties in different suburbs in one ad
Money: want a range not a textual match Multiple amounts: was $155K, now $145K Variations: offers in the high 700s [but not rents
for $270] Bedrooms: similar issues (br, bdr, beds, B/R)
Why doesn’t text search (IR) work?
Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million
Image sensor Total Pixels: Approx. 2.11 million-pixel Imaging sensor Total Pixels: Approx. 2.11 million 1,688
(H) x 1,248 (V) CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560
[V] ) Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] ) Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] )
These all came off the same manufacturer’s website!!
And this is a very technical domain. Try sofa beds.
Task: Information Extraction
Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources
Identify specific pieces of information in an un-structured or semi-structured textual document.
Transform this unstructured information into structured relations in a database/ontology.
Suppositions: A lot of information that could be represented in
a structured semantically clear format isn’t It may be costly, not desired, or not in one’s
control (screen scraping) to change this.
Amazon Book Description….</td></tr></table><b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br><font face=verdana,arial,helvetica size=-1>by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a><br></font><br><a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a><font face=verdana,arial,helvetica size=-1><span class="small"><span class="small"><b>List Price:</b> <span class=listprice>$14.95</span><br><b>Our Price: <font color=#990000>$11.96</font></b><br><b>You Save:</b> <font color=#990000><b>$2.99 </b>(20%)</font><br></span><p> <br>
….</td></tr></table><b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br><font face=verdana,arial,helvetica size=-1>by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a><br></font><br><a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a><font face=verdana,arial,helvetica size=-1><span class="small"><span class="small"><b>List Price:</b> <span class=listprice>$14.95</span><br><b>Our Price: <font color=#990000>$11.96</font></b><br><b>You Save:</b> <font color=#990000><b>$2.99 </b>(20%)</font><br></span><p> <br>…
Extracted Book TemplateTitle: The Age of Spiritual Machines : When Computers Exceed Human IntelligenceAuthor: Ray KurzweilList-Price: $14.95Price: $11.96::
Web Extraction
Many web pages are generated automatically from an underlying database.
Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).
However, output is intended for human consumption, not machine interpretation.
An IE system for such generated pages allows the web site to be viewed as a structured database.
An extractor for a semi-structured web site is sometimes referred to as a wrapper.
Process of extracting from such pages is sometimes referred to as screen scraping.
Simple Extraction Patterns
Specify an item to extract for a slot using a regular expression pattern.
Price pattern: “\b\$\d+(\.\d{2})?\b” May require preceding (pre-filler) pattern to
identify proper context. Amazon list price:
Pre-filler pattern: “<b>List Price:</b> <span class=listprice>” Filler pattern: “\$\d+(\.\d{2})?\b”
May require succeeding (post-filler) pattern to identify the end of the filler.
Amazon list price: Pre-filler pattern: “<b>List Price:</b> <span class=listprice>” Filler pattern: “.+” Post-filler pattern: “</span>”
Simple Template Extraction
Extract slots in order, starting the search for the filler of the n+1 slot where the filler for the nth slot ended. Assumes slots always in a fixed order.
Title Author List price …
Make patterns specific enough to identify each filler always starting from the beginning of the document.
Pre-Specified Filler Extraction
If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot. Job category Company type
Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
Natural Language Processing
If extracting from automatically generated web pages, simple regex patterns usually work.
If extracting from more natural, unstructured, human-written text, some NLP may help.
Part-of-speech (POS) tagging Mark each word as a noun, verb, preposition, etc.
Syntactic parsing Identify phrases: NP, VP, PP
Semantic word categories (e.g. from WordNet) KILL: kill, murder, assassinate, strangle, suffocate
Extraction patterns can use POS or phrase tags. Crime victim:
Prefiller: [POS: V, Hypernym: KILL] Filler: [Phrase: NP]
Learning for IE
Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
Alternative is to use machine learning: Build a training set of documents paired with human-
produced filled extraction templates. Learn extraction patterns for each slot using an
appropriate machine learning algorithm. Rapier system learns three regex-style patterns for
each slot: Pre-filler pattern Filler pattern Post-filler pattern
Evaluating IE Accuracy Always evaluate performance on independent,
manually-annotated test data not used during system development.
Measure for each test document: Total number of correct extractions in the solution
template: N Total number of slot/value pairs extracted by the
system: E Number of extracted slot/value pairs that are correct
(i.e. in the solution template): C Compute average value of metrics adapted from IR:
Recall = C/N Precision = C/E F-Measure = Harmonic mean of recall and precision
XML and IE
If relevant documents were all available in standardized XML format, IE would be unnecessary.
But… Difficult to develop a universally adopted DTD format
for the relevant domain. Difficult to manually annotate documents with
appropriate XML tags. Commercial industry may be reluctant to provide data
in easily accessible XML format. IE provides a way of automatically transforming semi-
structured or unstructured data into an XML compatible format.
Web Extraction using DOM Trees
Web extraction may be aided by first parsing web pages into DOM trees.
Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.
May still need regex patterns to identify proper portion of the final CharacterData node.
Sample DOM Tree ExtractionHTML
BODY
FONTB
Age of Spiritual Machines
Ray Kurzweil
Element
Character-DataHEADER
by A
Title: HTMLBODYBCharacterDataAuthor: HTML BODYFONTA CharacterData
Shop Bots
One application of web extraction is automated comparison shopping systems.
System must be able to extract information on items (product specs and prices) from multiple web stores.
User queries a single site, which integrates information extracted from multiple web stores and presents overall results to user in a uniform format, e.g. ordered by price.
Several commercial systems: MySimon Cnet BookFinder
Shop Bots (cont.)
Construct wrapper for each source web store.Accept shopping query from user.For each source web store: Submit query to web store. Obtain resulting HTML page. Extract information from page and store in local DB.Sort items in resulting DB by price.Format results into HTML and return result.
Alternative is to extract information from all web stores in advance and store in a uniform global DB for subsequent query processing.
Information Integration
Answering certain questions using the web requires integrating information from multiple web sites.
Information integration concerns methods for automating this integration.
Requires wrappers to accurately extract specific information from web pages from specific sites.
Treat each wrapped site as a database table and answer complex queries using a database query language (e.g. SQL).
Information Integration Example
Question: What is the closest theater to my home where I can see both Monsters Inc. and Harry Potter?
From austin360.com, extract theaters and their addresses where Harry Potter and Monster’s Inc. are playing.
Intersect the two to find the theaters playing both. Query mapquest.com for driving directions from your
home address to the address of each of these theaters.
Extract distance and driving instructions for each. Sort results by driving distance. Present driving instructions for closest theater.
Knowledge Extraction VisionMulti-
dimensional Meta-data Extraction J F M A M J J A
EMPLOYEE / EMPLOYER Relationships:Jan Clesius works for Clesius EnterprisesBill Young works for InterMedia Inc.COMPANY / LOCATION Relationshis:Clesius Enterprises is in New York, NYInterMedia Inc. is in Boston, MA
Meta-Data
India Bombing NY Times Andhra Bhoomi Dinamani Dainik Jagran
Topic Discovery
Concept Indexing
Thread Creation
Term Translation
Document Translation
Story Segmentation
Entity Extraction
Fact Extraction
Knowledge Extraction Vision The vision is to have relational metadata
associated with each document / web page. A search engine could then combine the power of
information retrieval with the power of an RDBMS. Some search engine queries that this would
enable: Find web pages about John Russ Find web pages about books authored by Queen
Elizabeth Find web pages about the person who assassinated
Lee Harvey Oswald Exercise: Why are these hard queries for current
search engine technology?
HMMs
What is an HMM?
Graphical Model Representation: Variables by time Circles indicate states Arrows indicate probabilistic dependencies
between states
What is an HMM?
Green circles are hidden states Dependent only on the previous state: Order-1
Markov process “The past is independent of the future given the
present.”
What is an HMM?
Purple nodes are observed states Dependent only on their corresponding hidden
state
HMM Formalism
{S, K, are the initial state probabilities A = {aij} are the state transition probabilities B = {bik} are the observation state probabilities
A
B
AAA
BB
SSS
KKK
S
K
S
K
Applying HMMs to IE
Document generated by a stochastic process modelled by an HMM
Token word State “reason/explanation” for a given token
‘Background’ state emits tokens like ‘the’, ‘said’, … ‘Money’ state emits tokens like ‘million’, ‘euro’, … ‘Organization’ state emits tokens like ‘university’,
‘company’, … Extraction: via the Viterbi algorithm, a dynamic
programming technique for efficiently computing the most likely sequence of states that generated a document.
Create Bibliographic Entry
Leslie Pack Kaelbling, Michael L. Littman
and Andrew W. Moore. Reinforcement
Learning: A Survey. Journal of Artificial
Intelligence Research, pages 237-285,
May 1996.
Bibliographic Entry Headers
HMM for research papers: emissions [Seymore et al., 99]
author title institution
Trained on 2 million words of BibTeX data from the Web
...note
ICML 1997...submission to…to appear in…
stochastic optimization...reinforcement learning…model building mobile robot...
carnegie mellon university…university of californiadartmouth college
supported in part…copyright...
HMM for research papers: transitions [Seymore et al., 99]
Inference for an HMM
Analyze a sequence: Compute the probability of a given observation sequence
Applying the model: Given an observation sequence, compute the most likely hidden state sequence
Learning the model: Given an observation sequence and set of possible models, which model most closely fits the data?
)|( Compute
),,( ,),...,( 1
OP
BAooO T
oTo1 otot-1 ot+1
Given an observation sequence and a model, compute the probability of the observation sequence
Sequence Probability
Learning HMMs Good news: If training data tokens are tagged with their
generating states, then simple frequency ratios are a maximum-likelihood estimate of transition/emission probabilities. Easy. (Use smoothing to avoid zero probs for emissions/transitions absent in the training data.)
Great news: Baum-Welch algorithm trains an HMM using partially labeled or unlabelled training data.
Bad news: How many states should the HMM contain? How are transitions constrained?
Only semi-good answers to finding answer automatically Insufficiently expressive Unable to model important
distinctions (long distance correlations, other features) Overly expressive sparse training data, overfitting
Learning: Supervised vs Unsupervised
Supervised If you have a training set Computation of parameters is simple (but
need to use smoothing) Unsupervised
EM / Forward-Backward Usually need to start from a model trained
in a supervised manner Unlabeled data can further improve a good
model
Statistical generative models
Rapier uses explicit extraction patterns/rules Hidden Markov Models are a powerful alternative based
on statistical token sequence generation models rather than explicit extraction patterns.
Pros: Well-understood underlying statistical model makes it easy
to use wide range of tools from statistical decision theory Portable, broad coverage, robust, good recall
Cons: Range of features and patterns usable is limited
Memory of 1 for customarily used HMMs Not necessarily as good for complex multi-slot patterns
More About Information Extraction
Three generations of IE systems
Hand-Built Systems – Knowledge Engineering [1980s– ] Rules written by hand Require experts who understand both the systems and the
domain Iterative guess-test-tweak-repeat cycle
Automatic, Trainable Rule-Extraction Systems [1990s– ] Rules discovered automatically using predefined templates
and methods like ILP Require huge, labeled corpora (effort is just moved!)
Statistical Generative Models [1997 – ] One decodes the statistical model to find which bits of the
text were relevant, using HMMs or statistical parsers Learning usually supervised; may be partially unsupervised
Trainable IE systems
Pros Annotating text is
simpler & faster than writing rules.
Domain independent Domain experts don’t
need to be linguists or programmers.
Learning algorithms ensure full coverage of examples.
Cons Hand-crafted systems
perform better, especially at hard tasks.
Training data might be expensive to acquire
May need huge amount of training data
Hand-writing rules isn’t that hard!!
MUC: the genesis of IE
DARPA funded significant efforts in IE in the early to mid 1990’s.
Message Understanding Conference (MUC) was an annual event/competition where results were presented.
Focused on extracting information from news articles:
Terrorist events Industrial joint ventures Company management changes
Information extraction of particular interest to the intelligence community (CIA, NSA).
Natural Language Processing If extracting from automatically generated web
pages, simple regex patterns usually work. If extracting from more natural, unstructured,
human-written text, some NLP helps. Part-of-speech (POS) tagging
Mark each word as a noun, verb, preposition, etc. Syntactic parsing
Identify phrases: NP, VP, PP Semantic word categories (e.g. from WordNet)
PRICE: price, amount, cost, … Extraction patterns can use POS or phrase tags.
Company location Prefiller: [POS: IN, Word: “in”] Postfiller: [POS:, Semantic Category: US State]
What about XML? Don’t XML, RDF, OIL, SHOE, DAML, XSchema, the
Semantic Web … obviate the need for information extraction?!??!
Yes: IE is sometimes used to “reverse engineer” HTML
database interfaces; extraction would be much simpler if XML were exported instead of HTML.
Ontology-aware editors will make it easier to enrich content with metadata.
No: Terabytes of legacy HTML. Data consumers forced to accept ontological decisions
of data providers (eg, <NAME>John Smith</NAME> vs.<NAME first="John" last="Smith"/> ).
Will you annotate every email you send? Every memo you write? Every photograph you scan?
Evaluating IE Accuracy
Always evaluate performance on independent, manually-annotated test data not used during system development.
Example: extract job ads from web Measure for each test document:
Total number of job titles occurring in the test data: N Total number of job title candidates extracted by the
system: E Number of extracted job titles that are correct (i.e. in the
solution template): C Compute average value of metrics adapted from IR:
Recall = C/N Precision = C/E F-Measure = Harmonic mean of recall and precision
MUC Information Extraction:State of the Art c. 1997
NE – named entity recognitionCO – coreference resolutionTE – template element constructionTR – template relation constructionST – scenario template production
Take Away
Information extraction (IE) vs Text classification Text classification assigns a class to entire doc Information extraction
Extracts phrases from a document Classifies the function of the phrase (author, title,…)
We’ve looked at the “fragment extraction” task. Future?
Better ways of using domain knowledge More NLP, e.g. syntactic parsing
Information extraction beyond fragment extraction: Anaphora resolution, discourse processing, ... Fragment extraction is good enough for many Web
information services!
Take Away (2)
Learning IE extractors with HMMs HMMs are generative models Training: Forward/Backward; Application:
Viterbi HMMs can be trained on unlabeled text In practice, labeled text is usually needed Indirect / partial labels are key HMM topology can also be learned
Applications: What exactly is IE good for? Is there a use for today’s “60%” results?
67% in recent KDD Cup 90% accurate IE could be a revolutionizing
technology
Good Basic IE References
Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/.
Kushmerick, Weld, Doorenbos: Wrapper Induction for Information Extraction,IJCAI 1997. http://www.cs.ucd.ie/staff/nick/.
Stephen Soderland: Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272 (1999)
More References
Mary Elaine Califf and Raymond J. Mooney: Relational Learning of Pattern-Match Rules for Information Extraction. In AAAI 1999: 328-334.
Leek, T. R. 1997, Information Extraction using Hidden Markov Models, Master’s thesis, UCSD
Bikel, D. M.; Miller, S; Schwartz, R.; and Weischedel, R. 1997, Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, 194-201. [Also in MLJ 1999]
Kristie Seymore, Andrew McCallum, Ronald Rosenfeld, 1999, Learning Hidden Markov Model Structure for Information Extraction, In Proceedings if the AAAI-99 Workshop on ML for IE.
Dayne Freitag and Andrew McCallum, 2000, Information Extraction with HMM Structures Learned by Stochastic Optimization. AAAI-2000.
Rapier: A Rule-Based System
Machine Learning Approach
Motivation: Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
Alternative is to use machine learning: Build a training set of documents paired
with human-produced filled extraction templates.
Learn extraction patterns for each slot using an appropriate machine learning algorithm.
Automatic Pattern-Learning Systems
Pros: Portable across domains Tend to have broad coverage Automatically finds appropriate patterns System knowledge not needed by those who supply the
domain knowledge. Cons:
We need annotated training data, and lots of it Isn’t necessarily better or cheaper than hand-built sol’n
Examples: Riloff et al., AutoSlog (UMass); Soderland WHISK (UMass); Mooney et al. Rapier (UTexas):
learn lexico-syntactic patterns from templates
Trainer
Decoder
Model
LanguageInput
Answers
AnswersLanguageInput
Rapier [Califf & Mooney, AAAI-99]
Rapier learns three regex-style patterns for each slot:Pre-filler pattern Filler pattern Post-filler pattern
One of several recent trainable IE systems that incorporate linguistic constraints. (See also: SIFT [Miller et al, MUC-7]; SRV [Freitag, AAAI-98]; Whisk [Soderland, MLJ-99].)
RAPIER rules for extracting “transaction price”
“…paid $11M for the company…”“…sold to the bank for an undisclosed
amount…”“…paid Honeywell an undisclosed price…”
Part-of-speech tags & Semantic classes
Part of speech: syntactic role of a specific word noun (nn), proper noun (nnp), adjectve (jj), adverb (rb),
determiner (dt), verb (vb), “.” (“.”), … NLP: Well-known algorithms for automatically assigning POS
tags to English, French, Japanese, … (>95% accuracy)
Semantic Classes: Synonyms or other related words
“Price” class: price, cost, amount, … “Month” class: January, February, March, …, December “US State” class: Alaska, Alabama, …, Washington,
Wyoming WordNet: large on-line thesaurus containing (among other
things) semantic classes
Rapier rule matching example
“…sold to the bank for an undisclosed amount…”POS: vb pr det nn pr det jj nnSClass: price
“…paid Honeywell an undisclosed price…”POS: vb nnp det jj nnSClass: price
Rapier Rules: Details Rapier rule :=
pre-filler pattern filler pattern post-filler pattern
pattern := subpattern + subpattern := constraint + constraint :=
Word - exact word that must be present Tag - matched word must have given POS tag Class - semantic class of matched word Can specify disjunction with “{…}” List length N - between 0 and N words satisfying other
constraints
Rapier’s Learning Algorithm
Input: set of training examples (list of documents annotated with “extract this substring”)
Output: set of rules
Init: Rules = a rule that exactly matches each training example
Repeat several times: Seed: Select M examples randomly and generate the K
most-accurate maximally-general filler-only rules(prefiller = postfiller = “true”).
Grow:Repeat For N = 1, 2, 3, … Try to improve K best rules by adding N context words of prefiller or postfiller context
Keep:Rules = Rules the best of the K rules – subsumed rules
Learning example (one iteration)
2 examples:‘… located in Atlanta, Georgia…”
‘… offices in Kansas City, Missouri…’
maximally specific rules(high precision, low recall)
maximally general rules(low precision, high recall)
appropriately general rule (high precision, high recall)
Init
Seed
Grow
Rapier
Conceptually simple But powerful: it is easy to incorporate linguistic and
other constraints Advantages of rules
Easy to understand (but can be deceptive) One can “trace” the extraction of a piece of information
step by step Rules can be edited manually.
Disadvantages Difficult to incorporate quantitative / probabilistic
constraints “with offices in San Francisco in Northern California”