Information Extraction - Part I
SNLP 2014
CSE, IIT Kharagpur
October 27, 2014
SNLP 2014 (IIT Kharagpur) Information Extraction October 27, 2014 1 / 33
-
Goal: machine reading
Goal
Acquire structured knowledge from unstructured text
-
Information Extraction
Information Extraction (IE) Systems
Find and understand limited relevant parts of texts
Gather information from many pieces of text
Produce a structured representation of relevant information:
  - Relations (in the database sense)
  - A knowledge base
Goals
Organize information so that it is useful to people
Put information in a semantically precise form that allows further inferences to be made by computer algorithms
-
Information Extraction (IE)
Definition
Information extraction is the task of finding structured information from unstructured or semi-structured text.
What sort of information?
IE systems extract clear, factual information
Roughly: Who did what to whom when? etc.
E.g., gathering earnings, profits, headquarters, etc. from company reports
The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
headquarters(BHP Billiton Limited, Melbourne, Australia)
-
Information Extraction (IE)
Example
In 1998, Larry Page and Sergey Brin founded Google Inc.
We can extract the following information:
FounderOf(Larry Page, Google Inc.),
FounderOf(Sergey Brin, Google Inc.),
FoundedIn(Google Inc., 1998)
Such information can be used by search engines and database management systems to provide better services to end users.
-
Applications in Biomedical domain
Biomedical domain
A large amount of scientific publications
Need to look for discoveries related to particular genes, proteins or other biomedical entities
Biomedical entities often have synonyms and ambiguous names
Critical task: automatically identify mentions of biomedical entities in text and link them to their corresponding entries in existing knowledge bases.
-
Biomedical domain
-
More applications of IE
Building and extending knowledge bases and ontologies
Scholarly literature databases: Google Scholar, CiteSeerX
People directories: Rapleaf, Spoke, Naymz
Bioinformatics: clinical outcomes, gene interactions, ...
Stock analysis: deals, acquisitions, earnings, hirings and firings
Intelligence analysis for business and government
-
A Sample document containing a terrorism event
Santiago, 10 Jan 90 - Police are carrying out intensive operations in the town of Molina in the seventh region in search of a gang of alleged extremists who could be linked to a recently discovered arsenal. It has been reported that Carabineros in Molina raided the house of 25-year-old worker Mario Munoz Pardo, where they found a FAL rifle, ammunition clips for various weapons, detonators, and material for making explosives. It should be recalled that a group of armed individuals wearing ski masks robbed a businessman on a rural road near Molina on 7 January. The businessman, Enrique Ormazabal Ormazabal, tried to resist; the men shot him and left him seriously wounded. He was later hospitalized in Curico. Carabineros carried out several operations, including the raid on Munoz's home. The police are continuing to patrol the area in search of the alleged terrorist command.
-
Information Extraction as per the template
-
Relation Extraction
-
Relation types
For generic news text ...
-
Relation types from ACE 2003
ROLE: relates a person to an organization or a geopolitical entity
  subtypes: member, owner, affiliate, client, citizen
PART: generalized containment
  subtypes: subsidiary, physical part-of, set membership
AT: permanent and transient locations
  subtypes: located, based-in, residence
SOCIAL: social relations among persons
  subtypes: parent, sibling, spouse, grandparent, associate
-
Relation types: Freebase
-
Relation types: geographical
-
More relations: disease outbreaks
-
More relations: protein interactions
-
Relation extraction: 5 easy methods
Hand-built patterns
Bootstrapping methods
Supervised methods
Distant supervision
Unsupervised methods
-
Hand-written Information Extraction: use regex
-
You can also use Tregex
Identifying Patterns in Trees
Simple example: NP < NN (an NP that immediately dominates an NN)
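To make the meaning of the `<` operator concrete, here is a minimal Python sketch that emulates "A immediately dominates B" on a toy parse tree. It is not Tregex itself: the nested-tuple tree encoding, the example sentence, and the helper names `immediately_dominates` and `leaves` are assumptions made only for illustration.

```python
# Toy emulation of the Tregex relation "NP < NN" (NP immediately dominates NN).
# Trees are nested tuples: (label, child, child, ...); leaves are plain strings.
# This encoding and the helper names are illustrative assumptions, not Tregex API.

def immediately_dominates(tree, parent_label, child_label):
    """Yield each subtree labelled parent_label with a direct child labelled child_label."""
    label, children = tree[0], tree[1:]
    subtrees = [c for c in children if isinstance(c, tuple)]
    if label == parent_label and any(s[0] == child_label for s in subtrees):
        yield tree
    for s in subtrees:
        yield from immediately_dominates(s, parent_label, child_label)

def leaves(tree):
    """Return the words under a subtree, left to right."""
    words = []
    for child in tree[1:]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

# Hypothetical parse of "the dog chased a cat in Paris".
TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "chased"),
         ("NP",
          ("NP", ("DT", "a"), ("NN", "cat")),
          ("PP", ("IN", "in"), ("NP", ("NNP", "Paris"))))))

matches = list(immediately_dominates(TREE, "NP", "NN"))
for m in matches:
    print(" ".join(leaves(m)))
```

The outer NP ("a cat in Paris") is not returned, because its NN is a grandchild rather than a direct child; that is the difference between immediate dominance and plain dominance.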
-
Tregex: Other relations
-
Rule-based Extraction Examples
Determining which person holds what position in what organization
[person], [position] of [org]
  Vuk Draskovic, leader of the Serbian Renewal Movement
[org] (named, appointed, etc.) [person] Prep [office]
  NATO appointed Wesley Clark as Commander in Chief
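The two patterns above can be approximated with regular expressions. The following is a rough sketch, assuming that capitalization is an adequate stand-in for the [person] and [org] slots; a real system would fill those slots from a named-entity recognizer. The names `NAME`, `ORG`, and `extract_positions` are hypothetical.

```python
import re

# Crude stand-ins for NER output (an assumption of this sketch):
# a person is two or more capitalized words, an org is one or more.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"
ORG = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"

# [person], [position] of [org]
PERSON_POSITION_ORG = re.compile(
    rf"(?P<person>{NAME}), (?P<position>[a-z]+(?: [a-z]+)*) of (?:the )?(?P<org>{ORG})"
)

# [org] (named, appointed) [person] Prep [office]
ORG_APPOINTED_PERSON = re.compile(
    rf"(?P<org>{ORG}) (?:named|appointed) (?P<person>{NAME}) (?:as|to) "
    rf"(?P<office>[A-Z][a-z]+(?: (?:in|of|[A-Z][a-z]+))*)"
)

def extract_positions(text):
    """Return (person, position, org) triples found by either pattern."""
    triples = []
    for pattern in (PERSON_POSITION_ORG, ORG_APPOINTED_PERSON):
        for m in pattern.finditer(text):
            groups = m.groupdict()
            role = groups.get("position") or groups.get("office")
            triples.append((groups["person"], role, groups["org"]))
    return triples

print(extract_positions("Vuk Draskovic, leader of the Serbian Renewal Movement"))
print(extract_positions("NATO appointed Wesley Clark as Commander in Chief"))
```

Even on these two sentences the patterns lean heavily on surface form (comma placement, capitalization, a fixed verb list), which is why such rules tend to be precise but brittle.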
-
Rule-based Extraction Examples
Determining where an organization is located
[org] in [loc]
  NATO headquarters in Brussels
[org] [loc] (division, branch, headquarters, etc.)
  KFOR Kosovo headquarters
-
Patterns for learning hyponyms
Intuition from Hearst (1992)
Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.
What is Gelidium?
How do you know?
-
Hearst's lexico-syntactic patterns
Automatic Acquisition of Hyponyms
Y such as X ((, X)* (, and/or) X)
such Y as X
X or other Y
X and other Y
Y including X
Y, especially X
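As an illustration, the first pattern ("Y such as X") can be sketched with a regular expression. This is a simplification under stated assumptions: the hypernym Y is taken to be the one or two words before "such as", and each hyponym X is a single word, whereas Hearst's patterns operate over noun phrases. The names `SUCH_AS`, `LIST_SEP`, and `hearst_such_as` are hypothetical.

```python
import re

# "Y such as X ((, X)* (, and/or) X)" with word-level stand-ins for noun phrases.
# Assumption: Y is the one or two words before "such as"; each X is one word.
SUCH_AS = re.compile(
    r"\b((?:\w+ )?\w+),? such as "
    r"(\w+(?:(?:, \w+)*,? (?:and|or) \w+)?)"
)
# Splits "France, Italy, and Spain" into its items.
LIST_SEP = re.compile(r",?\s+(?:and|or)\s+|,\s*")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs licensed by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in LIST_SEP.split(match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs

agar = ("Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use.")
print(hearst_such_as(agar))
print(hearst_such_as("He visited European countries such as France, Italy, and Spain."))
```

The list part only extends past the first X when it ends in "and" or "or", so that the trailing ", for laboratory ..." in the agar sentence is not mistaken for a second hyponym.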
-
Examples of Hearst patterns
-
Patterns for learning meronyms
Berland and Charniak's patterns
Selected initial patterns by finding all sentences in a corpus containing "basement" and "building"
-
Problems with hand-built patterns
Requires hand-building patterns for each relation!
  - hard to write; hard to maintain
  - there are zillions of them
  - domain-dependent
Don't want to do this for all possible relations!
Plus, we'd like better accuracy
  - Hearst: 66% accuracy on hyponym extraction
  - Berland and Charniak: 55% accuracy on meronyms
-
Bootstrapping approaches
If you don't have enough annotated text to train on ...
But you do have:
  - some seed instances of the relation
  - (or some patterns that work pretty well)
  - and lots and lots of unannotated text (e.g., the web)
... can you use those seeds to do something useful?
Bootstrapping can be considered semi-supervised
-
Bootstrapping example
Target relation: burial place
Seed tuple : [ Mark Twain, Elmira ]
Google for "Mark Twain" and "Elmira"; generalize the matching sentences into patterns (e.g., "X is buried in Y")
Use those patterns to search for new tuples
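One round of this loop can be sketched in Python as follows. The sketch assumes a tiny in-memory list of sentences in place of web search results, treats a whole sentence as the pattern context, and uses capitalized word sequences as a stand-in for entity recognition; `CORPUS`, `ENTITY`, `induce_patterns`, and `apply_patterns` are hypothetical names.

```python
import re

# Capitalized word sequence as a crude entity matcher (assumption of this sketch).
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Toy stand-in for search results; a real system would query a large corpus.
CORPUS = [
    "Mark Twain is buried in Elmira.",
    "The grave of Mark Twain is in Elmira.",
    "Emily Dickinson is buried in Amherst.",
    "The grave of Edgar Allan Poe is in Baltimore.",
    "Mark Twain wrote novels.",
]

def induce_patterns(corpus, seeds):
    """Turn each sentence that mentions both seed arguments into a regex pattern."""
    patterns = set()
    for x, y in seeds:
        for sentence in corpus:
            if x in sentence and y in sentence:
                template = re.escape(sentence)
                template = template.replace(re.escape(x), f"(?P<x>{ENTITY})", 1)
                template = template.replace(re.escape(y), f"(?P<y>{ENTITY})", 1)
                patterns.add(template)
    return patterns

def apply_patterns(corpus, patterns):
    """Collect every (x, y) tuple matched by any induced pattern."""
    found = set()
    for pattern in patterns:
        for sentence in corpus:
            m = re.fullmatch(pattern, sentence)
            if m:
                found.add((m.group("x"), m.group("y")))
    return found

seeds = {("Mark Twain", "Elmira")}
patterns = induce_patterns(CORPUS, seeds)
new_tuples = apply_patterns(CORPUS, patterns) - seeds
print(sorted(new_tuples))
```

Feeding `new_tuples` back in as seeds gives the next iteration. Nothing in this sketch scores patterns or tuples, so a single bad pattern would propagate its errors to later rounds.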
-
Bootstrapping relations
-
Bootstrapping problems
Requires that we have seeds for each relation
  - Sensitive to original set of seeds
Generally have lots of parameters to be tuned
No probabilistic interpretation
  - Hard to know how confident to be in each result
-
Supervised relation extraction
The supervised approach requires
Defining an inventory of output labels
  - Relation classification: located-in, employee-of, inventor-of, ...
Collecting labeled training data: MUC, ACE, ...
Defining a feature representation: words, entity types, ...
Choosing a classifier: Naive Bayes, MaxEnt, SVM, ...
Evaluating the results
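The steps above can be sketched end to end with a small Naive Bayes classifier in plain Python. Everything concrete here is an assumption for illustration: the six training examples are invented rather than taken from MUC or ACE, the features are just the words between the two entity mentions plus the pair of entity types, and the `features` function and `NaiveBayes` class are hypothetical.

```python
import math
from collections import Counter, defaultdict

def features(words_between, type1, type2):
    """Feature representation: words between the mentions plus the entity-type pair."""
    feats = [f"w={w.lower()}" for w in words_between]
    feats.append(f"types={type1}-{type2}")
    return feats

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, X, y):
        self.label_counts = Counter(y)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(X, y):
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training set (not real MUC/ACE annotations).
train = [
    (features(["is", "located", "in"], "ORG", "LOC"), "located-in"),
    (features(["has", "its", "headquarters", "in"], "ORG", "LOC"), "located-in"),
    (features(["works", "for"], "PER", "ORG"), "employee-of"),
    (features(["is", "an", "engineer", "at"], "PER", "ORG"), "employee-of"),
    (features(["invented"], "PER", "PRODUCT"), "inventor-of"),
    (features(["is", "the", "inventor", "of"], "PER", "PRODUCT"), "inventor-of"),
]
model = NaiveBayes().fit([f for f, _ in train], [label for _, label in train])

print(model.predict(features(["is", "based", "in"], "ORG", "LOC")))
print(model.predict(features(["is", "a", "manager", "at"], "PER", "ORG")))
```

The final step, evaluating the results, would compare predictions against held-out labeled pairs, typically with precision, recall, and F1 per relation; that step is omitted here because the toy data has no held-out split.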