  • Information Extraction - Part I

    SNLP 2014

    CSE, IIT Kharagpur

    October 27, 2014

    SNLP 2014 (IIT Kharagpur) Information Extraction October 27, 2014 1 / 33

  • Goal: machine reading

    Goal: Acquire structured knowledge from unstructured text

  • Information Extraction

    Information Extraction (IE) Systems

    Find and understand limited relevant parts of texts

    Gather information from many pieces of text

    Produce a structured representation of relevant information:

      • Relations (in the database sense)
      • A knowledge base

    Goals

    Organize information so that it is useful to people

    Put information in a semantically precise form that allows further inferences to be made by computer algorithms

  • Information Extraction (IE)

    Definition: Information extraction is the task of finding structured information from unstructured or semi-structured text.

    What sort of information?

    IE systems extract clear, factual information

    Roughly: Who did what to whom when? etc.

    E.g., gathering earnings, profits, headquarters, etc. from company reports:

    "The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia."

    headquarters(BHP Billiton Limited, Melbourne, Australia)

  • Information Extraction (IE)

    Example: In 1998, Larry Page and Sergey Brin founded Google Inc.

    We can extract the following information:

    FounderOf(Larry Page, Google Inc.),

    FounderOf(Sergey Brin, Google Inc.),

    FoundedIn(Google Inc., 1998)

    Such information can be used by search engines and database management systems to provide better services to end users.
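
    One way to hold such extracted facts so that a program can query them is as typed relation tuples. The following is a minimal sketch; the names `Relation` and `query` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One extracted fact of the form name(arg1, arg2). Hypothetical container."""
    name: str
    arg1: str
    arg2: str


# Facts extracted from "In 1998, Larry Page and Sergey Brin founded Google Inc."
facts = {
    Relation("FounderOf", "Larry Page", "Google Inc."),
    Relation("FounderOf", "Sergey Brin", "Google Inc."),
    Relation("FoundedIn", "Google Inc.", "1998"),
}


def query(facts, name, arg2=None):
    """Return the sorted arg1 values of facts with this name (and arg2, if given)."""
    return sorted(f.arg1 for f in facts
                  if f.name == name and (arg2 is None or f.arg2 == arg2))


founders = query(facts, "FounderOf", arg2="Google Inc.")
print(founders)  # → ['Larry Page', 'Sergey Brin']
```

    Storing facts as a set of tuples mirrors "relations in the database sense": each relation name behaves like a table and each fact like a row.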

  • Applications in Biomedical domain

    Biomedical domain

    A large amount of scientific publications

    Need to look for discoveries related to particular genes, proteins or other biomedical entities

    Biomedical entities often have synonyms and ambiguous names

    Critical task: automatically identify mentions of biomedical entities in text and link them to their corresponding entries in existing knowledge bases.
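
    The linking step can be illustrated with a dictionary lookup over known synonyms. This is only a sketch under strong assumptions: the knowledge-base IDs are invented, the alias list is tiny, and ambiguous names (one alias pointing to several entries) are not handled.

```python
import re

# Hypothetical mini knowledge base: entry ID -> known surface forms.
# The IDs are invented for illustration only.
KNOWLEDGE_BASE = {
    "KB:TP53": ["TP53", "p53", "tumor protein p53"],
    "KB:IL2": ["IL2", "IL-2", "interleukin-2", "interleukin 2"],
}

# Invert the knowledge base: lowercased surface form -> entry ID.
ALIAS_TO_ID = {alias.lower(): kb_id
               for kb_id, aliases in KNOWLEDGE_BASE.items()
               for alias in aliases}

# Try longer aliases first so "tumor protein p53" wins over "p53".
_ALIASES = sorted(ALIAS_TO_ID, key=len, reverse=True)
MENTION_RE = re.compile(
    r"(?<![\w-])(" + "|".join(re.escape(a) for a in _ALIASES) + r")(?![\w-])",
    re.IGNORECASE,
)


def link_entities(text):
    """Return (mention, entry ID) pairs for every known alias found in the text."""
    return [(m.group(1), ALIAS_TO_ID[m.group(1).lower()])
            for m in MENTION_RE.finditer(text)]


sentence = ("Levels of interleukin-2 and p53 were measured; "
            "tumor protein p53 is also written TP53.")
links = link_entities(sentence)
for mention, kb_id in links:
    print(mention, "->", kb_id)
```

    Three different surface forms resolve to the same entry here, which is the synonym problem the slide describes; a real linker would also need context to resolve ambiguous names.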

  • Biomedical domain

  • More applications of IE

    Building and extending knowledge bases and ontologies

    Scholarly literature databases: Google Scholar, CiteSeerX

    People directories: Rapleaf, Spoke, Naymz

    Bioinformatics: clinical outcomes, gene interactions, ...

    Stock analysis: deals, acquisitions, earnings, hirings and firings

    Intelligence analysis for business and government

  • A Sample document containing a terrorism event

    Santiago, 10 Jan 90 - Police are carrying out intensive operations in the town of Molina in the seventh region in search of a gang of alleged extremists who could be linked to a recently discovered arsenal. It has been reported that Carabineros in Molina raided the house of 25-year-old worker Mario Munoz Pardo, where they found a FAL rifle, ammunition clips for various weapons, detonators, and material for making explosives. It should be recalled that a group of armed individuals wearing ski masks robbed a businessman on a rural road near Molina on 7 January. The businessman, Enrique Ormazabal Ormazabal, tried to resist; the men shot him and left him seriously wounded. He was later hospitalized in Curico. Carabineros carried out several operations, including the raid on Munoz's home. The police are continuing to patrol the area in search of the alleged terrorist command.

  • Information Extraction as per the template

  • Relation Extraction

  • Relation types

    For generic news text ...

  • Relation types from ACE 2003

    ROLE: relates a person to an organization or a geopolitical entity
    Subtypes: member, owner, affiliate, client, citizen

    PART: generalized containment
    Subtypes: subsidiary, physical part-of, set membership

    AT: permanent and transient locations
    Subtypes: located, based-in, residence

    SOCIAL: social relations among persons
    Subtypes: parent, sibling, spouse, grandparent, associate

  • Relation types: Freebase

  • Relation types: geographical

  • More relations: disease outbreaks

  • More relations: protein interactions

  • Relation extraction: 5 easy methods

    Hand-built patterns

    Bootstrapping methods

    Supervised methods

    Distant supervision

    Unsupervised methods

  • Hand-written Information Extraction: use regex

  • You can also use Tregex

    Identifying patterns in trees

    Simple example: NP < NN (an NP that immediately dominates an NN)
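
    To make the meaning of NP < NN concrete, here is a small Python sketch of immediate-dominance matching. It is not Tregex itself, only an illustration; the nested-tuple tree encoding, the example sentence, and the helper names are assumptions made for this example.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "chased"),
               ("NP", ("DT", "a"), ("JJ", "black"), ("NN", "cat"))))


def label(node):
    """Label of a subtree, or None for a leaf word."""
    return node[0] if isinstance(node, tuple) else None


def subtrees(node):
    """Yield every non-leaf node in pre-order."""
    if isinstance(node, tuple):
        yield node
        for child in node[1:]:
            yield from subtrees(child)


def immediately_dominates(tree, parent_label, child_label):
    """Subtrees labelled parent_label that have a direct child labelled child_label."""
    return [t for t in subtrees(tree)
            if label(t) == parent_label
            and any(label(c) == child_label for c in t[1:])]


def words(node):
    """Leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]


matches = immediately_dominates(tree, "NP", "NN")
phrases = [" ".join(words(m)) for m in matches]
print(phrases)  # → ['the dog', 'a black cat']
```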

  • Tregex: Other relations

  • Rule-based Extraction Examples

    Determining which person holds what position in what organization

    [person], [position] of [org]
    Example: Vuk Draskovic, leader of the Serbian Renewal Movement

    [org] (named, appointed, etc.) [person] Prep [office]
    Example: NATO appointed Wesley Clark as Commander in Chief
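
    These two rules can be sketched as regular expressions. The sketch assumes that a run of capitalized words is a good enough stand-in for a recognized person or organization name, which a real system would replace with a named-entity recognizer; the names `NAME`, `OFFICE`, and `extract_positions` are hypothetical.

```python
import re

# Rough stand-in for a named-entity recognizer: a run of capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
# Office titles may contain "in"/"of" between capitalized words.
OFFICE = r"[A-Z]\w*(?: (?:(?:in|of) )?[A-Z]\w*)*"

# [person], [position] of [org]
APPOSITIVE = re.compile(
    rf"(?P<person>{NAME}), (?P<position>[a-z][a-z ]*?) of (?:the )?(?P<org>{NAME})")

# [org] (named, appointed, etc.) [person] Prep [office]
APPOINTMENT = re.compile(
    rf"(?P<org>{NAME}) (?:named|appointed) (?P<person>{NAME}) (?:as|to) (?P<position>{OFFICE})")


def extract_positions(text):
    """Return (person, position, org) triples matched by either rule."""
    triples = []
    for pattern in (APPOSITIVE, APPOINTMENT):
        for m in pattern.finditer(text):
            triples.append((m["person"], m["position"], m["org"]))
    return triples


print(extract_positions("Vuk Draskovic, leader of the Serbian Renewal Movement"))
print(extract_positions("NATO appointed Wesley Clark as Commander in Chief"))
```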

  • Rule-based Extraction Examples

    Determining where an organization is located

    [org] in [loc]
    Example: NATO headquarters in Brussels

    [org] [loc] (division, branch, headquarters, etc.)
    Example: KFOR Kosovo headquarters
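
    Rules like these are usually applied to text in which entities have already been tagged. A minimal sketch, assuming single-token entities and tags ("ORG", "LOC", or "O") supplied by an upstream named-entity recognizer; the function name and tag set are hypothetical.

```python
UNIT_WORDS = {"division", "branch", "headquarters"}


def located_in(tagged):
    """Find (org, loc) pairs in a list of (token, tag) pairs."""
    pairs = []
    for i, (token, tag) in enumerate(tagged):
        if tag != "ORG":
            continue
        after = tagged[i + 1:]
        # Rule: [org] [loc] (division, branch, headquarters, etc.)
        if len(after) >= 2 and after[0][1] == "LOC" and after[1][0] in UNIT_WORDS:
            pairs.append((token, after[0][0]))
            continue
        # Rule: [org] in [loc], allowing a unit word right after the organization
        if after and after[0][0] in UNIT_WORDS:
            after = after[1:]
        if len(after) >= 2 and after[0][0] == "in" and after[1][1] == "LOC":
            pairs.append((token, after[1][0]))
    return pairs


nato = [("NATO", "ORG"), ("headquarters", "O"), ("in", "O"), ("Brussels", "LOC")]
kfor = [("KFOR", "ORG"), ("Kosovo", "LOC"), ("headquarters", "O")]
print(located_in(nato))  # → [('NATO', 'Brussels')]
print(located_in(kfor))  # → [('KFOR', 'Kosovo')]
```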

  • Patterns for learning hyponyms

    Intuition from Hearst (1992):

    "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use."

    What is Gelidium?

    How do you know?

  • Hearst's lexico-syntactic patterns

    Automatic Acquisition of Hyponyms

    Y such as X ((, X)* (, and/or) X)

    such Y as X

    X or other Y

    X and other Y

    Y including X

    Y, especially X
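
    The patterns are defined over noun phrases, so the sketch below assumes the text has already been chunked and that each noun phrase is wrapped in square brackets. That bracket convention and the function name `hyponym_pairs` are assumptions for illustration, not part of Hearst's method.

```python
import re

NP = r"\[[^\]]+\]"                              # one bracketed noun phrase
NP_LIST = rf"{NP}(?:,? (?:(?:and|or) )?{NP})*"  # X, X, and/or X

HEARST_PATTERNS = [
    re.compile(rf"(?P<y>{NP}),? such as (?P<xs>{NP_LIST})"),
    re.compile(rf"such (?P<y>{NP}) as (?P<xs>{NP_LIST})"),
    re.compile(rf"(?P<xs>{NP_LIST}),? (?:and|or) other (?P<y>{NP})"),
    re.compile(rf"(?P<y>{NP}),? including (?P<xs>{NP_LIST})"),
    re.compile(rf"(?P<y>{NP}),? especially (?P<xs>{NP_LIST})"),
]


def hyponym_pairs(chunked):
    """Return (hyponym X, hypernym Y) pairs found in NP-bracketed text."""
    pairs = []
    for pattern in HEARST_PATTERNS:
        for m in pattern.finditer(chunked):
            hypernym = m["y"].strip("[]")
            for x in re.findall(NP, m["xs"]):
                pairs.append((x.strip("[]"), hypernym))
    return pairs


AGAR = ("Agar is a substance prepared from a mixture of [red algae], "
        "such as [Gelidium], for laboratory or industrial use.")
print(hyponym_pairs(AGAR))  # → [('Gelidium', 'red algae')]
```

    The output reads as "Gelidium is a kind of red algae", which is the inference a human reader makes from the sentence without knowing either term.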

  • Examples of Hearst patterns

  • Patterns for learning meronyms

    Berland and Charniak's patterns

    Selected initial patterns by finding all sentences in a corpus containing "basement" and "building"

  • Problems with hand-built patterns

    Requires hand-building patterns for each relation!

      • hard to write; hard to maintain
      • there are zillions of them
      • domain-dependent

    Don't want to do this for all possible relations!

    Plus, we'd like better accuracy

      • Hearst: 66% accuracy on hyponym extraction
      • Berland and Charniak: 55% accuracy on meronyms

  • Bootstrapping approaches

    If you don't have enough annotated text to train on ...

    But you do have:

      • some seed instances of the relation
      • (or some patterns that work pretty well)
      • and lots and lots of unannotated text (e.g., the web)

    ... can you use those seeds to do something useful?

    Bootstrapping can be considered semi-supervised

  • Bootstrapping example

    Target relation: burial place

    Seed tuple: [Mark Twain, Elmira]

    Google for "Mark Twain" and "Elmira"

    Use those patterns to search for new tuples
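
    The loop of "find sentences containing the seed, turn them into patterns, use the patterns to find new tuples" can be sketched as follows. Everything here is a toy under stated assumptions: the corpus sentences were written for this example in place of web search results, entities are approximated by runs of capitalized words, and patterns are whole sentences with the two entities replaced by slots.

```python
import re

# Toy stand-in for "lots of unannotated text"; sentences written for illustration.
CORPUS = [
    "Mark Twain is buried in Elmira.",
    "The grave of Mark Twain is in Elmira.",
    "William Faulkner is buried in Oxford.",
    "The grave of Emily Dickinson is in Amherst.",
]

ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"  # crude entity matcher: capitalized word run


def learn_patterns(corpus, tuples):
    """Turn each sentence mentioning a known (x, y) pair into a regex with slots."""
    patterns = set()
    for x, y in tuples:
        for sentence in corpus:
            if x in sentence and y in sentence:
                template = re.escape(sentence)
                template = template.replace(re.escape(x), f"(?P<x>{ENTITY})", 1)
                template = template.replace(re.escape(y), f"(?P<y>{ENTITY})", 1)
                patterns.add(template)
    return patterns


def apply_patterns(corpus, patterns):
    """Collect every (x, y) pair that some learned pattern matches."""
    found = set()
    for pattern in patterns:
        for sentence in corpus:
            m = re.fullmatch(pattern, sentence)
            if m:
                found.add((m["x"], m["y"]))
    return found


def bootstrap(corpus, seeds, rounds=2):
    """Alternate between learning patterns from tuples and tuples from patterns."""
    tuples = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(corpus, tuples)
        tuples |= apply_patterns(corpus, patterns)
    return tuples


burial_places = bootstrap(CORPUS, {("Mark Twain", "Elmira")})
print(sorted(burial_places))
```

    Starting from the single seed, the two learned patterns ("X is buried in Y." and "The grave of X is in Y.") pick up the other two tuples. Nothing in this sketch scores patterns or tuples, so a misleading sentence containing the seed pair would be generalized just as readily.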

  • Bootstrapping relations

  • Bootstrapping problems

    Requires that we have seeds for each relation

      • Sensitive to original set of seeds

    Generally have lots of parameters to be tuned

    No probabilistic interpretation

      • Hard to know how confident to be in each result

  • Supervised relation extraction

    The supervised approach requires:

    Defining an inventory of output labels

      • Relation classification: located-in, employee-of, inventor-of, ...

    Collecting labeled training data: MUC, ACE, ...

    Defining a feature representation: words, entity types, ...

    Choosing a classifier: Naive Bayes, MaxEnt, SVM, ...

    Evaluating the results
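
    The steps above can be walked through end to end on a toy problem. This is a sketch only: the four training examples and two test examples are invented, the label inventory is cut down to two relations, and Naive Bayes is picked from the listed classifiers because it fits in a few lines.

```python
import math
from collections import Counter, defaultdict


def features(example):
    """Feature representation: entity types plus the words between the mentions."""
    feats = [f"type1={example['type1']}", f"type2={example['type2']}"]
    feats += [f"between={w.lower()}" for w in example["between"].split()]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing; unseen features are ignored."""

    def fit(self, examples, labels):
        self.label_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for example, label in zip(examples, labels):
            for f in features(example):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, example):
        n_examples = sum(self.label_counts.values())

        def log_score(label):
            total = sum(self.feat_counts[label].values())
            score = math.log(self.label_counts[label] / n_examples)
            for f in features(example):
                if f in self.vocab:
                    count = self.feat_counts[label][f]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            return score

        return max(self.label_counts, key=log_score)


# Invented labeled data standing in for a corpus such as MUC or ACE.
train = [
    ({"type1": "ORG", "type2": "LOC", "between": "is headquartered in"}, "located-in"),
    ({"type1": "ORG", "type2": "LOC", "between": "has offices in"}, "located-in"),
    ({"type1": "PER", "type2": "ORG", "between": "works for"}, "employee-of"),
    ({"type1": "PER", "type2": "ORG", "between": "is an engineer at"}, "employee-of"),
]
test = [
    ({"type1": "PER", "type2": "ORG", "between": "works at"}, "employee-of"),
    ({"type1": "ORG", "type2": "LOC", "between": "is based in"}, "located-in"),
]

model = NaiveBayes().fit([x for x, _ in train], [y for _, y in train])
predictions = [model.predict(x) for x, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

    With so little data the accuracy figure means nothing; the point is only how labels, features, classifier, and evaluation fit together.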
