Information Extraction from Literature Yue Lu BeeSpace Seminar Oct 24, 2007.


Information Extraction from Literature

Yue Lu
BeeSpace Seminar
Oct 24, 2007

Outline

Overview of BeeSpace v4
Entity Recognition
Relation Extraction

Overview

BeeSpace v4
  A deeper semantic base than the current v3 system
  Entities and relations vs. mutual information

Four levels
  Level 1: Entity Recognition
  Level 2: Entity Association Mining
  Level 3: Relation Extraction
  Level 4: Inference and Hypothesis Generation

Overview

Level 1: Entity Recognition (detailed later)

Level 2: Entity Association Mining
  Suppose entities are properly tagged
  Utilize the co-occurrence patterns of entities to extract semantics
  e.g. a bee biologist may want to know which genes are important for foraging behavior
  Similar to the TREC Genomics 2007 task
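The co-occurrence idea above can be sketched in a few lines. This is only an illustration under assumptions: the tagged sentences, the entity names, and the helper names `cooccurrence_counts` and `associated` are made up here, and raw sentence-level counts stand in for whatever association score BeeSpace v4 would actually use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged sentences: each is a list of (entity_type, entity_text) pairs.
tagged_sentences = [
    [("GENE", "for"), ("BEHAVIOR", "foraging")],
    [("GENE", "for"), ("BEHAVIOR", "foraging"), ("GENE", "per")],
    [("GENE", "per"), ("BEHAVIOR", "sleep")],
    [("GENE", "for"), ("ANATOMY", "brain")],
]

def cooccurrence_counts(sentences):
    """Count how often each unordered pair of distinct entities shares a sentence."""
    counts = Counter()
    for entities in sentences:
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts

def associated(counts, target, wanted_type):
    """Rank entities of wanted_type by their co-occurrence count with target."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == target and b[0] == wanted_type:
            scores[b[1]] += n
        elif b == target and a[0] == wanted_type:
            scores[a[1]] += n
    return scores.most_common()

counts = cooccurrence_counts(tagged_sentences)
genes_for_foraging = associated(counts, ("BEHAVIOR", "foraging"), "GENE")
print(genes_for_foraging)
```

A real system would likely normalize these counts (for example with mutual information) so that frequent entities do not dominate the ranking.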

TREC Genomics 2007

e.g. “Which [PATHWAYS] are possibly involved in the disease ADPKD?”

Currently only retrieval techniques
  Gene synonym expansion
  Conjunctive query interpretation
  User relevance feedback

Tagged entities would definitely help
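Two of the retrieval techniques listed above, synonym expansion and conjunctive query interpretation, can be sketched together. This is a toy illustration, not how any TREC Genomics system worked: the synonym table, the two documents, and the function names `expand` and `matches_conjunctive` are assumptions made for the example.

```python
# Hypothetical synonym table; a real system would load this from a gene database.
SYNONYMS = {
    "pkd1": {"pkd1", "polycystin-1"},
}

def expand(term):
    """Return the lower-cased term plus any known synonyms."""
    term = term.lower()
    return SYNONYMS.get(term, {term})

def matches_conjunctive(query_terms, document):
    """A document matches if, for EVERY query term, at least one synonym occurs in it."""
    text = document.lower()
    return all(any(s in text for s in expand(t)) for t in query_terms)

# Made-up documents, for illustration only.
docs = [
    "Polycystin-1 signalling is altered in ADPKD kidneys.",
    "ADPKD is a common inherited kidney disease.",
]
hits = [d for d in docs if matches_conjunctive(["PKD1", "ADPKD"], d)]
print(hits)
```

Substring matching is used only to keep the sketch short; with tagged entities the match could be made on entity identity instead of surface strings.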

Overview

Level 3: Relation Extraction
  Goal is to extract the relations between entities
  Generally requires entities to be properly tagged first
  Detailed later

Level 4: Inference and Hypothesis Generation
  Inference on knowledge base
  Graph mining

Outline

Overview of BeeSpace v4
Entity Recognition
Relation Extraction

Entity Recognition

Gene Example:

Although <GENE>mxp</GENE> and <GENE>Pb</GENE> display very similar expression patterns, <GENE>pb</GENE> null embryos develop normally

Entity Recognition

Anatomy Example:

In normal embryos, mxp is expressed in the <ANATOMY>maxillary</ANATOMY> and <ANATOMY>labial</ANATOMY> segments, whereas ectopic expression is observed in some GOF variants.

Entity Recognition

Biological process Example:

Amongst these are the Bicoid, the Nanos, and the terminal class gene products, some of which are oncoproteins involved in signal transduction for <BIOLOGICAL PROCESS>the formation of terminal structures in the embryo</BIOLOGICAL PROCESS>.

Entity Recognition

Pathways Example:

Several signal transduction pathways have been described in Drosophila, and this review explores the potential of oncogene studies using one of those pathways - <PATHWAY>the terminal class signal transduction pathway</PATHWAY> - to better understand the cellular mechanisms of proto-oncogenes that mediate cellular responses in vertebrates including humans

Entity Recognition

Protein family Example:

While non-arthropod orthologs have been found for many Drosophila eye developmental genes, this has not been the case for the glass (gl) gene, which encodes a <PROTEIN FAMILY>zinc finger transcription factor</PROTEIN FAMILY> required for photoreceptor cell specification, differentiation, and survival.

Entity Recognition

CRE (cis-regulatory elements) Example:

A synthetic, 23-bp <CRE>ecdysterone regulatory element (EcRE)</CRE>, derived from the upstream region of the Drosophila melanogaster hsp27 gene, was inserted adjacent to the herpes simplex virus thymidine kinase promoter fused to a bacterial gene for chloramphenicol acetyltransferase (CAT).

Entity Recognition

Phenotype Definition:

a set of observable physical characteristics of an individual organism

Example: Fog, dumpy

Entity Recognition

Class 1: Small Variation (Dictionary/Ontology)
  Organism, Anatomy, Biological Process, Pathway, Protein Family

Class 2: Medium Variation
  Gene, cis-Regulatory Element

Class 3: Large Variation
  Phenotype, Behavior

Entity Recognition

Generally can be defined as a classification problem

Boils down to feature definition
  Class 1: matching a word in the Dictionary/Ontology
  Class 2: prefix/suffix of the word, POS tags, …
  Class 3: ?
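As a rough sketch of what "feature definition" could look like for a single token, the function below combines a Class 1 style dictionary-match feature with Class 2 style prefix/suffix features. The function name `word_features`, the feature names, and the tiny lexicon are assumptions for illustration; POS tags are left out because they would need an external tagger.

```python
def word_features(tokens, i, lexicon):
    """Build a feature dict for tokens[i]; lexicon is a set of lower-cased dictionary terms."""
    word = tokens[i]
    return {
        # Class 1 style feature: exact dictionary/ontology match
        "in_lexicon": word.lower() in lexicon,
        # Class 2 style features: surface form of the word
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "has_digit": any(c.isdigit() for c in word),
        "is_capitalized": word[:1].isupper(),
        # Neighbouring words as simple context
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

tokens = "mxp is expressed in the maxillary segments".split()
lexicon = {"maxillary", "labial"}  # hypothetical anatomy dictionary
feats = word_features(tokens, 5, lexicon)
print(feats)
```

Feature dicts like this one could then be fed to any standard classifier, one instance per token.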

Entity Recognition

First focus on Class 1
  Relatively simple
  Class 2 and Class 3 need training examples
  Useful in entity association mining
  Useful in facilitating extraction of many interesting relations

Related work: Textpresso

Textpresso

Input: full-text C. elegans literature
Output: tagged XML format
Defined a Textpresso ontology
  First category is biological entities
Manually curated a lexicon of names
Implemented with Perl regular expressions
We could reuse some of the regular expressions
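A lexicon-plus-regular-expression tagger of the kind described above might look roughly like this. It is written in Python rather than Perl, and it is not Textpresso's actual code: the mini-lexicon and the names `build_pattern` and `tag` are hypothetical.

```python
import re

# Hypothetical mini-lexicon; a curated one would be far larger.
LEXICON = {
    "ANATOMY": ["maxillary", "labial"],
    "GENE": ["mxp", "pb"],
}

def build_pattern(lexicon):
    """Compile one alternation with a named group per category, longest terms first."""
    parts = []
    for category, terms in lexicon.items():
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(t) for t in ordered)
        parts.append(f"(?P<{category}>\\b(?:{alternatives})\\b)")
    return re.compile("|".join(parts), re.IGNORECASE)

def tag(text, pattern):
    """Wrap each lexicon match in <CATEGORY>...</CATEGORY>."""
    def wrap(match):
        category = match.lastgroup  # name of the category group that matched
        return f"<{category}>{match.group(0)}</{category}>"
    return pattern.sub(wrap, text)

PATTERN = build_pattern(LEXICON)
tagged = tag("mxp is expressed in the maxillary and labial segments", PATTERN)
print(tagged)
```

Category names are used directly as regex group names here, so a category such as BIOLOGICAL PROCESS would need an identifier without spaces in this sketch.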

Entity Recognition

Resources:
  Organism: Entrez gene table, Textpresso, BeeSpace DB
  Anatomy: FlyBase
  Biological Process, Cellular Component, Molecular Function: Textpresso
  Pathway: KEGG
  Protein Family: PDB, NCBI

Outline

Overview of BeeSpace v4
Entity Recognition
Relation Extraction

Relation Extraction

Expression Location
  The expression of a gene in some location (tissues, body parts)

Homology/Orthology
  One gene is homologous to another gene

Relation Extraction

Biological process
  One gene has some role in a biological process

Genetic/Physical/Regulatory Interaction
  One gene interacts with another gene in a certain fashion (3 types of relations)
  A simple case: Protein-Protein Interaction (PPI)

Relation Extraction

Generally can be defined as a classification problem, which requires training data

Domain adaptation?
  An example: PPI

PPI

Problem Definition:
  Gene/protein names are already tagged
  A known list of interaction words (133 words)
  Classify each tuple (p1, p2, interWord) in one single sentence
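To make the problem definition concrete, the sketch below enumerates candidate (p1, p2, interWord) tuples for one sentence; each candidate would then go to the classifier. The four-word interaction list is a hypothetical stand-in for the 133-word list, the sentence is simplified, and `candidate_tuples` is an illustrative name.

```python
from itertools import combinations

# Hypothetical stand-in for the 133-word interaction list.
INTERACTION_WORDS = {"inhibit", "binding", "exhibits", "induced"}

def candidate_tuples(tokens, protein_positions):
    """Enumerate (p1, p2, interWord) candidates within one tokenized sentence.

    protein_positions maps token index -> protein name for the pre-tagged proteins.
    """
    inter_words = [t for t in tokens if t.lower() in INTERACTION_WORDS]
    candidates = []
    for i, j in combinations(sorted(protein_positions), 2):
        for word in inter_words:
            candidates.append((protein_positions[i], protein_positions[j], word))
    return candidates

tokens = "IgG was able to inhibit binding of IgE".split()
cands = candidate_tuples(tokens, {0: "IgG", 7: "IgE"})
print(cands)
```

Every protein pair is combined with every interaction word in the sentence, so one sentence can produce several candidates, only some of which are true interactions.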

PPI

Methods
  Learning algorithm: Maximum Entropy
  Context features
    “Extracting protein-protein interactions using simple contextual features training data” BioNLP Workshop on HLT-NAACL 06
    e.g. lexical forms, POS tags, …
    Less dependent on domain
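For a two-class decision (interaction or not), a maximum entropy model has the same form as logistic regression, so the approach can be sketched with a small pure-Python trainer. Everything here is an assumption for illustration: the feature set in `context_features` is a guess at "simple contextual features" and does not reproduce the cited paper, the training tuples are toy data, and plain stochastic gradient ascent with L2 regularization replaces a real maxent toolkit.

```python
import math
from collections import defaultdict

def context_features(tokens, p1, p2, w):
    """Features for a tuple, given token indices p1 < p2 (proteins) and w (interaction word)."""
    feats = {
        f"inter={tokens[w].lower()}": 1.0,               # lexical form of the interaction word
        "inter_between": 1.0 if p1 < w < p2 else 0.0,  # does it sit between the two proteins?
        "bias": 1.0,
    }
    for k in range(p1 + 1, p2):
        if k != w:
            feats[f"between={tokens[k].lower()}"] = 1.0  # other words between the proteins
    return feats

def predict(weights, feats):
    """Probability that the tuple is a true interaction."""
    z = sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5, l2=0.01):
    """Stochastic gradient ascent on the L2-regularized log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict(weights, feats)
            for name, value in feats.items():
                weights[name] += lr * (error * value - l2 * weights[name])
    return weights

# Toy training tuples, for illustration only.
training = [
    (context_features(["A", "binds", "B"], 0, 2, 1), 1),
    (context_features(["X", "inhibits", "Y"], 0, 2, 1), 1),
    (context_features(["A", "or", "B", "induced", "C"], 0, 2, 3), 0),
    (context_features(["X", "and", "Y", "binds", "Z"], 0, 2, 3), 0),
]
weights = train(training)
p_new = predict(weights, context_features(["P", "activates", "Q"], 0, 2, 1))
print(round(p_new, 2))
```

The unseen word "activates" has no learned weight, so the prediction for the new tuple rests on the positional feature, which hints at why such features may transfer across domains better than purely lexical ones.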

PPI

Training/Testing data:
  BioCreative
  1000 hand-labeled sentences, 3964 tuples
  5-fold cross validation

Performance
  avgpr = 47.14624
  avgre = 43.97337
  avgf1 = 45.35523
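The averaged figures above can be read with the following sketch, which computes precision, recall, and F1 per fold and then averages each metric. The fold counts are made up and the function names are illustrative. One observation, offered as an inference only: the harmonic mean of the reported avgpr and avgre is about 45.50, slightly above the reported avgf1 of 45.36, which would be consistent with F1 having been averaged per fold, though the slides do not say so.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) as percentages from raw counts."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

def average_over_folds(fold_counts):
    """Average each metric across folds; each fold is a (tp, fp, fn) triple."""
    per_fold = [precision_recall_f1(*counts) for counts in fold_counts]
    n = len(per_fold)
    return tuple(sum(m[k] for m in per_fold) / n for k in range(3))

# Made-up counts for five folds, for illustration only.
folds = [(40, 45, 50), (35, 40, 48), (42, 50, 47), (38, 39, 52), (41, 46, 49)]
avg_p, avg_r, avg_f1 = average_over_folds(folds)

# F1 computed from the averaged precision and recall generally differs from avg_f1.
f1_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)
print(avg_p, avg_r, avg_f1, f1_of_averages)
```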

PPI

Training data: BioCreative
  1000 hand-labeled sentences, 3964 tuples

Testing data (different domain): Bee collection

Performance (judged by Moushumi)
  Total number of tuples extracted as PPI instances: 92
  Precision: 63%

PPI Misclassification examples

Type 1: No interaction
  Sentence: Pretreatment of platelet suspension with phospholipase A2 from N. naja atra or A. mellifera venom (50 µg/ml) inhibited platelet aggregation induced by sodium arachidonate or collagen, but not induced by thrombin or ionophore A-23187.
  False: (collagen, thrombin, induced)
  True: relation between protein and platelet aggregation; no PPI

PPI Misclassification examples

Type 2: Incorrect interaction word
  Sentence: IgG antibody was able to inhibit binding of IgE antibody in the PLA radioallergosorbent test (RAST) from 10-40% at a molar excess of 10- to 1000-fold.
  False: (IgG antibody, IgE antibody, binding)
  True: (IgG antibody, IgE antibody, inhibit)

PPI Misclassification examples

Type 3: Incorrect protein involved
  Sentence: AChE exhibits a butyrylcholinesterase (BuChE) activity that represents about 14% of AChE activity.
  False: (AChE, AChE, exhibits)
  True: (AChE, BuChE, exhibits)

PPI

Possible Improvements
  Syntactic patterns: “Optimizing syntax patterns for discovering protein-protein interactions”, in Proc. ACM Symposium on Applied Computing (SAC), Bioinformatics Track
  Parse tree
  Dependency parsing
  …

The End