Finding Informative Sentences in Full-text Journal Articles



Page 1: Finding Informative Sentences in  Full-text Journal Articles

Zhiyong Lu*, William A. Baumgartner Jr., Gregory Caporaso, K. Bretonnel Cohen, Lawrence Hunter
Computational Bioscience Program
University of Colorado School of Medicine

[email protected]
http://compbio.uchsc.edu/Hunter_lab/Zhiyong

Page 2: Finding Informative Sentences in  Full-text Journal Articles

Introduction

•“Informative” sentences: those that make assertions about a gene’s function

•Examples:

–Positive: The in vivo interaction between CIPK23 and CBL1 or CBL9 was confirmed using BiFC assays as shown in Figure 6F. [PMID: 16814720]

–Negative: We do not yet know how these protein complexes activate or inhibit the kaiBC promoter. [PMID: 12441347]

Page 3: Finding Informative Sentences in  Full-text Journal Articles

Motivation

•Information Overload

–Double-exponential growth of peer-reviewed literature

–Breakdown of disciplinary boundaries

•Identifying informative sentences can:

–Provide a simple mechanism for aggregating gene function information

–Provide evidence sentences for database annotation

–Provide a basis for generating gene summaries

[Figure: Medline Growth. x-axis: Publication date (1986 to 2005). y-axes: New Entries (thousands), ticks 0 to 650; Total Entries (millions), ticks 6 to 17. Fitted curves: y ~ e^(0.0418x), R^2 = 0.99; y ~ e^(0.031x), R^2 = 0.95.]

[Hunter and Cohen, Mol Cell. Mar 2006]

Page 4: Finding Informative Sentences in  Full-text Journal Articles

Related Work

•Gene References Into Function (GeneRIFs) in the Entrez Gene database

•Two Problems

–Many Entrez genes have no GeneRIFs

–GeneRIFs were mostly pulled from abstracts rather than the body of the article

Page 5: Finding Informative Sentences in  Full-text Journal Articles

System and Method

I. HTML Parsing: Stripping off HTML tags

II. Document Zoning: Filtering certain sections, e.g. materials and methods

III. Sentence Selection: Scoring each sentence according to its:
1. keywords of interest [user specific]
2. location
3. mentions of gene/protein names
4. summary-indicative cue words
5. mentions of experimental methods
6. relation with figures/tables

Biomedical Full Text Articles

The in vivo interaction between CIPK23 and CBL1 or CBL9 was confirmed using BiFC assays as shown in Figure 6F. [PMID: 16814720]
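The three steps above can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the section names, cue-word and method lists, the gene-name regular expression, the "first or last sentence of a paragraph" location rule, and the equal feature weights are all assumptions made for the example, and every function name is hypothetical.

```python
import re
from html.parser import HTMLParser

# All lists and patterns below are illustrative assumptions.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "li", "title"}
SKIPPED_SECTIONS = {"materials and methods", "methods", "references"}
KEPT_SECTIONS = {"abstract", "introduction", "results", "discussion"}
CUE_WORDS = {"confirmed", "demonstrated", "revealed", "showed", "suggest", "indicate"}
METHOD_TERMS = ("bifc", "two-hybrid", "western blot", "immunoprecipitation")
GENE_RE = re.compile(r"\b[A-Z]{2,}[0-9]+\b")  # crude stand-in for a gene tagger
FIGURE_RE = re.compile(r"\b(?:Figure|Fig\.|Table)\s*\d+")
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")  # naive sentence splitter


class _TextExtractor(HTMLParser):
    """Step I: keep text, drop tags; block-level tags become line breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def zone_paragraphs(text):
    """Step II: drop paragraphs that fall under a skipped section heading."""
    kept, skipping = [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        heading = line.lower()
        if heading in SKIPPED_SECTIONS:
            skipping = True
        elif heading in KEPT_SECTIONS:
            skipping = False
        elif not skipping:
            kept.append(line)
    return kept


def score_sentence(sentence, index, total, keywords):
    """Step III: sum the six features with equal weights (an assumption)."""
    lower = sentence.lower()
    words = set(re.findall(r"[a-z0-9-]+", lower))
    features = [
        sum(k.lower() in lower for k in keywords),    # 1. keywords of interest
        1 if index in (0, total - 1) else 0,          # 2. location in paragraph
        len(GENE_RE.findall(sentence)),               # 3. gene/protein mentions
        len(words & CUE_WORDS),                       # 4. cue words
        sum(term in lower for term in METHOD_TERMS),  # 5. experimental methods
        1 if FIGURE_RE.search(sentence) else 0,       # 6. figure/table reference
    ]
    return float(sum(features))


def rank_sentences(html, keywords=()):
    """Run steps I to III and return (score, sentence) pairs, best first."""
    scored = []
    for paragraph in zone_paragraphs(strip_html(html)):
        sentences = SENT_SPLIT_RE.split(paragraph)
        for i, sentence in enumerate(sentences):
            scored.append((score_sentence(sentence, i, len(sentences), keywords), sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_sentences(article_html, keywords=("CIPK23",))` on an article's HTML returns its sentences ordered by score. A real system would likely replace the regular expressions with a proper sentence splitter and gene mention tagger, and tune the feature weights rather than treating them as equal.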

Page 6: Finding Informative Sentences in  Full-text Journal Articles

Two Applications

•Finding More GeneRIFs for Entrez Genes (Lu et al., Pac Symp Biocomput, 2006)

– 20% more accurate than other methods

– Predicted GeneRIFs for over 8,000 human genes

•Finding Sentences about Protein-Protein Interaction (BioCreative, 2006)

–An international competition with 11 participating teams

– Finding key sentences for IntAct and MINT database curators